2. MAIN CHALLENGES OF MACHINE LEARNING

Machine Learning involves two important things: an algorithm and data, and both are things that can go wrong. An insufficient quantity of training data is the first pitfall. For instance, if you show an apple to a child once, they will be able to recognise apples of all sizes and shapes. Machine Learning is not quite there yet: an algorithm typically needs to be trained on thousands of pictures of apples before it can recognise apples of all shapes and sizes.
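To make this concrete, here is a minimal sketch (assuming scikit-learn is installed; its built-in digits dataset stands in for the apple photos) that trains the same classifier on progressively larger slices of the training set and reports test accuracy, which typically improves as the model sees more examples:

# Sketch: how the amount of training data affects accuracy.
# Assumes scikit-learn; the digits dataset is a stand-in for "pictures of apples".
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

for n in (30, 300, len(X_train)):  # tiny, small and full training sets
    model = KNeighborsClassifier(n_neighbors=3).fit(X_train[:n], y_train[:n])
    print(f"trained on {n:4d} samples -> test accuracy {model.score(X_test, y_test):.2f}")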

At times the predictions produced by an ML algorithm do not quite match reality. This problem occurs when the training data is not representative of the cases you want to generalise to, and it holds true whether you use instance-based or model-based learning algorithms. Using representative data is trickier than it sounds: a small sample will introduce sampling noise (i.e., non-representative data as a result of chance), while even a very large sample can be non-representative if the sampling method is flawed. The latter is called sampling bias.
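The difference between the two problems can be shown with a small numpy sketch (numpy and the toy "heights" population are assumptions of this example): small random samples fluctuate by chance, while a flawed sampling method stays wrong no matter how large the sample is.

# Sketch: sampling noise vs. sampling bias on toy data.
import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=170, scale=10, size=1_000_000)  # e.g., heights in cm

# Sampling noise: small random samples scatter around the true mean just by chance.
small_means = [rng.choice(population, size=20).mean() for _ in range(5)]
print("true mean:", round(population.mean(), 1))
print("small-sample means:", [round(m, 1) for m in small_means])

# Sampling bias: a huge sample is still unrepresentative if the method is flawed,
# e.g., only measuring the tallest tenth of the population.
biased_sample = np.sort(population)[-100_000:]
print("biased (large) sample mean:", round(biased_sample.mean(), 1))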

Data scientists often spend a significant amount of their time cleaning the training data. If the training data is full of errors, outliers and noise (e.g., due to poor-quality measurements), it will be harder for the algorithm to detect the underlying patterns, and the resulting system will perform poorly.
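As an illustration (the column names and thresholds below are invented for this example), pandas makes it easy to drop missing values and discard obvious outliers:

# Sketch: basic cleaning of a toy housing table with pandas.
import pandas as pd

df = pd.DataFrame({
    "rooms": [3, 4, None, 5, 200],  # one missing value and one implausible outlier
    "price": [210_000, 250_000, 180_000, 320_000, 300_000],
})

df = df.dropna(subset=["rooms"])     # drop rows with missing measurements
df = df[df["rooms"].between(1, 20)]  # discard obvious outliers
print(df)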

There is a very popular saying in the field of data science: 'Garbage in, garbage out'. The system will only perform well, with a low error rate, if the training data contains enough relevant features and not too many irrelevant ones. A critical part of ML is coming up with a good set of features to train on. This process, called feature engineering, involves the following (a sketch follows the list below):

  • Feature selection: selecting the most useful features to train on among existing features.
  • Feature extraction: combining existing features to produce a more useful one (e.g., via dimensionality reduction).
  • Creating new features by gathering new data.
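
The sketch below shows one plausible way to do each of the three steps with scikit-learn (the wine dataset, the number of features kept and the derived feature are illustrative assumptions, not a recipe):

# Sketch: feature selection, feature extraction and a hand-made feature.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

X, y = load_wine(return_X_y=True)

# Feature selection: keep the 5 features most related to the target.
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Feature extraction: combine all features into 2 new components (dimensionality reduction).
X_extracted = PCA(n_components=2).fit_transform(X)

# Creating a new feature: here a toy derived column; in practice this often means gathering new data.
X_new = np.column_stack([X, X[:, 0] * X[:, 1]])

print(X.shape, X_selected.shape, X_extracted.shape, X_new.shape)
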
So far I have discussed the consequences of bad data; now let's discuss bad algorithms. Overgeneralising is something we humans do all the time, and machines can fall into the same trap if we are not careful. In Machine Learning this is called overfitting: the model performs well on the training data but does not generalise well. Complex models such as Deep Neural Networks can detect subtle patterns in the data, but if the training set is noisy or too small (which introduces sampling noise), the model is likely to detect patterns in the noise itself. These patterns obviously will not generalise to new instances.
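Here is a minimal sketch of overfitting on toy data (the quadratic ground truth, the noise level and the polynomial degree are arbitrary choices for illustration): the degree-15 model scores very well on the 20 training points but much worse on fresh test points.

# Sketch: a high-degree polynomial overfits a small, noisy training set.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, size=(20, 1))
y_train = 0.5 * X_train[:, 0] ** 2 + rng.normal(0, 1, size=20)   # quadratic truth + noise
X_test = rng.uniform(-3, 3, size=(100, 1))
y_test = 0.5 * X_test[:, 0] ** 2 + rng.normal(0, 1, size=100)

overfit = make_pipeline(PolynomialFeatures(degree=15), LinearRegression()).fit(X_train, y_train)
print("train R^2:", round(overfit.score(X_train, y_train), 2),
      "| test R^2:", round(overfit.score(X_test, y_test), 2))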

Overfitting happens when the model is too complex relative to the amount and noisiness of the training data. Possible solutions are to select a model with fewer parameters (e.g., a linear model rather than a high-degree polynomial model), to reduce the number of attributes in the training data, or to constrain the model. You can also gather more training data, or reduce the noise in the training data manually. Constraining a model to make it simpler and reduce the risk of overfitting is called regularisation, and the amount of regularisation to apply during learning can be controlled by a hyper-parameter.
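Continuing the toy example above (its setup is repeated so the snippet runs on its own), constraining the same degree-15 model with a Ridge penalty is one way to regularise it; the alpha value is chosen arbitrarily here, and the training score drops a little while the test score usually improves:

# Sketch: the same polynomial model, constrained (regularised) with a Ridge penalty.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
X_train = rng.uniform(-3, 3, size=(20, 1))
y_train = 0.5 * X_train[:, 0] ** 2 + rng.normal(0, 1, size=20)
X_test = rng.uniform(-3, 3, size=(100, 1))
y_test = 0.5 * X_test[:, 0] ** 2 + rng.normal(0, 1, size=100)

regularised = make_pipeline(
    PolynomialFeatures(degree=15),
    StandardScaler(),
    Ridge(alpha=10.0),   # larger alpha -> stronger constraint -> simpler model
).fit(X_train, y_train)
print("train R^2:", round(regularised.score(X_train, y_train), 2),
      "| test R^2:", round(regularised.score(X_test, y_test), 2))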

A hyper-parameter is a parameter of the learning algorithm, not of the model. If you set the regularisation hyper-parameter to a very large value, you will end up with an almost flat model (a slope close to zero); the algorithm will almost certainly not overfit the training data, but it will not generalise well either, and the predictions will make little sense. Tuning the hyper-parameters is an important part of building a reliable Machine Learning system.
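One common (though not the only) way to tune such a hyper-parameter is a cross-validated grid search; the sketch below assumes scikit-learn's diabetes dataset and an arbitrary list of candidate alpha values:

# Sketch: tuning the regularisation hyper-parameter with cross-validated grid search.
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

X, y = load_diabetes(return_X_y=True)

search = GridSearchCV(
    estimator=Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},  # candidate hyper-parameter values
    cv=5,                                                 # 5-fold cross-validation
)
search.fit(X, y)
print("best alpha:", search.best_params_["alpha"],
      "| cross-validated R^2:", round(search.best_score_, 2))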

Underfitting is the exact opposite of overfitting. It occurs when the model is too simple to learn the underlying structure of the data. To overcome this challenge, you can select a more powerful model with more parameters, feed better features to the learning algorithm (feature engineering), or reduce the constraints on the model (e.g., by reducing the regularisation hyper-parameter).
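A short sketch of underfitting on toy quadratic data (the data and the degrees are assumptions): a plain linear model is too simple, while adding polynomial features gives the model enough capacity.

# Sketch: a linear model underfits quadratic data; a quadratic model does not.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(200, 1))
y = 0.5 * X[:, 0] ** 2 + rng.normal(0, 0.2, size=200)

too_simple = LinearRegression().fit(X, y)
more_power = make_pipeline(PolynomialFeatures(degree=2), LinearRegression()).fit(X, y)
print("linear R^2:", round(too_simple.score(X, y), 2),
      "| quadratic R^2:", round(more_power.score(X, y), 2))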

To sum up with an exercise: suppose you have written an ML program, and after training it you notice that your model performs well on the training data but generalises poorly to new instances. What is happening? Can you name three possible solutions?
