5. kNN REGRESSION

Hello world!

This blog is about using the kNN algorithm for regression. Please refer to my earlier blogs for an introduction to kNN.
Supervised machine learning is one of the most commonly used and successful types of machine learning. In the previous blog, we used supervised learning to classify iris flowers into several species using physical measurements of the flower.

Supervised learning is used when we want to predict a certain outcome from a given input, and we have examples of input/output pairs. We build a machine learning model from these input/output pairs, which comprise our training set. Our goal is to make accurate predictions for new, never-before-seen data. Supervised learning often requires human effort to build the training set, but it then automates and speeds up an otherwise laborious or infeasible task.



There are two major types of supervised machine learning problems, called classification and regression. In classification, the goal is to predict a class label, which is a choice from a predefined list of possibilities. Classification is sometimes separated into two types: binary classification and multi-class classification. Binary classification is similar to answering a yes-or-no question (example: spam e-mail or not spam e-mail), while multi-class classification has more than two possible answers (example: the iris classification in blogs 3 and 4). For a regression task, the goal is to predict a floating-point number in programming terms, or a real number in mathematical terms. An easy way to distinguish between classification and regression problems is to ask whether there is some kind of continuity in the output. If there is continuity between possible outcomes, the problem is a regression problem.

The more complex we make our model, the better we will be able to predict on the training data. However, if our model becomes too complex, we start focusing too much on each individual data point in our training set, and the model will not generalize well to new data. There is always a sweet spot in between that generalizes best. We can find it by running the model in a loop with different parameter values and plotting the results.

For the regression variant of the k-nearest neighbors algorithm, let's start by using a single nearest neighbor. I have added 3 test data points as green stars on the x-axis. The prediction using a single neighbor is just the target value of the nearest neighbor. These predictions are shown as blue stars in the following figure.

We can also use more than a single nearest neighbor, which often gives more robust predictions. When using multiple nearest neighbors, the prediction is the average, or mean, of the target values of the relevant neighbors. The following plot shows k = 3.
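The two figures referred to above can be reproduced with the plotting helpers in the mglearn library; a minimal sketch, assuming mglearn and matplotlib are installed:

```python
import matplotlib.pyplot as plt
import mglearn

# Prediction made by a single nearest neighbor
# (green stars: test points, blue stars: predictions)
mglearn.plots.plot_knn_regression(n_neighbors=1)
plt.show()

# Prediction made by averaging the three nearest neighbors
mglearn.plots.plot_knn_regression(n_neighbors=3)
plt.show()
```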

The k-nearest neighbors algorithm for regression is implemented in the KNeighborsRegressor class in scikit-learn. Its use is similar to that of KNeighborsClassifier.

Import the following libraries:
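A minimal set of imports for this example, assuming matplotlib, numpy, scikit-learn, and mglearn are installed:

```python
import numpy as np
import matplotlib.pyplot as plt
import mglearn
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
```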


We need pyplot to plot the graphs, and the dataset comes from the mglearn library.

Type in the following code snippet to build the model and run it in a loop:
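Below is a sketch of what such a loop could look like, assuming mglearn's one-dimensional wave dataset (make_wave) and a standard train/test split; the dataset size and range of k values are illustrative choices:

```python
# Generate the one-dimensional wave dataset and split it into training and test sets
X, y = mglearn.datasets.make_wave(n_samples=40)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

neighbors_range = range(1, 11)
test_scores = []

for k in neighbors_range:
    # Build a kNN regression model with k neighbors and fit it on the training data
    reg = KNeighborsRegressor(n_neighbors=k)
    reg.fit(X_train, y_train)
    # score() returns the R^2 of the predictions on the test set
    test_scores.append(reg.score(X_test, y_test))

# Plot the test score against the number of neighbors
plt.plot(list(neighbors_range), test_scores, marker='o')
plt.xlabel("n_neighbors (k)")
plt.ylabel("test set R^2 score")
plt.show()
```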

The code snippet splits the dataset and fits the model in a loop to determine the optimum number of neighbors k. The final plt.show() displays the plot that helps us pick the best value of k. Following is a screenshot of the plot:
From the above plot, we can see that increasing k beyond a certain point causes the model's predictions to deteriorate. According to the plot, k = 3 gives the best model, with a test score of around 82%.

For our one-dimensional dataset, we can see what the predictions look like for all possible feature values. To do this, we create a test dataset consisting of many points along the x-axis, which corresponds to the single feature.
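A sketch of this, building on the data and model above; the range of points and the neighbor counts shown (1, 3, and 9) are assumptions for illustration:

```python
# Create 1,000 evenly spaced points covering the range of the single feature
line = np.linspace(-3, 3, 1000).reshape(-1, 1)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
for n_neighbors, ax in zip([1, 3, 9], axes):
    # Fit a model with the given number of neighbors and predict along the line
    reg = KNeighborsRegressor(n_neighbors=n_neighbors)
    reg.fit(X_train, y_train)
    ax.plot(line, reg.predict(line), label="model predictions")
    ax.plot(X_train, y_train, '^', markersize=8, label="training data/target")
    ax.plot(X_test, y_test, 'v', markersize=8, label="test data/target")
    ax.set_title("{} neighbor(s)".format(n_neighbors))
    ax.set_xlabel("Feature")
    ax.set_ylabel("Target")
axes[0].legend()
plt.show()
```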



We can infer from the above plot that when using only a single neighbor, each point in the training set has a significant influence on the prediction: the predicted values pass through all of the data points in the training set. This leads to very unsteady predictions. Considering more neighbors leads to smoother predictions, but these do not fit the training data as well.

In conclusion, there are two important parameters of kNN: the number of neighbors and how you measure the distance between data points. In practice, using a small number of neighbors works well most of the time, but you should always run the model in a loop to determine the best value of k.
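For example, the distance measure can be changed through the metric parameter of KNeighborsRegressor (Euclidean distance is the default); the choice below is only illustrative:

```python
# Illustrative only: use Manhattan distance instead of the default Euclidean distance
reg = KNeighborsRegressor(n_neighbors=3, metric='manhattan')
reg.fit(X_train, y_train)
print("Test set R^2: {:.2f}".format(reg.score(X_test, y_test)))
```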



The kNN model is very easy to understand and build, and it often gives reasonable performance without a lot of adjustment. A major drawback is that prediction can be very slow when the dataset is very large (in number of features or in sample size). The approach often does not perform well on datasets with many features (hundreds or more), and it is particularly poor on sparse datasets, where most features are 0 most of the time. As a result, this algorithm is not used very often in industry because of these drawbacks: it is slow at prediction and handles many features poorly.



Ever wondered about investing in the stock market? Well, here is a challenge: build a stock price prediction model using kNN. You might be surprised by the results, and if you can build it, you already have your reward.
