4. DEEPER INTO kNN (CLASSIFICATION USING kNN)

Hello world!
In this blog post I am going to discuss kNN in detail, build a model, and optimize its prediction accuracy. In my last blog post (Blog: 3), I created a model to predict the species of iris flowers with 97% accuracy. Here I will dive deeper into that model and tune it for better prediction accuracy.
k-nearest neighbors (kNN) is a simple algorithm that stores all available cases and classifies new data based on a similarity measure. For example, if you are similar to your neighbors, you are probably one of them. If an apple is more similar to an orange, a melon, or a banana than to a monkey or a chimpanzee, then the apple most likely belongs in the group of fruits.

In general, kNN is used in search applications where you are looking for similar items. You might be wondering: what is the k in kNN? Well, k denotes the number of nearest neighbors that vote on the class of the new (testing) data point. For example, if k = 3, the labels of the three closest training points are checked and the most common label is assigned to the testing data. Typically, Euclidean distance is used to determine which neighbors are closest.

The Euclidean distance between two points p = (p1, p2, ..., pn) and q = (q1, q2, ..., qn) is measured by:

d(p, q) = sqrt((p1 - q1)^2 + (p2 - q2)^2 + ... + (pn - qn)^2)
Consider the following example: you have the details of Monster, Activ, Pepsi and Vodka, and you want to determine which group Maaza belongs to.


Drink     Sweetness   Fizz   Type     Distance from Maaza
Monster   8           8      Energy   6
Activ     9           1      Health   1.4
Pepsi     4           8      Cold     7.2
Vodka     2           1      Hard     6.08
Maaza     8           2      ?        -


You can use the Euclidean distance formula to compute the distance of Maaza from all the other drinks. For example, the distance from Monster is sqrt((8 - 8)^2 + (8 - 2)^2) = 6. As you can see, Maaza is closest to Activ (distance = 1.4), so you can infer that Maaza is a health drink.
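If you want to verify the numbers in the table yourself, here is a quick sketch in Python (using the sweetness and fizz columns as the two features):

import math

# (sweetness, fizz) for each known drink
drinks = {'Monster': (8, 8), 'Activ': (9, 1), 'Pepsi': (4, 8), 'Vodka': (2, 1)}
maaza = (8, 2)

for name, (sweetness, fizz) in drinks.items():
    # Euclidean distance between this drink and Maaza
    d = math.sqrt((sweetness - maaza[0]) ** 2 + (fizz - maaza[1]) ** 2)
    print(f'{name}: {d:.2f}')
# Activ has the smallest distance (about 1.41, i.e. 1.4 after rounding),
# so Maaza is grouped with the health drinks.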

The kNN algorithm is arguably the simplest machine learning algorithm. Building the model consists only of storing the training dataset. To make a prediction for a new data point, the algorithm finds the closest data point in the training set - its nearest neighbor (or, for k > 1, its k nearest neighbors).
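In code, the whole "model" is little more than the stored data plus a distance computation. Here is a toy sketch of 1-nearest-neighbor prediction with NumPy, just to illustrate the idea (later we will use scikit-learn's implementation instead):

import numpy as np

def predict_1nn(X_train, y_train, x_new):
    # distance from the new point to every stored training point
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    # the prediction is simply the label of the closest training point
    return y_train[np.argmin(distances)]

# the drinks example from above: (sweetness, fizz) -> type
X_train = np.array([[8, 8], [9, 1], [4, 8], [2, 1]])
y_train = np.array(['Energy', 'Health', 'Cold', 'Hard'])
print(predict_1nn(X_train, y_train, np.array([8, 2])))  # -> 'Health'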



The biggest use of kNN is in recommender systems. It works like the person behind the till at a shop: when you ask for a product, they show you the product along with similar products that might interest you. Amazon uses a kNN-style recommender system as a targeted marketing strategy, which has been reported to generate around 35% of its revenue, and Netflix uses similar techniques to suggest videos. kNN is also used in concept search, that is, searching for semantically similar documents and classifying documents containing similar topics.

For this blog, we will use the iris dataset from scikit-learn. To learn more about the dataset, please read my previous blog (chapter 3. Introduction to ML with python using kNN).

To import the dataset, type the following commands:

from sklearn.datasets import load_iris
iris_dataset = load_iris()

Next, we will import the necessary functions, create a DataFrame df and visualize the data in a scatter plot:
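A minimal version of that code uses pandas and matplotlib; here I plot sepal length against sepal width, colored by species, but any pair of features will do:

import pandas as pd
import matplotlib.pyplot as plt

# put the feature values into a DataFrame with readable column names
df = pd.DataFrame(iris_dataset['data'], columns=iris_dataset['feature_names'])
df['target'] = iris_dataset['target']

# scatter plot of two of the four features, colored by species
plt.scatter(df['sepal length (cm)'], df['sepal width (cm)'], c=df['target'])
plt.xlabel('sepal length (cm)')
plt.ylabel('sepal width (cm)')
plt.show()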


The above code should show you the following scatter plot:


From the above scatter plot we can see that setosa is quite well separated from versicolor and virginica, but versicolor and virginica overlap with each other. Because of that overlap, we might have a certain degree of inaccuracy when predicting those two classes.

Well, one of the best ways to inspect data is to visualize it with scatter plots. A scatter plot puts one feature along the x-axis and another along the y-axis, and draws a dot for each data point. Unfortunately, our computer screens are two-dimensional (2D), which allows us to plot only two (or maybe three) features at a time, so it is difficult to plot datasets with more than three features this way. One way around this problem is a pair plot, which looks at all possible pairs of features. If you have a small number of features, such as the four we have here, this is quite reasonable. You should keep in mind, however, that a pair plot does not show the interaction of all the features at once.

Type the following code to generate a pair plot:
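Here is a minimal version using pandas' scatter_matrix together with mglearn for the color map; I also split the data into training and test sets first, since we will need that split later (the exact plotting options are my choice):

import pandas as pd
import mglearn
from sklearn.model_selection import train_test_split

# split the data; random_state=0 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# build a DataFrame from the training data and plot every pair of features
iris_dataframe = pd.DataFrame(X_train, columns=iris_dataset['feature_names'])
pd.plotting.scatter_matrix(iris_dataframe, c=y_train, figsize=(15, 15),
                           marker='o', hist_kwds={'bins': 20}, s=60,
                           alpha=.8, cmap=mglearn.cm3)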


In case you don't have mglearn installed, run pip install mglearn in your terminal to install the package. The above code will generate the following graph:


From the above plots, we can see that the three classes seem to be well separated across the pairs of features. This means that a machine learning model will likely be able to learn to separate them - something we could not infer from the simple two-dimensional scatter plot alone.

In the last blog, I described how to create a simple kNN model. In this blog, we will create the same model, but we will vary the value of k in a loop to see how it affects the prediction accuracy, and use that to optimize the model. The training set contains 112 samples, so k can be at most 112. The code below iterates k from 1 to 10, but you can change the upper bound of the range to 113 to try all 112 possible values. This way, we will be able to see how the value of k affects the accuracy.
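A sketch of that loop using scikit-learn's KNeighborsClassifier might look like this (I reuse the train/test split from above, with random_state=0 assumed for reproducibility):

import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

k_values = range(1, 11)   # change 11 to 113 to try every possible k
training_accuracy = []
test_accuracy = []

for k in k_values:
    # build and fit a kNN model with k neighbors
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    # record accuracy on the training set and on the test set
    training_accuracy.append(knn.score(X_train, y_train))
    test_accuracy.append(knn.score(X_test, y_test))

plt.plot(k_values, training_accuracy, label='training accuracy')
plt.plot(k_values, test_accuracy, label='test accuracy')
plt.xlabel('k (n_neighbors)')
plt.ylabel('accuracy')
plt.legend()
plt.show()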



Now, below is a figure of the training and test accuracy for k values from 1 to 112:

We can observe that the training accuracy decreases with increasing value of k: considering fewer neighbors corresponds to a more complex model, so when k is 1 the prediction on the training set is perfect, but as more neighbors are considered, the training accuracy keeps dropping. On the test set, the accuracy for a single neighbor is lower than when using more neighbors, indicating that k = 1 leads to a model that is too complex (over-fitting), while a very large k leads to a model that is too simple (under-fitting). The best performance is around k = 7; at this value we get a test accuracy of 100%.


While using kNN, we should keep a few things in mind: try to use an odd value of k when the number of classes is even, and avoid values of k that are multiples of the number of classes, both to reduce the chance of ties in the vote. One of the major drawbacks of kNN is the cost of searching for the nearest neighbors, so for very large datasets kNN is not a very good solution, although it can still produce good results.
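If you do need kNN on a larger dataset, one practical option in scikit-learn is to let the classifier build a tree-based index for the neighbor search instead of brute-force searching, for example:

from sklearn.neighbors import KNeighborsClassifier

# 'kd_tree' builds a k-d tree index to speed up the neighbor search;
# the default algorithm='auto' lets scikit-learn choose for you
knn = KNeighborsClassifier(n_neighbors=7, algorithm='kd_tree')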
The next blog will be about kNN regression. See you then!
