3. INTRODUCTION TO ML WITH PYTHON USING kNN



Machine learning is about extracting knowledge from data. In the early days of "intelligent" applications, many systems used hand-coded if-else rules to process data or adjust to user input. Quite possibly the most important part of machine learning is understanding the data you are working with and how it relates to the problem you want to solve. No machine learning algorithm can make predictions on data for which it has no information. For instance, no machine learning algorithm can tell you the gender of a person based only on their last name, because the data simply does not contain that information. You might get lucky if you have the person's first name, as first names often carry a strong gender signal. It is therefore necessary to understand your data before you begin building a model. Every ML algorithm is different and suited to a different purpose. It is good practice to keep a few questions in mind before building a model:

  • What is the question that you are trying to answer? Do you think the data collected can answer that question?
  • What is the best way to phrase your question as a machine learning problem?
  • Have you collected enough data to represent the problem you want to solve?
  • What features of the data did you extract, and will these enable the right prediction?
  • How will you measure success of the model?
  • How will the ML solution interact with other parts of your research or business products?
When going deep into the technical aspects of ML, it is easy to lose sight of the ultimate goal. We won't discuss all of these questions in detail in this blog post, but I would still encourage you to keep in mind all the assumptions you might be making.

The libraries that we will be using for this model are NumPy, SciPy, matplotlib, IPython, scikit-learn and pandas. We will be using Jupyter Notebook for building the model. You can easily download Jupyter Notebook from the Anaconda website: just download the whole package and Jupyter Notebook should appear in Anaconda Navigator. If you have Python installed but don't have the packages, you can install them with a single command in the terminal using pip:

$ pip install numpy scipy matplotlib ipython scikit-learn pandas pillow

Once you have installed all the packages, you can check the versions installed; this also confirms that the packages were installed successfully. Just type the code below in a Jupyter Notebook:
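A minimal sketch of such a version check (it only assumes the packages listed above are importable):

```python
import sys

import numpy as np
import scipy as sp
import matplotlib
import IPython
import sklearn
import pandas as pd

# Print the version of each package to confirm the installation
print("Python version:", sys.version)
print("NumPy version:", np.__version__)
print("SciPy version:", sp.__version__)
print("matplotlib version:", matplotlib.__version__)
print("IPython version:", IPython.__version__)
print("scikit-learn version:", sklearn.__version__)
print("pandas version:", pd.__version__)
```

The exact version numbers you see will depend on when you installed the packages; any recent versions will work for this post.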


A First Application: Classifying Iris Species

In this section, we will build a simple machine learning application and build our first model. 
Let us assume your crush is a hobby botanist who is interested in distinguishing the species of some iris flowers that she has found. She has collected some measurements associated with each iris: the length and width of the petals and the length and width of the sepals, all measured in centimetres. She also has the measurements of some irises that were previously identified by an expert botanist as belonging to the species setosa, versicolor or virginica. For these measurements, she is certain which species each iris belongs to. Let's assume that your crush will encounter only these three species in the wild.

Now, to impress her, you try to figure out a way to solve this problem. Eureka: you can use machine learning, as she has data about the petals and sepals. Because you have measurements labelled with the correct species, this is supervised learning. And since the prediction will be one of three species, it is a classification problem.

The data we will be using comes included with scikit-learn. We can load it by calling the load_iris function and assigning the result to a variable:
from sklearn.datasets import load_iris
iris_dataset = load_iris()

To see the keys of the data set, we can use the following commands:
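One way to do that (reusing the iris_dataset variable loaded above):

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
# The dataset object behaves like a dictionary; list its keys
print("Keys of iris_dataset:", list(iris_dataset.keys()))
```

The keys include 'data', 'target', 'target_names', 'DESCR' and 'feature_names'.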
To see the data, type in the following command:
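For example, a quick look at the shape of the feature array and its first few rows:

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
# 'data' holds one row per flower and one column per measurement
print("Shape of data:", iris_dataset['data'].shape)
print("First five rows:\n", iris_dataset['data'][:5])
```

Each of the 150 rows corresponds to one flower, and the four columns are the sepal length, sepal width, petal length and petal width in centimetres.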
Let us look at the target data, the target names and the number of observations in the dataset:
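Something along these lines should work:

```python
from sklearn.datasets import load_iris

iris_dataset = load_iris()
# 'target' encodes the species of each flower as 0, 1 or 2
print("Target names:", iris_dataset['target_names'])
print("Target:", iris_dataset['target'])
print("Number of observations:", len(iris_dataset['target']))
```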
As you can observe, the target array contains 150 entries, each labelled 0 (setosa), 1 (versicolor) or 2 (virginica).

We want to build a model that will be able to predict the species of iris for new data. To do this, we first split the dataset into a training set and a test set. A common rule of thumb is to take 75% of the data for training and 25% for testing. In scikit-learn, the data is denoted by X and the labels by y. There is a predefined function in scikit-learn, train_test_split, that shuffles the dataset and splits it into two parts:
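A minimal sketch of the split (passing random_state=0 fixes the shuffle so the result is reproducible):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris_dataset = load_iris()
# By default, train_test_split keeps 75% of the rows for training
# and 25% for testing
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)
print("X_train shape:", X_train.shape)  # (112, 4)
print("X_test shape:", X_test.shape)    # (38, 4)
```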

Time to build our first kNN model and test it on the dataset:
In the first line, we import the train_test_split function from scikit-learn, after which we assign the X_train, X_test, y_train and y_test variables to the split training and test datasets. To make sure we get the same result every time we run this model, we fix the seed of the pseudorandom number generator by passing random_state=0.
Since we will be using a kNN model, we import KNeighborsClassifier from scikit-learn and train the model. According to the test results, we have achieved an accuracy of 97%, which is not bad at all.
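Putting the whole pipeline together, here is a sketch of the model described above (n_neighbors=1 is an assumption on my part; with this split it yields roughly the 97% accuracy mentioned):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

# Train a kNN classifier; n_neighbors=1 classifies each new flower
# by the single closest training example
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

# Accuracy on the held-out test set
print("Test set score: {:.2f}".format(knn.score(X_test, y_test)))
```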

Please comment if you were able to run the code, or if you are in any doubt. In the next blog post we will dive deeper into the kNN algorithm to see how it actually works and what its parameters are. See you soon.




