Saturday, February 3, 2018

K Nearest Neighbour

K Nearest Neighbor (KNN) is considered one of the simplest classification algorithms, and it can be used effectively to solve numerous problems. KNN can also be used for regression. It falls under the category of supervised learning.

Let us understand this with an example:-

We are provided with the list of runs scored by Rohit Sharma in home as well as away games. This is represented in a graphical way.

[Figure: scores plotted on a graph, with squares for away-game runs, triangles for home-game runs, and a star for the new score to classify.]
The squares here depict the runs scored by Sharma in away games, while the triangles depict the runs scored by him in home games. Now we are given a new score (the star), and we need to predict whether this score belongs to a home game or an away game. To do that, we take the nearest scores (neighbors) as reference and ask them to vote. Just by looking here, we can say that this star belongs to the away game.
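To make the voting idea concrete, here is a minimal from-scratch sketch in plain Python. The function name knn_predict and the sample scores are hypothetical, chosen only for illustration (the scikit-learn version appears later in this post).

from collections import Counter

def knn_predict(points, labels, query, k=3):
    # Sort the training points by their distance to the query score.
    by_distance = sorted(zip(points, labels), key=lambda pl: abs(pl[0] - query))
    # Collect the labels of the k closest neighbors.
    votes = [label for _, label in by_distance[:k]]
    # The most common label among the neighbors wins the vote.
    return Counter(votes).most_common(1)[0][0]

# Hypothetical scores for away and home games.
scores = [35, 40, 42, 80, 85, 90]
games = ['away', 'away', 'away', 'home', 'home', 'home']

print(knn_predict(scores, games, query=44))
# -> 'away', because the 3 nearest scores (42, 40, 35) are all away-game scores.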

K primarily defines the number of members who can vote on a new member and decide whether it belongs to their group or not.

Example:-

Suppose we have to find the category of a cup of tea: whether it is normal tea or ginger tea.

We take a sample and categorize it based upon its taste.

With K = 5, 3 neighbors voted for normal tea and 2 for ginger tea.

Hence, the sample belongs to the normal one.

So, you can say that it is a complete democratic set up.
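The vote itself is just a majority count. Here is a tiny sketch of that tally (the taste labels are hypothetical, standing in for the 5 nearest samples):

from collections import Counter

# Labels of the 5 nearest samples by taste: 3 normal, 2 ginger.
votes = ['normal', 'normal', 'ginger', 'normal', 'ginger']
print(Counter(votes).most_common(1)[0][0])
# -> 'normal', winning the vote 3 to 2.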

What will happen if we choose a value of K smaller or larger than a certain value?

Choosing the optimal value for K is best done by first inspecting the data. In general, a larger K value is more precise as it reduces the overall noise, but there is no guarantee. Cross-validation is another way to retrospectively determine a good K value, by using an independent dataset to validate each candidate K. Historically, the optimal K for most datasets has been between 3 and 10. So, the selection of K should be made with utmost care.
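As a rough sketch of such a cross-validated search with scikit-learn, reusing this post's data (the candidate K values and the 2-fold split are assumptions made to suit this tiny dataset):

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

a = [[1],[2],[3],[4],[8],[30],[27],[34],[65],[43]]
b = [0,0,0,0,0,0,1,1,1,1]

# Try a few candidate values of K and compare their mean accuracy.
for k in [1, 3, 5]:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, a, b, cv=2)  # 2 folds, since we only have 10 points
    print(k, scores.mean())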

Creating a simple KNN classifier using scikit-learn :-

Consider that we have 10 data points with their respective labels.

a = [[1],[2],[3],[4],[8],[30],[27],[34],[65],[43]]

b = [0,0,0,0,0,0,1,1,1,1]

Here a defines the data points, while b signifies their respective labels.

We are given a few other data points, and we need to predict their labels.

So, let's plunge into some dirty coding stuff.

Code in Python:-

from sklearn.neighbors import KNeighborsClassifier
# This imports the KNN classifier from the sklearn Python package.

a = [[1],[2],[3],[4],[8],[30],[27],[34],[65],[43]]
# This is the data which is provided.

b = [0,0,0,0,0,0,1,1,1,1]
# The labels for the dataset.

knn = KNeighborsClassifier(n_neighbors=3)
# Define the number of neighbors used for voting: 3.

knn.fit(a, b)
# The data and labels are fit into the algorithm.

print(knn.predict([[22]]))
# The value we need to predict, passed as a 2-D array as scikit-learn expects.

Output :- [1]
It means the data point 22 belongs to label 1: its three nearest neighbors are 27, 30 and 34, with labels 1, 0 and 1, so label 1 wins the vote 2 to 1.
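The classifier can also predict several points in one call; for example (query values chosen arbitrarily):

print(knn.predict([[2],[50]]))
# -> [0 1]: 2 sits among the small label-0 points, while the
# 3 nearest neighbors of 50 (43, 65, 34) all carry label 1.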

You can download the code from my GitHub :-

https://github.com/sangam92/Machine-Learning-tutorials

Further Reading :-

http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

