K Nearest Neighbor (KNN) is considered one of the simplest classification algorithms, and it can be used effectively to solve a wide range of problems. KNN can also be used for regression. It falls under the category of supervised learning.
Let us understand this with an example:-
We are provided with a list of runs scored by Rohit Sharma in home games as well as away games, represented graphically.
The square here depicts the runs scored by Sharma in away games, while the triangle depicts
the runs scored by him in home games. Now we are given a new score, marked with a star, and we need to
predict whether this score belongs to a home game or an away game. To do this, we take the nearest
scores (neighbors) and ask them to vote. Just by looking at the plot, we can say that this
star belongs to an away game.
K primarily defines the number of members who can vote on the new member and decide which
group it belongs to.
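To make the voting idea concrete, here is a minimal sketch of it in plain Python. The scores and labels below are made up for illustration; they are not Sharma's actual runs.
from collections import Counter
# Hypothetical (score, label) pairs for the illustration.
scores = [(12, 'home'), (18, 'home'), (25, 'home'), (55, 'away'), (60, 'away'), (72, 'away')]
def knn_vote(new_score, data, k=3):
    # Sort the known scores by distance to the new score and keep the k nearest.
    nearest = sorted(data, key=lambda p: abs(p[0] - new_score))[:k]
    # The k nearest neighbors vote; the majority label wins.
    return Counter(label for _, label in nearest).most_common(1)[0][0]
print(knn_vote(58, scores))
# Output:- away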
Example:-
Suppose we have to find the category of a cup of tea: whether it is normal tea or ginger tea.
We take a sample and ask five tasters to categorize it based upon its taste.
With K = 5, 3 vote for normal tea and 2 for ginger tea.
Hence, the sample belongs to the normal category.
So you can say that it is a completely democratic setup.
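The majority vote itself is just a count of the neighbors' labels; a couple of lines with Python's Counter show the idea (the votes below are the hypothetical ones from the tea example):
from collections import Counter
# The K = 5 hypothetical votes from the tea example.
votes = ['normal', 'normal', 'ginger', 'normal', 'ginger']
# The most common label among the voters wins.
print(Counter(votes).most_common(1)[0][0])
# Output:- normal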
What happens if we choose a value of K that is smaller or larger than some optimal value?
Choosing the optimal value for K is best done by first inspecting the data. In general, a larger K
reduces the effect of noise, but there is no guarantee that it improves accuracy. Cross-validation is another
way to determine a good K, by using held-out data to validate each candidate value.
Historically, the optimal K for most datasets has been between 3 and 10. So, the selection of K
should be made with utmost care.
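As a rough sketch of the cross-validation approach, the snippet below scores each candidate K with scikit-learn's cross_val_score and keeps the best one. The Iris dataset and the 3-10 range are illustrative choices, not part of the original example.
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
X, y = load_iris(return_X_y=True)
# Try each K from 3 to 10 and keep the one with the best mean 5-fold accuracy.
best_k, best_score = None, 0.0
for k in range(3, 11):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    if score > best_score:
        best_k, best_score = k, score
print(best_k, round(best_score, 3))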
Creating a simple KNN classifier using scikit-learn:-
Suppose we have 10 data points with their respective labels.
a = [[1],[2],[3],[4],[8],[30],[27],[34],[65],[43]]
b = [0,0,0,0,0,0,1,1,1,1]
Here a holds the data points, while b holds their respective labels.
We are given a few other data points and we need to predict their labels.
So, let's plunge into the code.
Code in Python:-
from sklearn.neighbors import KNeighborsClassifier
# Import the KNN classifier from the scikit-learn package.
a = [[1],[2],[3],[4],[8],[30],[27],[34],[65],[43]]
# The data points we are provided with.
b = [0,0,0,0,0,0,1,1,1,1]
# The labels for the data points.
knn = KNeighborsClassifier(n_neighbors=3)
# Set the number of neighbors who vote to 3.
knn.fit(a, b)
# Fit the data and labels to the algorithm.
print(knn.predict([[22]]))
# Predict the label for the new point 22; predict expects a 2D array, so 22 is wrapped as [[22]].
Output:- [1]
This means that the data point 22 belongs to label 1.
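Note that predict takes one row per sample, so the same call can label several points at once. The points below are just illustrative:
print(knn.predict([[5], [40]]))
# Output:- [0 1]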
You can download the code from my GitHub:-
https://github.com/sangam92/Machine-Learning-tutorials
Further Reading:-
http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html