In this blog, I will introduce a simple and yet powerful machine learning algorithm that is widely used. I will explain where it can be applied and can’t, based on the business objective at the hand.
Table of contents
- What is KNN?
- Decision boundaries.
- How to choose k in KNN?
- Pros and cons.
- Bias and Variance trade-off.
K-NN is a simple and intuitive classifier which is a baseline for more complex models like neural networks and SVM. It can be used for both classification and regression problems. It is mostly used for classification purposes in the industry.
What is KNN?
Let’s start by introducing some mathematical notations and definitions. We use X to denote independent variable(aka. feature, predictor) and Y to the dependent variable(aka. target, class) which we are trying to predict.
KNN is a supervised algorithm where we are given a set of labeled data (x, y). Our goal is to capture the relation between x and y by fitting a function H: X–> Y so that we are confident in predicting whenever we get an unseen data point. We should be able to predict with good accuracy.
KNN is a non-parametric algorithm that makes no assumptions about underlying data like other supervised algorithms. It is an instance-based algorithm that generates results using instances. Concretely, it means that it only uses training data for prediction.
KNN is a non-parametric and instance-based algorithm
KNN is said to be a lazy algorithm as initially it only loads the data into memory and starts learning in the testing phase whenever a query point is given. It computes the euclidean distance between the query point and k number of neighbors. Then, takes the majority vote among the obtained instances and predicts the label of the query point. We can use other distance measures like Manhattan, Hamming distance but euclidean is mostly used.
KNN needs minimal training but computationally expensive testing.
Steps to implement KNN
- Load the data into memory.
- Initialize the value of k.
- Fetch a query point and iterate from through all data points
- Compute the Euclidean distance between the query point and all other points.
- Sort the computed distances in ascending order in an array.
- Fetch the top k rows from the array.
- Apply the majority vote and get the most frequent class.
- Return the predicted class.
From above we see that for low values of k, the decision boundary is jagged and it is overfitting. For a decent value of k, it is a smooth curve.
How to choose k?
Generally, we initialize an array of odd numbers for k value. We choose odd numbers to avoid a tie in majority voting. We get different training and validation errors for different k values. We then select the value which gives the least error and we do this by plotting a curve between k and errors. This method is known as the elbow method as the curve resembles elbow. We use cross-validation to select the best k value.
It is evident from the above plot that, when k value is low we are overfitting i.e error is zero. The error reaches minima as we increase the k value. But error again increases if we further increase the k value and it leads to underfitting i.e, error is increasing drastically.
Pros and Cons
- It is a super interpretable model when the number of dimensions is less.
- But as the number of dimensions increases, it is less interpretable.
- It assumes nothing about the underlying data.
- It works well on multiclass data sets too.
- If we have more data, then it is computationally a lot expensive to store them in memory for the testing phase.
- As the number of dimensions increases, KNN may not work well because of the curse of dimensionality. (Euclidean distance is dependent on dimensions)
- It should not be used in low latency requirements as it takes much time for predicting the output.
- It is heavily affected by imbalanced datasets as one class dominates over the other.
Bias and variance trade-off
We should select a good k value so that there are less bias and less variance.
- Rescaling the features will improve accuracy. We generally normalize the data to achieve this.
- As it is affected by the curse of dimensionality, reducing dimensions using dimensionality reduction techniques improves the performance.
- We should choose the distance metric according to the problem like cosine similarity, Hamming distance when we have text data.
- Approximate nearest neighbors like the k-d tree can be used to store the data to reduce the testing time(Better for d<20). We can also use locality sensitive hashing (LHS) for higher dimensions.
In this blog, we have learned how KNN works for classification. It works similarly for regression problems by using the mean of k- instances rather majority voting system. It is very easy to understand and implement in Python using scikit learn library.
- A Complete Guide to K-Nearest-Neighbors with Applications in Python and R – here
- Applied AI course lectures – here
Post your queries in the comment section, I will be happy to answer them.