In this post, we will see Naive Bayes which is one of the most common algorithms used for text classification. It is one of the easiest algorithms to start ML.
Table of contents
- Bayes theorem.
- What is the Naive Bayes algorithm?
- The training phase of Naive Bayes.
- The testing phase of Nayes Bayes.
- Avoiding numerical stability issues.
- Bias and Variance trade-off.
- Implementation in Python.
- Feature importance.
- Pros and Cons.
We call the algorithm ‘Naive’ because of the assumption it makes that each feature is independent of other features and in reality it is not true. The ‘Bayes’ part is given after the great statistician, philosopher, Thomas Bayes who created the Bayes theorem.
Note: Naive Bayes is a supervised learning algorithm
- P(A|B) is the probability of hypothesis A given data B. This is called Posterior probability.
- P(B|A) is the probability of data B given the hypothesis A is true.
- P(A) is the probability of the hypothesis being true regardless of data. This is called Prior probability.
- P(B) is the probability of data. This is called Evidence.
Now let us see what P(A|B) and P(B|A) are?
These are conditional probabilities and have the following interpretation.
The above formula explains P(A|B) and the same holds for P(B|A).
If you are still in a daze about Bayes theorem, have a look at the below image.
What is Naive Bayes?
It uses probability in doing the task of classification. We will consider a simple text classification problem in exploring it. Suppose we are working on building a classification problem which is about given a text, we should classify it whether its about sports or not.
If you are a newbie to machine learning, then Naive bayes is the best thing to start off.
The training phase of Naive Bayes
Step 1: Feature engineering
In this step, we try to extract all the features. We just apply Bag of words(BOW) for each category i.e, sports and not sports. All words belonging to Sports will go into one BOW and the words belonging to not sports will go to another BOW. Simply, these contain the words and their counts.
Now that we obtained the frequency table, as shown above, let us compute the likelihood table. The likelihood table is as follows.
Training phase in Naive Bayes is noothing but computing the probabilities and storing them.
Greatt! We are done with the training of The Naive Bayes model!
Take a break if required!
The testing phase of Naive Bayes
Now suppose you are given a sentence and you should classify it. The sentence is ‘a very close game’ and we have to classify it. Overall, we are interested in obtaining the below one.
The above equations are read as what is the probability of ‘a very close game’ belonging to sports and not sports. After computing them, which is higher we classify it to that one. Applying Bayes theorem, the equation becomes as follows,
Because we are only trying to find out which category the given sentence falls into, we can discard the denominator P(a very close game) in both and just compare the numerators
It is now easier since we could compute the probabilities and do the task. So now we have to count how many times the given sentence ‘a very close game’ appears with sports label and not sports label then divide it by total to obtain the above-mentioned values.
Here comes the caveat!
‘A very close game’ doesn’t appear in the training data, so the probability should be zero. By this we can say that if any sentence is to be classified, then it should be in training data. If this is true our model won’t be that useful as we have just seen.
What to do?????
So here comes the naive assumption of the Naive Bayes algorithm. Here we assume that each word is independent of other words. Now we need not look at the complete sentence, we can only look at individual words.
Naive assumption made here is that each feature is independent of other features.
In the above X is an n-dimensional vector and after applying conditional independence assumption it becomes as on RHS.
Now rewriting the probability is as follows.
Let’s go back to what we were doing with this. Applying the naive assumption to our problem we get the following equations
Now let’s jump into the calculation of probabilities. Shhh! one more caveat. The word close doesn’t appear in the sports category and this makes multiplying probabilities to zero. (P(close | sports) = 0, leading to P(a very close game | sports)=0)
What to do???
A simple and intuitive solution is Laplace smoothing. Applying this will incorporate a pseudo-count in every probability estimate and making no estimate zero. Intuitively it is a way of regularizing and we shall discuss this in Bias-variance trade-off.
where pseudo count α > 0is the hyperparameter in Naive Bayes. In most of the cases, α is equal to 1. In the above equation d is the number of words. So here d=14. (Total number of words are 14). So let α =1 and d=14, then it gives us a non zero probability as follows,
Now let us get the final probabilities.
From the above table, we see that P(a very close game | Sports) which shows that the given sentence belongs to the sports category.
So during test time, we use the counts of words in train which are in test and then perform the above discussed process
Avoiding numerical underflow issues.
As we have seen above that the final probabilities are quite small. It is because we were multiplying many tiny probabilities and it is leading to numerical underflow issues. So to avoid these underflow issues, we take the log and this simply adds the probabilities rather multiplying.
We take log because as it is a monotonic function and it keeps the smaller probability values smaller and vice-versa.
Bias and Variance trade-off.
I will give you some intuition on it. As we have seen above that α is the hyperparameter in Naive Bayes, using it we control overfitting and underfitting.
Case 1: When α is small(say zero), it works well on train data. Whenever the test data comes, as seen above it fails to classify i.e, it fails to generalize. Intuitively, we can say it is performing well on train data and failing on test data which is a case of overfitting. So if α is less, it leads to overfitting.
Case 2: If α is large(say in 100’s), then this quantity nullifies the probability estimates and simply gives the same value for every example. This is clearly a situation of underfitting where our model is not using training data at all. So if α is large, it leads to underfitting.
We try different α values and select the one which gives the best accuracy.
Implementation in Python
At this moment, we will be having the likelihoods of each and every word along with the class it belongs to. Sort all the values of the likelihoods in descending order and select the ones with higher ones. The ones with higher values will contribute to the corresponding category. Top words come a number of times and these, in turn, create high probabilities.
The above equation says that for every word which belongs to some class will have likelihood values. Based on the magnitude of the value, we arrive at feature importance. This helps in the interpretability of the model, hence Naive Bayes is interpretable to some extent.
Pros and Cons
- It works well with high dimensional data and hence we use it for text classification problems as they have high dimensions.
- It is interpretable and used a lot in low latency systems as we only store the likelihood and prior probabilities.
- When the independence assumption holds true, it works a lot better compared to other machine learning algorithms.
- If the independence assumption fails, then it is a bad model and can’t be applied.
- If Laplace smoothing isn’t used then it can easily overfit and whenever a new observation comes it simply assigns a probability of zero. Hence we should use Laplace smoothing.
- In the example discussed, we have seen that Naive Bayes gives the probability values which are very very less and hence we won’t consider these values much.
We can improve the accuracy of the Naive Bayes model by updating it with new vocabulary words rather retraining it.
- We have understood about Naive Bayes in good detail.
- We have also seen the naive assumption it makes and understood it.
- The only hyperparameter in Naive Bayes is α. We only tune it and get a decent model.
- It is used a lot in categorizing news, spam detection, digit recognition, sentiment analysis and many more.