In this post, we will look at Naive Bayes, one of the most common algorithms used for text classification and one of the easiest algorithms to start machine learning with.

**See my blogs on KNN and Logistic Regression if you haven’t read them yet.**

## Table of contents

- Introduction.
- Bayes theorem.
- What is the Naive Bayes algorithm?
- The training phase of Naive Bayes.
- The testing phase of Naive Bayes.
- Avoiding numerical underflow issues.
- Bias and Variance trade-off.
- Implementation in Python.
- Feature importance.
- Pros and Cons.
- Summary.

## Introduction

We call the algorithm **‘Naive’** because of the assumption it makes: that each feature is independent of every other feature, which is rarely true in reality. The **‘Bayes’** part is named after the great statistician and philosopher **Thomas Bayes**, who formulated Bayes theorem.

Note: Naive Bayes is a supervised learning algorithm.

## Bayes theorem

In its simplest form, Bayes theorem is written as

**P(A|B) = P(B|A) · P(A) / P(B)**

where,

- **P(A|B)** is the probability of hypothesis **A** given data **B**. This is called the **posterior probability**.
- **P(B|A)** is the probability of data **B** given that hypothesis **A** is true.
- **P(A)** is the probability of the hypothesis being true regardless of the data. This is called the **prior probability**.
- **P(B)** is the probability of the data. This is called the **evidence**.

Now let us see what **P(A|B)** and **P(B|A)** are.

These are **conditional probabilities** and have the following interpretation.

The above formula explains **P(A|B)** and the same holds for **P(B|A)**.

If you are still in a daze about Bayes theorem, have a look at the below image.
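As a quick numeric sanity check, here is a tiny sketch of Bayes theorem in Python; all the probability values are made up purely for illustration:

```python
# Bayes theorem: P(A|B) = P(B|A) * P(A) / P(B)
# All numbers below are hypothetical, chosen only to illustrate the formula.
p_b_given_a = 0.8   # P(B|A): probability of the data given the hypothesis
p_a = 0.3           # P(A): prior probability of the hypothesis
p_b = 0.4           # P(B): probability of the data (evidence)

p_a_given_b = p_b_given_a * p_a / p_b   # posterior P(A|B)
print(p_a_given_b)  # ≈ 0.6
```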

## What is Naive Bayes?

It uses probability to perform classification. To explore it, we will consider a simple text classification problem: given a piece of text, we should classify whether it is about sports or not.

If you are a newbie to machine learning, then Naive Bayes is the best thing to start off with.

## The training phase of Naive Bayes

**Step 1: Feature engineering**

In this step, we extract all the features. We simply apply Bag of Words (**BOW**) for each category, i.e., **sports** and **not sports**. All words belonging to **sports** go into one BOW and all words belonging to **not sports** go into another. Simply put, these contain the words and their counts.
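The BOW step can be sketched as follows. The training sentences below are hypothetical stand-ins (the post’s actual frequency table is an image that isn’t reproduced here), chosen so that the vocabulary comes out to 14 distinct words, matching the count used later:

```python
from collections import Counter

# Hypothetical training set: three sports and two not-sports sentences
train = [
    ("a great game", "sports"),
    ("the election was over", "not sports"),
    ("very clean match", "sports"),
    ("a clean but forgettable game", "sports"),
    ("it was a close election", "not sports"),
]

# One bag of words per category: word -> count
bows = {"sports": Counter(), "not sports": Counter()}
for text, label in train:
    bows[label].update(text.split())

vocab = {w for bow in bows.values() for w in bow}
print(len(vocab))              # 14 distinct words in total
print(bows["sports"]["game"])  # 'game' occurs twice under sports
```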

Now that we obtained the frequency table, as shown above, let us compute the likelihood table. The likelihood table is as follows.

The training phase in Naive Bayes is nothing but computing these probabilities and storing them.
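That computation can be sketched directly. The counts below come from a hypothetical toy corpus (three sports documents totalling 11 words, two not-sports documents totalling 9), not from the post’s table:

```python
from collections import Counter

# Assumed word counts per class from a hypothetical toy corpus
bows = {
    "sports": Counter({"a": 2, "game": 2, "clean": 2, "great": 1,
                       "very": 1, "match": 1, "but": 1, "forgettable": 1}),
    "not sports": Counter({"election": 2, "was": 2, "the": 1,
                           "over": 1, "it": 1, "a": 1, "close": 1}),
}
n_docs = {"sports": 3, "not sports": 2}

# Prior: fraction of training documents in each class
total_docs = sum(n_docs.values())
priors = {c: n / total_docs for c, n in n_docs.items()}

# Likelihood (no smoothing yet): P(word | class) = count / total words in class
likelihoods = {
    c: {w: cnt / sum(bow.values()) for w, cnt in bow.items()}
    for c, bow in bows.items()
}

print(priors["sports"])                         # 0.6
print(round(likelihoods["sports"]["game"], 3))  # 2/11 ≈ 0.182
```

Training is just these two dictionaries; nothing else needs to be stored.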

Great! We are done with the training of the Naive Bayes model!

Take a break if required!

## The testing phase of Naive Bayes

Now suppose you are given a sentence and must classify it. The sentence is **‘a very close game’**. Overall, we are interested in obtaining the quantities below.

The above equations read as: what is the probability of **‘a very close game’** belonging to **sports**, and to **not sports**? After computing them, we classify the sentence into whichever category gives the higher value. Applying Bayes theorem, the equations become as follows,

Because we are only trying to find out which category the given sentence falls into, we can discard the denominator **P(a very close game)** in both and just compare the numerators.

It now looks easier, since we could compute these probabilities directly: count how many times the sentence **‘a very close game’** appears with the **sports** label and with the **not sports** label, then divide by the totals to obtain the values above.

Here comes the caveat!

‘A very close game’ doesn’t appear in the training data, so its probability is zero. This implies that any sentence we want to classify would have to appear in the training data, and if that were the case our model wouldn’t be very useful, as we have just seen.

What to do?????

So here comes the *naive* assumption of the Naive Bayes algorithm: we assume that each word is independent of the other words. Now we need not look at the complete sentence; we can look at individual words.

The naive assumption made here is that each feature is independent of the other features.

In the above, X is an **n-dimensional** vector, and after applying the conditional independence assumption it becomes the expression on the RHS.

Now rewriting the probability is as follows.

Let’s go back to what we were doing with this. Applying the naive assumption to our problem we get the following equations

Now let’s jump into the calculation of the probabilities. Shhh! One more caveat: the word **close** doesn’t appear in the **sports** category, and this makes the product of the probabilities zero:

**P(close | sports)** = 0, leading to **P(a very close game | sports)** = 0.

*What to do???* A simple and intuitive solution is **Laplace smoothing**. Applying it incorporates a **pseudo-count** into every probability estimate, making no estimate zero. Intuitively it is a way of regularizing, and we shall discuss this in the bias-variance trade-off section.

**P(w | class) = (count(w, class) + α) / (total words in class + α · d)**

where the **pseudo-count** *α* > 0 is the hyperparameter of Naive Bayes. In most cases, *α* is equal to 1. In the above equation, *d* is the number of words in the vocabulary; here *d* = 14 (the total number of words is 14). So let *α* = 1 and *d* = 14, and it gives us a non-zero probability as follows,

Now let us get the final probabilities.
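A minimal sketch of the smoothed estimate, using the counts assumed in this example (zero occurrences of ‘close’ under sports, 11 sports words in total, d = 14):

```python
alpha = 1   # pseudo-count, the Naive Bayes hyperparameter
d = 14      # vocabulary size

count_close_sports = 0   # 'close' never appears in the sports BOW
total_sports_words = 11  # total word count in the sports BOW (assumed)

# Laplace-smoothed estimate: never zero as long as alpha > 0
p_close_sports = (count_close_sports + alpha) / (total_sports_words + alpha * d)
print(p_close_sports)  # 1/25 = 0.04
```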

From the above table, we see that **P(a very close game | sports)** is greater than **P(a very close game | not sports)**, which shows that the given sentence belongs to the **sports** category.

So during test time, we look up the training counts for the words that appear in the test sentence and then perform the process discussed above.
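Putting the whole testing phase together, here is a sketch under the same kind of assumed toy counts; it scores ‘a very close game’ against both classes with α = 1 and d = 14:

```python
from collections import Counter

# Assumed word counts per class from a hypothetical toy training set
bows = {
    "sports": Counter({"a": 2, "game": 2, "clean": 2, "great": 1,
                       "very": 1, "match": 1, "but": 1, "forgettable": 1}),
    "not sports": Counter({"election": 2, "was": 2, "the": 1,
                           "over": 1, "it": 1, "a": 1, "close": 1}),
}
priors = {"sports": 3 / 5, "not sports": 2 / 5}
alpha, d = 1, 14

def score(sentence, cls):
    """Prior times the product of Laplace-smoothed word likelihoods."""
    total = sum(bows[cls].values())
    p = priors[cls]
    for w in sentence.split():
        p *= (bows[cls][w] + alpha) / (total + alpha * d)
    return p

sentence = "a very close game"
scores = {c: score(sentence, c) for c in bows}
print(max(scores, key=scores.get))  # 'sports' gets the higher score
```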

## Avoiding numerical underflow issues.

As we have seen above, the final probabilities are quite small. This is because we multiply many tiny probabilities, which leads to numerical underflow. To avoid these underflow issues, we take the log, which lets us add log-probabilities rather than multiply probabilities.

We can take the log because it is a monotonic function: it preserves the ordering, keeping smaller probabilities smaller and larger ones larger.
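A quick demonstration of the underflow problem and the log fix (the 80 identical per-word probabilities are contrived for illustration):

```python
import math

probs = [1e-5] * 80  # many tiny per-word probabilities (contrived values)

# Multiplying them underflows to exactly 0.0 in double precision
product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0

# Summing logs instead stays comfortably representable,
# and the ordering between classes is preserved
log_score = sum(math.log(p) for p in probs)
print(log_score)  # about -921.03
```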

## Bias and Variance trade-off.

I will give you some intuition on it. As we have seen above, *α* is the hyperparameter in Naive Bayes, and using it we control overfitting and underfitting.

**Case 1:** When *α* is small (say zero), the model works well on train data, but whenever test data comes, as seen above, it fails to classify, i.e., it fails to generalize. Intuitively, performing well on train data and failing on test data is a case of **overfitting**. So if *α* is small, it leads to overfitting.

**Case 2:** If *α* is large (say in the 100s), the pseudo-counts drown out the actual probability estimates and simply give similar values for every example. This is clearly a situation of **underfitting**, where our model is not using the training data at all. So if *α* is large, it leads to underfitting.

We try different *α* values and select the one which gives the best accuracy.
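That selection loop can be sketched as below; the train/validation sentences and the candidate *α* grid are all assumptions made for illustration:

```python
import math
from collections import Counter

# Hypothetical train/validation split
train = [("a great game", "sports"), ("very clean match", "sports"),
         ("the election was over", "not sports"),
         ("it was a close election", "not sports")]
val = [("a clean game", "sports"), ("the election", "not sports")]

bows = {}
docs = Counter(label for _, label in train)
for text, label in train:
    bows.setdefault(label, Counter()).update(text.split())
d = len({w for bow in bows.values() for w in bow})  # vocabulary size

def accuracy(alpha):
    """Validation accuracy of Naive Bayes with a given pseudo-count."""
    def predict(text):
        def log_score(c):
            total = sum(bows[c].values())
            s = math.log(docs[c] / len(train))  # log prior
            for w in text.split():
                s += math.log((bows[c][w] + alpha) / (total + alpha * d))
            return s
        return max(bows, key=log_score)
    return sum(predict(t) == y for t, y in val) / len(val)

# Pick the alpha with the best validation accuracy
best_alpha = max([0.1, 1, 10, 100], key=accuracy)
```

On a real dataset this would be done with a proper cross-validation split rather than two sentences.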

## Implementation in Python

Go to this link, where I implemented Naive Bayes on the DonorsChoose dataset in full detail, including hyperparameter tuning. If you have any queries, drop me a mail.

## Feature importance

At this point, we have the likelihood of each word for each class. Sort the likelihood values in descending order and select the top ones: the words with the higher values contribute most to the corresponding category. The top words occur many times, and these occurrences, in turn, create high probabilities.

The above equation says that every word belonging to some class has a likelihood value. Based on the magnitude of that value, we arrive at the feature importance. This helps with the interpretability of the model; hence Naive Bayes is interpretable to some extent.
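Sorting the stored likelihoods is straightforward; the values below are assumed smoothed estimates for illustration, not taken from the post’s table:

```python
# Assumed smoothed likelihoods P(word | sports), for illustration only
likelihoods_sports = {"game": 0.12, "a": 0.12, "very": 0.08,
                      "close": 0.04, "match": 0.08}

# Words with the highest likelihood contribute most to the class
top = sorted(likelihoods_sports.items(), key=lambda kv: kv[1], reverse=True)
for word, p in top[:3]:
    print(word, p)
```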

## Pros and Cons

**Pros**

- It works well with high dimensional data and hence we use it for text classification problems as they have high dimensions.
- It is interpretable and used a lot in low latency systems as we only store the likelihood and prior probabilities.
- When the independence assumption holds true, it works a lot better compared to other machine learning algorithms.

**Cons**

- If the independence assumption fails, then it is a bad model and can’t be applied.
- If Laplace smoothing isn’t used, the model can easily overfit: whenever an unseen word arrives, it simply assigns a probability of zero. Hence we should use Laplace smoothing.
- In the example discussed, we have seen that Naive Bayes gives very small probability values, so we should not read too much into the values themselves, only into their comparison.

We can improve the accuracy of the Naive Bayes model by updating it with new vocabulary words rather than retraining it.

## Summary

- We have understood Naive Bayes in good detail.
- We have also seen the naive assumption it makes and understood it.
- The only hyperparameter in Naive Bayes is **α**. We simply tune it and get a decent model.
- It is used a lot in categorizing news, spam detection, digit recognition, sentiment analysis, and many more.