In this post, I will walk you through a powerful classification algorithm.

*See my blog on **KNN** if you haven’t seen it yet.*

The gif above gives a clear visual picture of what logistic regression tries to do.

## Table of contents

- Introduction
- Types of variables
- Why can’t we use linear regression for classification?
- What is logistic regression?
- Assumptions
- Sigmoid function
- Cost function
- Types of logistic regression
- Feature importance
- Pros and Cons
- Bias and variance trade-off
- Softmax regression
- Summary

## Introduction

Logistic regression is a **supervised algorithm** used for classification tasks such as malignant/benign, pass/fail, or alive/dead. At its core, it uses the logistic function to model a binary dependent variable.

## Types of variables

There are two basic types of variables, namely categorical and numerical.

**Categorical**: These take on values that are names or labels. The color of a ball (blue or green) and the gender of a person (male or female) fall under categorical variables.

**Numerical**: These are quantitative in nature, meaning they are measurable. For instance, when someone speaks about the size of a class, they are referring to the number of students in it, which is a measurable attribute of the class.

## Why can’t we use linear regression for classification?

In the figure above, both linear regression and logistic regression are fit to a classification problem. The linear regression line has regions R1 and R2 that correspond to probabilities greater than 1 and less than 0, which makes no sense for a probability. Logistic regression, on the other hand, does the required task, as explained below.
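To see the problem concretely, here is a minimal sketch with made-up 1-D data: an ordinary least-squares line fit to binary labels happily predicts values below 0 and above 1.

```python
import numpy as np

# Made-up 1-D data: feature x, binary label y.
x = np.array([1.0, 2.0, 3.0, 8.0, 9.0, 10.0])
y = np.array([0, 0, 0, 1, 1, 1])

# Fit ordinary least squares: y ≈ a*x + b.
a, b = np.polyfit(x, y, 1)

# Predictions at the extremes fall outside [0, 1],
# which makes no sense as probabilities (the R1/R2 regions).
print(a * 0.0 + b)   # negative
print(a * 12.0 + b)  # greater than 1
```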

## What is logistic regression?

Logistic regression is a linear classifier that tries to find a plane separating the given data. The name suggests regression, but it is actually a classification algorithm: it computes the probability that an event occurs.

The name logistic comes from the logistic function, which it uses to squash values between 0 and 1.

## Assumptions

Every algorithm makes some assumptions; these are the assumptions made by logistic regression.

- It assumes that data is linearly separable.
- The model should have little or no multi-collinearity.
- It needs the dependent variable to be binary or ordinal as required.
- Independent variables should be linearly related to log odds.
- Logistic regression requires observations to be independent of each other.
- It typically requires a large sample size.

## Sigmoid function

As we know, we try to find a plane that best separates the points (it is a linear method). The equation below is often called the hypothesis equation.

The picture below shows three different linear classifiers, namely the perceptron, linear regression, and logistic regression. The notation used is: wᵢ → weights/coefficients, s → regression function, h → the hypothesis that the selected classifier produces, θ → sigmoid/logistic function.

As we see from Eq 1, we get real, continuous-valued outputs, which are not directly usable for classification. We therefore need a function that squashes these values into (0, 1), which is useful when working with probabilities, and the sigmoid function serves exactly this purpose.

From the above plot, we see that the curve is S-shaped: as t > 0 grows toward +∞, sig(t) rises from 0.5 and tends to 1; as t < 0 falls toward −∞, sig(t) tends to 0. Simply put, it is constrained by horizontal asymptotes, so the sigmoid is always bounded between 0 and 1.
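A minimal implementation makes these properties easy to verify:

```python
import math

def sigmoid(t):
    """Logistic (sigmoid) function: squashes any real t into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))

print(sigmoid(0))    # exactly 0.5
print(sigmoid(10))   # very close to 1
print(sigmoid(-10))  # very close to 0
```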

Now after passing the hypothesis function through the sigmoid function, the equation follows as below and the output gets squashed between 0 and 1.

where f(*x*) is a function of our features, generally a linear decision surface.

## Cost function

Now that we have learned what the sigmoid function is, let us derive the cost function. We can define the conditional probabilities for the *i*-th observation as follows. The equations are for the two labels, 0 and 1.

From the above equations, we can generalize as below.

If y=1, the equation reduces to h(x), and if y=0, it reduces to (1−h(x)).

Now let’s define another term **likelihood of parameters** as:

Maximum likelihood estimation (**MLE**) is a method of estimating the parameters of a probability distribution by maximizing a likelihood function. The coefficients of the model (the betas) are learned from the training data, and the best coefficients will predict a probability near 1 when y=1 and near 0 when y=0.

Since we are multiplying many probabilities, we may run into numerical stability issues. Hence to avoid such issues we take the log of likelihood. Taking log has many advantages in calculations like converting multiplications to additions.
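A quick numerical sketch shows why the log matters: the raw product of many small probabilities underflows to 0.0 in floating point, while the equivalent sum of logs stays finite.

```python
import math

# 100 probabilities of 1e-4 each: the true product is 1e-400,
# far below the smallest representable double.
probs = [1e-4] * 100

product = 1.0
for p in probs:
    product *= p
print(product)  # 0.0 — numerical underflow

# The sum of logs is the same quantity in log space, perfectly representable.
log_likelihood = sum(math.log(p) for p in probs)
print(log_likelihood)  # ≈ -921.03
```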

And finally, we obtain the cost function as:

Maximizing the above function is equivalent to minimizing the function below.

We use a negative sign because the log of a probability is negative, and a loss with a negative value has no meaning.

It can be thought of as penalizing the model for every misclassification; if a point is correctly classified with full confidence, the value is zero. The cost function for logistic regression is often called **log loss**, and it ranges from **zero to infinity**. We can use any optimizer to minimize it.
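As a sketch, log loss can be computed from scratch as the average binary cross-entropy (the `eps` clipping is a common trick to avoid taking `log(0)`):

```python
import math

def log_loss(y_true, y_pred, eps=1e-15):
    """Average binary cross-entropy over the observations."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

print(log_loss([1, 0, 1], [1.0, 0.0, 1.0]))  # ≈ 0: perfect predictions
print(log_loss([1, 0, 1], [0.6, 0.4, 0.7]))  # ≈ 0.46: mild penalty
print(log_loss([1, 0], [0.01, 0.99]))        # ≈ 4.6: confident and wrong
```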

A log loss of zero is the ideal condition.

## Types of logistic regression

Based on the target variable, there are 3 types, as follows:

**Binomial**: The target variable has only two possible values. Examples are pass/fail, win/loss, etc.

**Multinomial**: The target variable has more than 2 possible values and the categories have no quantitative ordering. Examples include ‘Cancer type A’, ‘Cancer type B’, and ‘Cancer type C’.

**Ordinal**: The same as multinomial, except the values have a quantitative ordering. Examples include ‘Poor’, ‘Good’, ‘Very good’, and ‘Best’.

## Feature importance

In the above equation, we observe *k* weights, which correspond to the *k* features in the dataset.

After training the model, we get the weight associated with each feature. The absolute value of a weight can be used for feature importance: a feature with a higher absolute value is given higher importance, and vice versa. A larger positive value for a feature signifies higher importance in predicting the positive class, and vice versa.

After obtaining the feature importance, we can remove the features with less significance and train the model again.
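For instance, with hypothetical feature names and learned weights, ranking by absolute value looks like this:

```python
# Hypothetical learned weights for illustrative feature names.
weights = {"age": 1.8, "bmi": -2.3, "glucose": 0.4, "bp": -0.1}

# Rank features by the absolute value of their weight.
ranked = sorted(weights, key=lambda f: abs(weights[f]), reverse=True)
print(ranked)  # ['bmi', 'age', 'glucose', 'bp'] — 'bp' is a removal candidate
```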

## Pros and Cons

**Pros**

- It is easy to implement and works well even in high-dimensional spaces.
- It can be used in low-latency systems, since we only store the weights and compute a prediction for the given test point.

**Cons**

- It fails when the data distribution is not linearly separable. To work around this, we can try different feature engineering techniques. The picture below gives an idea of this point.

With proper feature engineering, we can apply logistic regression and it works like a charm.
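As an illustrative sketch with made-up circular data: the two raw coordinates are not linearly separable, but the engineered feature x₁² + x₂² separates the classes with a single threshold, which is exactly the kind of boundary a linear classifier can learn.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up data: class 0 inside a disc, class 1 in an outer ring.
angles = rng.uniform(0, 2 * np.pi, 200)
r = np.concatenate([rng.uniform(0.0, 1.0, 100),   # class 0 radii
                    rng.uniform(2.0, 3.0, 100)])  # class 1 radii
X = np.column_stack([r * np.cos(angles), r * np.sin(angles)])
y = np.concatenate([np.zeros(100), np.ones(100)])

# No straight line in (x1, x2) separates the classes, but the
# engineered feature x1^2 + x2^2 does so with one threshold.
radius_sq = X[:, 0] ** 2 + X[:, 1] ** 2
pred = (radius_sq > 2.0).astype(float)  # any threshold in (1, 4) works
print((pred == y).mean())  # 1.0 — perfectly separable after the transform
```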

## Bias and variance trade-off

Here I use a slightly different derivation of logistic regression, which makes the bias and variance trade-off easier to explain.

The above equation is equivalent to the below one

From the above interpretation, we see that the expression approaches its minimum value of zero as the weights **w** grow without bound. This means there is a high chance of overfitting, and hence a larger generalization error on unseen data. To avoid this issue, we add a regularizer term to the cost function.

From the above figure, we see that the regularizer term has been added. Here λ is a hyperparameter that we tune for the best results. Regularization penalizes large weight coefficients and reduces the complexity of the model. Let’s discuss two cases.

**Case 1**: If λ = 0, we overfit as discussed above, so we shouldn’t choose very low λ values. This is a variance problem.

**Case 2**: If λ tends to infinity, the regularizer term dominates and we learn nothing from the training data, which is underfitting. This is a bias problem.
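The two cases can be sketched with a toy L2-penalized cost (the weights and data loss here are hypothetical numbers, just to show how λ shifts the balance):

```python
def l2_penalized_cost(w, lam, data_loss):
    """Total cost = data loss + λ * ||w||² (L2 regularization sketch)."""
    return data_loss + lam * sum(wi ** 2 for wi in w)

w = [3.0, -4.0]   # large weights: ||w||² = 25
data_loss = 0.1   # hypothetical training log loss

print(l2_penalized_cost(w, 0.0, data_loss))  # λ=0: only data loss, overfit risk
print(l2_penalized_cost(w, 1.0, data_loss))  # larger λ: big weights cost dearly
```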

The above discussion applies to L2 regularization; we can also use the L1 norm to get L1 regularization. Using L1 induces sparsity, i.e., the weights of unimportant features become exactly zero.

We can also use the elastic net regularizer, which is a hybrid of L1 and L2, at the cost of tuning two hyperparameters.

Regularization is a must to avoid both overfitting and underfitting problems.

## Softmax regression

Softmax regression is a generalization of logistic regression to multiclass classification. As the name suggests, it uses the softmax function instead of the sigmoid function. It is as follows.

where the input **z** is

The other concepts stay the same: we use an optimizer to minimize the cost function.
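A minimal NumPy sketch of the softmax, with the usual max-subtraction trick for numerical stability:

```python
import numpy as np

def softmax(z):
    """Map a vector of scores to a probability distribution."""
    e = np.exp(z - np.max(z))  # subtract max for numerical stability
    return e / e.sum()

p = softmax(np.array([2.0, 1.0, 0.1]))
print(p)        # largest score gets the largest probability
print(p.sum())  # 1.0 — a valid probability distribution
```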

Softmax is simply the extension of logistic regression for multiclass classification.

## Summary

- We have understood what the sigmoid function is and its essence in logistic regression.
- In this blog, we have seen what logistic regression is from a probability standpoint. It can also be derived using a loss-optimization framework or a geometric view.
- At test time, we simply take the test point, compute the probability, and classify it. This is generally a dot product between the test point and the weight vector, passed through the sigmoid function.
- We must transform the data into a format in which logistic regression works well.
- We try all three regularizers, i.e., L1, L2, and elastic net, then select the one that gives the best results.
- We learned about softmax regression, which is used for multiclass classification.

Post your queries in the comment section; I will be happy to answer them.