Logistic Regression

15 min read · Aug 18, 2020

Regression: Since the term appears in ‘logistic regression’, let's spare a few seconds to review it. As confusing as the name might be, rest assured that logistic regression is a classification algorithm and not a regression algorithm. It is about predicting binary variables, i.e. cases where the target variable is categorical.

Logistic regression is probably the first thing a budding data scientist should try in order to get the hang of classification problems.

Now let's go to the first word.

Logistic comes from the logistic function, or logistic curve, which is a common S-shaped curve with the equation

Y = 1 / (1 + e^(−Z))

It is a mathematical function that takes any real value and maps it to a value between 0 and 1, tracing a curve shaped like the letter “S”. The logistic function is also called the sigmoid function.

Its value lies strictly between 0 and 1. As ‘Z’ goes to infinity, Y(predicted) approaches 1, and as ‘Z’ goes to negative infinity, Y(predicted) approaches 0.
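To make the S-shape concrete, here is a minimal sketch of the sigmoid function in plain Python. The function name and the sample values of Z are my own choices for illustration.

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real z into the interval (0, 1)."""
    # Numerically stable form: never calls exp() on a large positive number
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

# Large negative Z gives values near 0, Z = 0 gives 0.5, large positive Z gives values near 1
for z in (-10, -1, 0, 1, 10):
    print(z, round(sigmoid(z), 5))
```

Note the symmetry: sigmoid(z) + sigmoid(-z) = 1, which is why Z = 0 maps to exactly 0.5.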

So now we are aware of the terms included in the name. Let's now explore the algorithm.

Logistic regression is a statistical method which uses categorical and continuous variables to predict a categorical outcome.
When the dependent variable (outcome variable) of choice has two categorical outcomes, the analysis is termed binary logistic regression🎭. However, if the outcome variable consists of more than two levels, the analysis is referred to as multinomial logistic regression.
In order to map predicted values to probabilities, we use the sigmoid function. It is a probabilistic model: while a deterministic model gives a single possible outcome for an event, a probabilistic model gives a probability distribution as a solution.

Types of Logistic Regression

1. Binary Logistic Regression: The categorical response has only two possible outcomes. E.g.: Spam or Not Spam

2. Multinomial Logistic Regression: Three or more categories without ordering. E.g.: Predicting which food is preferred more (Veg, Non-Veg, Vegan)

3. Ordinal Logistic Regression: Three or more categories with ordering. E.g.: Movie rating from 1 to 5

Assumptions in Logistic Regression🌀

It requires large sample sizes, because maximum likelihood estimates are less powerful at low sample sizes than ordinary least squares estimates.

Why logistic regression performs better than linear regression for classification problems 🤔

The predicted value is continuous, not probabilistic: The output of linear regression can be any real number, ranging from negative infinity to infinity, because the regression line is a straight line. Logistic regression, by contrast, is meant for classification problems and predicts a probability between 0 and 1.
Distribution of error terms: The distribution of the data differs between linear and logistic regression. Linear regression assumes that the error terms are normally distributed. In the case of binary classification, this assumption does not hold true.
Variance of residual errors: Linear regression assumes that the variance of the random errors is constant. This assumption is also violated when the outcome is binary.
Sensitivity to imbalanced data and outliers: Linear regression is based on OLS, i.e. on distance. It fits the regression line by minimising the prediction error, that is, the distance between the predicted and the actual values, so a single outlier can shift the best-fit line and, with it, the classifications.
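The points above can be seen on a tiny example. The sketch below fits a straight line to a binary target with ordinary least squares, using made-up blood sugar values (the data and the helper name `ols_fit` are hypothetical). The one extreme value of 400 is there on purpose to play the outlier.

```python
# Hypothetical toy data: blood sugar level (x) and diabetic label (y: 0 or 1)
x = [100, 120, 140, 160, 180, 200, 220, 240, 260, 400]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

def ols_fit(xs, ys):
    """Closed-form ordinary least squares for one predictor: y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = ols_fit(x, y)
preds = [a + b * xi for xi in x]

# The straight line is not bounded, so the "probability" at x = 400 exceeds 1
print("prediction at x = 400:", round(preds[-1], 3))
# The outlier also drags the line, so the diabetic patient at x = 200 falls below 0.5
print("prediction at x = 200:", round(a + b * 200, 3))
```

A sigmoid fitted to the same data would stay between 0 and 1 no matter how extreme x gets.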

How the Logistic Regression Algorithm Works 💼 ?

Let's go through the diabetes example, wherein we try to predict whether a person has diabetes or not based on that person’s blood sugar level.
The way to decide that is the Decision Boundary📏. To predict which class a data point belongs to, a threshold can be set. Based upon this threshold, the estimated probability is classified into classes. Say, if predicted_value ≥ 0.5, then classify the person as diabetic, else as non-diabetic.
A simple decision boundary ⛓ approach applied directly to the sugar level will not work very well for this example. It would be too risky to decide the class bluntly on the basis of the cutoff, because suppose this person’s sugar level (195 mg/dL) is very close to the threshold (200 mg/dL), below which people are declared as non-diabetic. It is, therefore, quite possible that this person was just a non-diabetic with a slightly high blood sugar level. After all, the data does have people with slightly high sugar levels (220 mg/dL), who are not diabetics.

Hence it is better to talk in terms of probability. One such curve which can model the probability of diabetes very well is the sigmoid curve.

Since the sigmoid curve has all the properties you would want (extremely low values at the start, extremely high values at the end, and intermediate values in the middle), it’s a good choice for modelling the probability of diabetes.

Making Predictions📈📊: Using our knowledge of sigmoid functions and decision boundaries, we can now write a prediction function. As the probability gets closer to 1, our model is more confident that the observation is in class 1.

We can replace Z in the sigmoid function by (α+βx) when there is one predictor, just as in linear regression:

P = 1 / (1 + e^(−(α+βx)))

P is the probability of the outcome occurring.

e is the base of the natural logarithm.

x is the value of the predictor

α is the intercept and β is the coefficient of the predictor x.
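Here is a rough sketch of what such a prediction function could look like, using P = 1 / (1 + e^(−(α+βx))) and a 0.5 cutoff. The values of `alpha` and `beta` are hypothetical numbers I picked so that the probability is 0.5 at a sugar level of 200. They are not fitted from any data.

```python
import math

def predict_proba(x, alpha, beta):
    """P(outcome = 1 | x) for a single predictor x."""
    return 1.0 / (1.0 + math.exp(-(alpha + beta * x)))

def predict_class(x, alpha, beta, threshold=0.5):
    """Turn the probability into a 0/1 label using a cutoff."""
    return 1 if predict_proba(x, alpha, beta) >= threshold else 0

# Hypothetical coefficients, chosen so that P is 0.5 at a blood sugar level of 200
alpha, beta = -12.0, 0.06

for sugar in (150, 195, 200, 220, 260):
    p = predict_proba(sugar, alpha, beta)
    print(sugar, round(p, 3), predict_class(sugar, alpha, beta))
```

Notice that a person at 195 gets a probability a bit below 0.5 rather than a flat "no", which is exactly the extra information the probabilistic view gives you.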

α and β are unknown, and they must be estimated based on the available training data using a technique known as Maximum Likelihood Estimation (MLE).

Just as ordinary least squares regression is the method used to estimate the coefficients of the best-fit line in linear regression, logistic regression uses maximum likelihood estimation (MLE) to obtain the model coefficients that relate the predictors to the target.

Note that the Z of the sigmoid function can also be replaced by:

Z = β0 + β1x1 + β2x2 + ... + βnxn

when there are n predictors.

The next step is to find the best-fit sigmoid curve. By varying the values of β0 and β1 (the intercept α and the coefficient β above), you get different sigmoid curves. Now, based on some function that you have to minimise or maximise, you will get the best-fit sigmoid curve.
Here comes the cost function of logistic regression, i.e. the likelihood 🥰 function.

The parameters (betas) of a logistic regression model can be estimated with the probabilistic framework called maximum likelihood estimation.
The likelihood function can then be optimized to find the set of parameters that results in the largest total log-likelihood over the training dataset. In other words, we treat the problem as an optimization or search problem, where we seek the set of parameters that gives the best fit for the joint probability of the data sample (X).
In the case of logistic regression, there are two approaches to MLE: the conditional and the unconditional method. They are algorithms that use different likelihood functions. The unconditional formula employs the joint probability of positives (for example, churn) and negatives (for example, non-churn). The conditional formula is the ratio of the probability of the observed data to the probability of all possible configurations. The unconditional method is preferred if the number of parameters is low compared to the number of instances. If the number of parameters is high compared to the number of instances, then conditional MLE is to be preferred.
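To get a feel for what "maximising the likelihood" means, here is a minimal sketch that fits α and β by plain gradient ascent on the log-likelihood. This is only one simple way to do the optimisation, and it is an assumption on my part. Libraries such as statsmodels and sklearn use more refined solvers. The toy data, the learning rate and the number of steps are all made up.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_likelihood(alpha, beta, xs, ys):
    """Sum over the data of y * ln(p) + (1 - y) * ln(1 - p)."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(alpha + beta * x)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return total

def fit(xs, ys, lr=0.1, steps=5000):
    """Plain gradient ascent on the average log-likelihood."""
    alpha, beta = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the log-likelihood: sum of (y - p) and of (y - p) * x
        errors = [y - sigmoid(alpha + beta * x) for x, y in zip(xs, ys)]
        alpha += lr * sum(errors) / n
        beta += lr * sum(e * x for e, x in zip(errors, xs)) / n
    return alpha, beta

# Hypothetical toy data: standardised blood sugar level and diabetic label.
# The two classes overlap, so the maximum likelihood estimate is finite.
xs = [-2.0, -1.5, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0, -0.2, 0.3]
ys = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0]

alpha, beta = fit(xs, ys)
print("log-likelihood at (0, 0):", round(log_likelihood(0.0, 0.0, xs, ys), 3))
print("log-likelihood at fit:   ", round(log_likelihood(alpha, beta, xs, ys), 3))
```

The fitted parameters give a higher log-likelihood than the starting point, which is all "best fit" means here.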

Model results in terms of Odds and Log Odds

Odds: The ratio of the probability of an event occurring to the probability of the event not occurring. For example,
let’s assume that the probability of winning a lottery is 0.01. Then, the probability of not winning is 1 − 0.01 = 0.99. The odds of winning the lottery = (Probability of winning)/(Probability of not winning) = 0.01/0.99. Hence, the odds of winning the lottery are 1 to 99, and the odds of not winning the lottery are 99 to 1.
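The lottery arithmetic above is easy to check in a couple of lines. The helper name `odds` is just for illustration.

```python
def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

p_win = 0.01
print(odds(p_win))       # odds of winning, i.e. 1 to 99
print(odds(1 - p_win))   # odds of not winning, i.e. 99 to 1
```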

While the equation for P is correct, it is not very intuitive. The relationship between P and x is so complex that it is difficult to see what kind of trend exists between the two. However, if you convert the equation to a slightly different form, you get a much more intuitive relationship:

ln(P/(1−P)) = α + βx

This is useful because we can see that the calculation of the output on the right is linear again (just like linear regression), and the input on the left is the log of the odds of the default class.

The ratio on the left, P/(1−P), is called the odds of the default class. Odds are calculated as the probability of the event divided by the probability of not the event.

The log of the odds has a linear relationship with the predictor, even though the probabilities themselves do not:

ln(odds) = b0 + b1 * X

Because the odds are log transformed, we call this left hand side the log-odds or the logit.
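You can verify the linearity of the log-odds numerically: push α + βx through the sigmoid, take the log of the odds of the result, and you get α + βx back. The coefficients below are hypothetical, picked only for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_odds(p):
    """Log of the odds of probability p: ln(p / (1 - p))."""
    return math.log(p / (1 - p))

alpha, beta = -12.0, 0.06  # hypothetical coefficients

for x in (180, 190, 200, 210, 220):
    p = sigmoid(alpha + beta * x)
    # The log-odds recover the linear term alpha + beta * x
    print(x, round(p, 3), round(log_odds(p), 3), round(alpha + beta * x, 3))
```

The probability column moves along an S-curve, while the log-odds column goes up by the same amount (β times the step in x) each time.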

Basically, the odds of having diabetes, P/(1−P), indicate how much likelier a person is to have diabetes than to not have it. For example, a person for whom the odds of having diabetes are equal to 4 is 4 times more likely to have diabetes than to not have it.

Basically, with every linear increase in x, the increase in odds is multiplicative. For example, in the diabetes case, after every increase of 11.5 in the value of x, the odds are approximately doubled, i.e. they increase by a multiplicative factor of about 2.
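A quick worked check of this multiplicative behaviour: if the odds double for every increase of 11.5 in x, then e^(β × 11.5) = 2, so β = ln(2)/11.5. The intercept below is a hypothetical value and plays no role in the ratio. The last line converts odds of 4 back into a probability.

```python
import math

# If the odds double for every increase of 11.5 in x, then exp(beta * 11.5) = 2
beta = math.log(2) / 11.5   # implied coefficient (derived here, not taken from a fitted model)
alpha = -12.0               # hypothetical intercept, for illustration only

def odds_at(x):
    """Odds P / (1 - P) implied by ln(odds) = alpha + beta * x."""
    return math.exp(alpha + beta * x)

def odds_to_prob(o):
    """Convert odds back to a probability: P = odds / (1 + odds)."""
    return o / (1 + o)

print(round(beta, 4))                   # implied coefficient
print(odds_at(211.5) / odds_at(200))    # ratio of odds after an increase of 11.5
print(odds_to_prob(4))                  # probability that corresponds to odds of 4
```

So odds of 4 correspond to a probability of 4/5: the event is four times as likely to happen as not.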

It often makes more sense to present a logistic regression model’s results in terms of log odds or odds than in terms of probability. This is especially common in industries like finance and banking.

Why can’t we use Mean Square Error (MSE) as a cost function for logistic regression🤨

One of the main reasons why MSE doesn’t work well with logistic regression is that when the MSE loss is plotted with respect to the weights of the logistic regression model, the resulting curve is not convex, which makes it very difficult to find the global minimum.

In logistic regression, we use the sigmoid function and perform a non-linear transformation to obtain the probabilities. Squaring this non-linear transformation leads to non-convexity with local minima, so gradient descent is not guaranteed to find the global minimum. For this reason, MSE is not suitable for logistic regression. Cross-entropy, or log loss, is used as the cost function instead. It is convex in the weights, and it penalises confident wrong predictions heavily while rewarding confident right predictions less. By optimising this cost function, convergence is achieved.
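To see how differently the two losses treat a confident wrong prediction, here is a small comparison for a single observation whose true label is 1. The function names are my own.

```python
import math

def log_loss_single(y, p):
    """Cross-entropy for one observation with true label y and predicted probability p."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def squared_error_single(y, p):
    """Squared error for the same observation."""
    return (y - p) ** 2

# True label is 1; the predicted probability moves from confident-right to confident-wrong
for p in (0.99, 0.9, 0.5, 0.1, 0.01):
    print(p, round(log_loss_single(1, p), 3), round(squared_error_single(1, p), 3))
```

The squared error can never exceed 1, while the log loss keeps growing as the predicted probability for the true class approaches 0.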

Now let's code it in Python 🐍

1. First, import the logistic regression class from sklearn and create a logistic regression object using:

from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

2. We take the train dataset and use it to build a model with statsmodels using:

import statsmodels.api as sm

# col is assumed to hold the names of the selected feature columns
X_train_sm = sm.add_constant(X_train[col])
logm1 = sm.GLM(y_train, X_train_sm, family=sm.families.Binomial())
res = logm1.fit()

Here, we use the GLM (Generalized Linear Models) method of the statsmodels library. ‘Binomial()’ in the ‘family’ argument tells statsmodels that it needs to fit a logit curve to binomial data (diabetic or not).

3. You can get the predicted probabilities by simply using the ‘predict’ function.

# Getting the predicted values on the train set
y_train_pred = res.predict(X_train_sm)

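4. Since the logistic curve gives you just the probabilities and not the actual classification of ‘Diabetic’ and ‘Non-Diabetic’, you need to find a threshold probability to classify patients as ‘Diabetic’ and ‘Non-Diabetic’. For now, let's keep it at 0.5. There is a way to find the optimal threshold value, which we will see later (hint: ROC curve). The first line below assumes that ‘predict’ returned a pandas Series, and collects the probabilities and the actual labels in one DataFrame so that both can be referenced by column name.

y_train_pred = y_train_pred.to_frame('Conversion_Prob').assign(diabetic=y_train)
y_train_pred['predicted'] = y_train_pred.Conversion_Prob.map(lambda x: 1 if x > 0.5 else 0)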

5. You chose a cutoff of 0.5 in order to classify the patients. Since you’re classifying the patients into two classes, you’ll obviously have some errors. The classes of errors are:

‘Diabetic’ patients being (incorrectly) classified as ‘Non-Diabetic’
‘Non-Diabetic’ patients being (incorrectly) classified as ‘Diabetic’

To capture these errors, and to evaluate how good the model is, you’ll use something known as the ‘Confusion Matrix’.

This table shows a comparison of the predicted and actual labels.

1. True Positive (TP): The model correctly predicted Positive cases as Positive.
2. False Positive (FP): The model incorrectly predicted Negative cases as Positive (Type I error).

3. False Negative (FN): The model incorrectly predicted Positive cases as Negative (Type II error).

4. True Negative (TN): The model correctly predicted Negative cases as Negative.
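If you want to see where the four numbers come from, the counting can be sketched by hand in a few lines. The labels below are hypothetical, and `confusion_counts` is an illustrative helper, not a library function.

```python
def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels coded as 0 and 1."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical labels for eight patients (1 = diabetic, 0 = non-diabetic)
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(actual, predicted)
print("TP:", tp, "FP:", fp, "FN:", fn, "TN:", tn)
```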

# Create confusion matrix
from sklearn import metrics
confusion = metrics.confusion_matrix(y_train_pred.diabetic, y_train_pred.predicted)

6. Now that we know about the confusion matrix, let’s see how good the model built so far is, based on accuracy.

# Calculate accuracy
print(metrics.accuracy_score(y_train_pred.diabetic, y_train_pred.predicted))

Model Evaluation metrics 📈

Why is accuracy not the best model evaluation metric for classification problems? Let's see with an example.

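Let's take the example of a churn vs. not-churn model with the following confusion matrix:

Actual ‘not-churn’: 3269 predicted as ‘not-churn’, 366 predicted as ‘churn’
Actual ‘churn’: 595 predicted as ‘not-churn’, 692 predicted as ‘churn’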

From the table above, you can see that there are 595 + 692 = 1287 actual ‘churn’ customers. But out of these 1287, the current model only predicts 692 as ‘churn’. Thus, only 692 out of 1287, or only about 54% of ‘churn’ customers, will be predicted by the model as ‘churn’. This is very risky: the company won’t be able to roll out offers to the remaining 46% of ‘churn’ customers, and they could switch to a competitor!

So although the accuracy is about 80%, the model only predicts about 54% of churn cases correctly.

In essence, what’s happening here is that you care more about one class (class=’churn’) than the other. This is a very common situation in classification problems: you almost always care more about one class than the other. Accuracy, on the other hand, tells you the model’s performance on both classes combined, which is fine, but not the most important metric.

Similarly, if you incorrectly predict many diseased patients as ‘Not having cancer’, it can be very risky. In such cases, instead of looking at the overall accuracy, you care about predicting the 1’s (the diseased) correctly. Hence, it is crucial to consider the overall business problem you are trying to solve when deciding which metric to maximise or minimise.

This brings us to (drum rolls 🥁🥁) -

1. Sensitivity & Specificity

Sensitivity=True Positives/(True Positives+False Negatives)

Sensitivity=692/1287≈53.768%

Thus, you can clearly see that although you had a high accuracy (~80.475%), your sensitivity turned out to be quite low (~53.768%)

Specificity is the corresponding measure for the negative class. It is given by the number of True Negatives (3269) divided by the actual number of negatives, i.e. True Negatives + False Positives (3269 + 366 = 3635). Hence, by replacing these values in the formula, you get specificity as:

Specificity=True Negatives/(True Negatives+False Positives)

Specificity=3269/3635≈89.931%
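The three numbers quoted for the churn example can be reproduced directly from the four confusion matrix cells given in the text:

```python
# Confusion matrix cells from the churn example
tn, fp, fn, tp = 3269, 366, 595, 692

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)    # true positive rate
specificity = tn / (tn + fp)    # true negative rate

print(f"accuracy    = {accuracy:.3%}")
print(f"sensitivity = {sensitivity:.3%}")
print(f"specificity = {specificity:.3%}")
```

The high accuracy is carried almost entirely by the large ‘not-churn’ class, which is why it hides the weak sensitivity.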

2. AUC — ROC Curve 🌈

The AUC-ROC curve is a performance measurement for classification problems at various threshold settings. It helps us evaluate the model as well as find the optimal threshold.

ROC is a probability curve, and AUC represents the degree or measure of separability. It tells how capable the model is of distinguishing between classes.

The higher the AUC, the better the model is at predicting 0s as 0s and 1s as 1s. The ROC curve is plotted with TPR against FPR, where TPR is on the y-axis and FPR is on the x-axis.

True Positive Rate (TPR)-

True Positive Rate (TPR)=True Positives/Total Number of Actual Positives

As you might remember, the above formula is nothing but the formula for sensitivity. Hence, the term True Positive Rate that you just learnt about is nothing but sensitivity.

False Positive Rate (FPR)-

False Positive Rate (FPR)=False Positives/Total Number of Actual Negatives

Again, if you recall the formula for specificity, it is given by -

Specificity=TN/(TN+FP)

Hence, you can see that the formula for False Positive Rate (FPR) is nothing but (1 — Specificity).

Notice that on the curve, when Sensitivity increases, (1 − Specificity) also increases, which simply means that Specificity is decreasing.

We can say:

TPR⬆️, FPR⬆️ and TPR⬇️, FPR⬇️

So you can clearly see that there is a tradeoff between the True Positive Rate and the False Positive Rate, or simply, a tradeoff between sensitivity and specificity. When you plot the true positive rate against the false positive rate, you get a graph which shows the trade-off between them and this curve is known as the ROC curve.

Let's see how to interpret 👓 the AUC-ROC curve:

For a completely random model, the ROC curve follows the 45-degree diagonal, and in the best case it passes through the upper-left corner of the graph. So a model that is no better than random has an area of 0.5, and the highest area a model can have is 1. (An area below 0.5 would mean the model ranks the classes worse than random guessing.)

By determining the area under the curve (AUC) of a ROC curve, you can determine how good the model is. If the ROC curve is closer to the upper-left corner of the graph, the model is very good, and if it is closer to the 45-degree diagonal, the model is almost completely random.
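As a rough sketch of how the curve and its area come about, the code below sweeps the cutoff over every predicted score, records an (FPR, TPR) point each time, and then adds up the area with the trapezoidal rule. The labels and scores are hypothetical, and the helper names are mine. In practice you would normally reach for a library routine instead.

```python
def roc_points(actual, scores):
    """(FPR, TPR) pairs obtained by sweeping the cutoff over every distinct score."""
    positives = sum(actual)
    negatives = len(actual) - positives
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for a, s in zip(actual, scores) if s >= cutoff and a == 1)
        fp = sum(1 for a, s in zip(actual, scores) if s >= cutoff and a == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical labels and predicted probabilities
actual = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]

points = roc_points(actual, scores)
print(points)
print("AUC:", auc(points))
```

A model that ranks every positive above every negative would give points hugging the upper-left corner and an area of 1.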

Lets see how we choose the optimal threshold📐

As you can see, when the probability thresholds are very low, the sensitivity is very high and the specificity is very low. Similarly, for larger probability thresholds, the sensitivity values are very low but the specificity values are very high. At about 0.3, the three metrics (accuracy, sensitivity and specificity) seem to be almost equal with decent values, and hence we choose 0.3 as the optimal cut-off point. The following graph also shows that the three metrics intersect at about 0.3.

You could choose any other cut-off point as well, based on which of these metrics you want to be high.
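The cutoff search described above can be sketched as a simple sweep: compute accuracy, sensitivity and specificity at each candidate cutoff and look at where they balance. The data is hypothetical, and picking the cutoff where sensitivity and specificity are closest is just one possible rule, assumed here for illustration.

```python
def metrics_at(actual, probs, cutoff):
    """Accuracy, sensitivity and specificity when predicting 1 for prob > cutoff."""
    tp = sum(1 for a, p in zip(actual, probs) if a == 1 and p > cutoff)
    fn = sum(1 for a, p in zip(actual, probs) if a == 1 and p <= cutoff)
    tn = sum(1 for a, p in zip(actual, probs) if a == 0 and p <= cutoff)
    fp = sum(1 for a, p in zip(actual, probs) if a == 0 and p > cutoff)
    accuracy = (tp + tn) / len(actual)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return accuracy, sensitivity, specificity

# Hypothetical labels and predicted probabilities
actual = [0, 0, 1, 0, 1, 1, 0, 1, 0, 0]
probs  = [0.05, 0.2, 0.25, 0.3, 0.45, 0.6, 0.65, 0.8, 0.15, 0.1]

cutoffs = [i / 10 for i in range(0, 10)]
table = [(c,) + metrics_at(actual, probs, c) for c in cutoffs]
for c, acc, sens, spec in table:
    print(f"{c:.1f}  accuracy={acc:.2f}  sensitivity={sens:.2f}  specificity={spec:.2f}")

# One possible rule: pick the cutoff where sensitivity and specificity are closest
best = min(table, key=lambda row: abs(row[2] - row[3]))
print("balanced cutoff:", best[0])
```

If missing a positive is costlier than a false alarm, you would instead pick a lower cutoff and accept the drop in specificity.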

3. Precision & Recall

The approach here is to find what percentage of the model’s positive (1’s) predictions are accurate. This is nothing but Precision.

Precision: Probability that a predicted ‘Yes’ is actually a ‘Yes’.

Precision=TP/(TP+FP)

Recall: Probability that an actual ‘Yes’ case is predicted correctly.

Recall=TP/(TP+FN)

Remember that ‘Recall’ is exactly the same as sensitivity. Don’t get confused between these.

You might be wondering, if these are almost the same, then why even study them separately? The main reason is that in the industry, some businesses follow the ‘Sensitivity-Specificity’ view and others follow the ‘Precision-Recall’ view, so it is helpful to know both of these standard pairs of metrics.

Similar to sensitivity and specificity, there is a trade-off between precision and recall as well.

F-measure: It is the harmonic mean of precision and recall. It is high only when both the precision and the recall are high, and it drops when one is traded off against the other. Depending on the business case at hand and the goal of the analysis, an appropriate metric should be selected.
F-measure=2×Precision×Recall/(Precision+Recall)
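Applying these formulas to the same churn confusion matrix used in the sensitivity and specificity section gives a quick worked example:

```python
# Confusion matrix cells from the churn example
tn, fp, fn, tp = 3269, 366, 595, 692

precision = tp / (tp + fp)   # share of predicted 'churn' that really churned
recall = tp / (tp + fn)      # same value as sensitivity
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision = {precision:.3f}")
print(f"recall    = {recall:.3f}")
print(f"F-measure = {f_measure:.3f}")
```

Because it is a harmonic mean, the F-measure sits between precision and recall but closer to the lower of the two.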

That's all for now 🤗

If you liked the article, show your support by clapping for it. This article is basically a compilation of many articles from Medium, Analytics Vidhya, upGrad material, etc.

If you are also learning machine learning like me, follow me for more articles. Let's go on this trip together :)

You can also follow me on LinkedIn