Learn about classification, logistic regression, and how to implement it.
- Linear Regression vs Logistic Regression
- Binary Logistic Regression
- Sigmoid Function
- Decision Boundary
- Multi-class Logistic Regression
- Scikit-learn Implementation
- Importing Model
- Parameter C (Regularization)
- Training and prediction
Logistic regression is a statistical method for analyzing a dataset in which one or more independent variables determine the class of each observation. It belongs to the family of linear models, which are widely used to classify data points into their respective classes.
In classification, the goal is to predict a class label chosen from a predefined list of possible classes. Classification can be binary, i.e. between 2 classes, or multi-class, i.e. between more than 2 classes.
Linear Regression vs Logistic Regression:
Linear regression and logistic regression are both members of the linear models family. Linear regression is used for regression problems, whereas logistic regression is used for classification problems. Suppose we are analyzing student data from a school. Predicting whether a student will pass the exam or not is a classification problem, and we use logistic regression for it. Predicting the student’s marks on a scale from 0 to 100 is a regression problem, and linear regression is the appropriate choice.
Binary Logistic Regression :
Binary logistic regression is used to classify between 2 classes, say 0 and 1. Some examples of binary classification are:
- E-mail: spam / not spam
- Churn Prediction
- Tumor: malignant / benign
Hypothesis Equation :
For binary logistic regression, we need a hypothesis function that satisfies 0 ≤ h(X) ≤ 1, so that we can classify every point for which h(X) > 0.5 as class 1 and the remaining points as class 0.
For linear regression, we saw:
h(X) = θ.T * X
where θ and X are the vectors of weights and attributes, and θ.T is the transpose of θ.
For logistic regression, we modify this equation to:
h(X) = σ(θ.T * X)
where σ() is known as the sigmoid function or logistic function.
Sigmoid Function :
In simple words, the sigmoid function converts any real number into a value between 0 and 1:
σ(z) = 1 / (1 + e^(-z))
Therefore, our hypothesis function becomes:
h(X) = 1 / (1 + e^(-θ.T * X))
A logistic or sigmoid curve can only take values between 0 and 1.
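As a minimal sketch, the sigmoid and the hypothesis h(X) = σ(θ.T * X) can be written in plain Python. The function names sigmoid and hypothesis and the example values are illustrative assumptions, not part of any library.

```python
import math

def sigmoid(z):
    """Convert any real number z into a value between 0 and 1."""
    # Branch on the sign of z so that exp() never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def hypothesis(theta, x):
    """Compute h(X) = sigmoid(theta.T * X) for two equal-length lists."""
    z = sum(t * x_i for t, x_i in zip(theta, x))
    return sigmoid(z)

# Illustrative values, chosen so that theta.T * X = 0.
theta = [0.5, -0.25]
x = [2.0, 4.0]
print(sigmoid(0))            # 0.5, the midpoint of the curve
print(hypothesis(theta, x))  # also 0.5, because theta.T * X = 0 here
```

Large positive inputs push the output towards 1 and large negative inputs push it towards 0, which is why the output can be read as a probability.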
Decision Boundary :
A decision boundary is the boundary that separates the classes. Since we have set the threshold (cut-off point) of the sigmoid function to 0.5, any point with h(X) greater than 0.5 is classified as class 1 and any point with h(X) less than 0.5 is classified as class 0. Because σ(z) = 0.5 exactly when z = 0, the decision boundary is the set of points where θ.T * X = 0.
Similar to linear regression, logistic regression learns the best weights for the given dataset with the help of gradient descent. To make a prediction, it first uses the hypothesis function to compute a probability and then applies the decision boundary to assign the point to a class.
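The thresholding step can be sketched as follows. The weights, the helper name predict_class and the convention of putting a constant 1 first in every attribute vector (for the bias term) are assumptions made for this example.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: maps any real z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def predict_class(theta, x, threshold=0.5):
    """Return 1 if h(X) is above the threshold, otherwise 0."""
    z = sum(t * x_i for t, x_i in zip(theta, x))
    return 1 if sigmoid(z) > threshold else 0

# Hypothetical weights; the first entry of every x is a constant 1 for the bias term.
# With these weights the decision boundary is the line x1 + x2 = 3.
theta = [-3.0, 1.0, 1.0]
print(predict_class(theta, [1.0, 1.0, 1.0]))  # 0: theta.T * X = -1, so h(X) < 0.5
print(predict_class(theta, [1.0, 2.0, 2.0]))  # 1: theta.T * X = 1, so h(X) > 0.5
```

With a threshold of 0.5, checking h(X) > 0.5 is the same as checking θ.T * X > 0, so the boundary itself is linear in the attributes.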
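To make the training step concrete, here is a rough sketch of batch gradient descent for logistic regression in plain Python. The learning rate, the number of iterations, the function names and the tiny "hours studied" dataset are all assumptions chosen for illustration; libraries typically use more sophisticated solvers.

```python
import math

def sigmoid(z):
    """Numerically stable sigmoid: maps any real z to a value between 0 and 1."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)

def train_logistic_regression(X, y, learning_rate=0.1, n_iterations=5000):
    """Learn the weights theta with batch gradient descent on the log loss."""
    n_samples, n_features = len(X), len(X[0])
    theta = [0.0] * n_features
    for _ in range(n_iterations):
        gradient = [0.0] * n_features
        for x_i, y_i in zip(X, y):
            # Prediction error h(x_i) - y_i for this sample.
            error = sigmoid(sum(t * v for t, v in zip(theta, x_i))) - y_i
            for j in range(n_features):
                gradient[j] += error * x_i[j]
        # Move the weights one step against the average gradient.
        theta = [t - learning_rate * g / n_samples for t, g in zip(theta, gradient)]
    return theta

# Hypothetical data: hours studied -> fail (0) or pass (1).
# The leading 1.0 in every row is the constant attribute for the bias term.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 5.0], [1.0, 6.0], [1.0, 7.0]]
y = [0, 0, 0, 1, 1, 1]
theta = train_logistic_regression(X, y)

def predict_probability(x):
    """Probability of class 1 under the learned weights."""
    return sigmoid(sum(t * v for t, v in zip(theta, x)))

print(predict_probability([1.0, 1.0]))  # probability of passing after 1 hour
print(predict_probability([1.0, 7.0]))  # probability of passing after 7 hours
```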
Multi-Class Logistic Regression:
Multi-class logistic regression is used when we have more than 2 predefined classes. A common technique for extending binary classification to multi-class classification is the one-vs-rest approach. In this approach, a binary model is learned for each class that tries to separate that class from all the other classes.
To make a prediction, all the binary classifiers are run on the point, and the classifier with the highest score for its own class wins, i.e. its class label is returned as the prediction.
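The one-vs-rest prediction rule can be sketched in a few lines. The class labels, the weights and the helper name predict_one_vs_rest are hypothetical; in practice each weight vector would come from training one binary model per class.

```python
def predict_one_vs_rest(class_weights, x):
    """Return the label whose binary classifier gives the highest score."""
    # The raw score theta.T * X is enough for ranking: the sigmoid is
    # increasing, so it would not change which class comes out on top.
    scores = {
        label: sum(t * v for t, v in zip(theta, x))
        for label, theta in class_weights.items()
    }
    return max(scores, key=scores.get)

# Hypothetical weights for three "this class vs the rest" models.
# The first entry of x is a constant 1 for the bias term.
class_weights = {
    "red": [0.0, 1.0, -1.0],
    "green": [0.0, -1.0, 1.0],
    "blue": [1.0, -1.0, -1.0],
}

print(predict_one_vs_rest(class_weights, [1.0, 3.0, 0.0]))  # red
print(predict_one_vs_rest(class_weights, [1.0, 0.0, 3.0]))  # green
print(predict_one_vs_rest(class_weights, [1.0, 0.0, 0.0]))  # blue
```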
Implementing Logistic Regression with Scikit-Learn :
We do not have to implement this model from scratch to use it; many libraries provide logistic regression out of the box. I’ll show you how to use logistic regression with scikit-learn.
Importing Model :
The first thing we need to do is import LogisticRegression from the sklearn library and then initialize the model.
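A minimal sketch of this step, assuming scikit-learn is installed. The variable name model is just a convention used in this article.

```python
from sklearn.linear_model import LogisticRegression

# Initialize the model with scikit-learn's default parameters.
model = LogisticRegression()

# get_params() lists every parameter of the model together with its current value.
for name, value in sorted(model.get_params().items()):
    print(f"{name} = {value}")
```

The exact list of parameters printed depends on the installed scikit-learn version.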
LogisticRegression has many parameters that help your model work better. We’ll discuss just one parameter (C); you can read about the rest in the documentation.
Parameter C :
This parameter determines the strength of regularization. C is the inverse of the regularization strength, so higher values of C correspond to less regularization. In simple words, when you use a high value of C, logistic regression tries to fit the training set as closely as possible. This may result in a variance (overfitting) problem, i.e. the model performs well on the training set but not on the test set. So you should try different values of C to find the one that generalizes best.
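One way to see the effect of C is to train the same model with several values and compare the results. The sketch below uses a synthetic dataset from make_classification purely for illustration; the dataset, the chosen values of C and max_iter=1000 are assumptions, not recommendations.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, used here only for illustration.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

results = {}
for C in [0.01, 1.0, 100.0]:
    model = LogisticRegression(C=C, max_iter=1000)
    model.fit(X_train, y_train)
    results[C] = {
        "train_accuracy": model.score(X_train, y_train),
        "test_accuracy": model.score(X_test, y_test),
        # Stronger regularization (smaller C) keeps the weights smaller.
        "weight_norm": float(np.linalg.norm(model.coef_)),
    }
    print(C, results[C])
```

Comparing the training and test accuracy for each value of C gives a rough picture of where the model starts to overfit. The accuracies depend on the data, so they are worth checking on your own dataset.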
Training and Predictions :
To train a model on your data, sklearn provides the fit() method.
>> model.fit(X_train, y_train)
Here X_train contains the attributes of the given dataset and y_train contains the class labels. This statement trains the model on the training set.
>> model.predict(X_test)
This statement predicts the class labels of the test set, where X_test holds the attributes of the test set.
We can also check how well our model performs with the score() method, which returns the mean accuracy on the data and labels passed to it.
>> model.score(X_test, y_test)
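Putting the three steps together, a complete run might look like the sketch below. The Iris dataset bundled with scikit-learn, the 75/25 split and max_iter=1000 are assumptions made to keep the example self-contained; any labeled dataset would do.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Load a small labeled dataset and hold out 25% of it as a test set.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Train the model on the training set.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict the class labels of the test set.
y_pred = model.predict(X_test)
print(y_pred[:5])

# score() returns the mean accuracy on the given data and labels.
accuracy = model.score(X_test, y_test)
print(accuracy)
```

Because Iris has three classes, this is also a small example of multi-class logistic regression.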
In this article, we talked about classification, what logistic regression is, how it works, and how to implement it. To make your model perform better on your data, you should try different values for its parameters.