Machine Learning Series Part 2: Logistic Regression
Logistic regression is a popular classification algorithm used to predict the outcome of a binary or multi-class variable. It is widely used in various fields such as medicine, finance, and marketing.
What is Logistic Regression?
Logistic regression is a statistical method that is used to predict the probability of a binary outcome (i.e., a “yes” or “no” outcome) based on one or more predictor variables. The outcome variable is usually coded as 1 for the positive outcome and 0 for the negative outcome. The logistic regression model estimates the probability of the positive outcome using a logistic function. The logistic function is an S-shaped curve that maps any real-valued number to a probability value between 0 and 1.
The logistic regression model can be written as:
p(y=1|x) = 1 / (1 + exp(-z))
where:
- p(y=1|x) is the probability of the positive outcome given the predictor variable x
- exp is the exponential function
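- z is the linear combination of the predictor variables and their coefficients: z = b0 + b1*x1 + ... + bn*xn, where b0 is the intercept and b1, ..., bn are the coefficients of the predictors x1, ..., xn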
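To make the formula concrete, here is a minimal sketch that computes the probability by hand with NumPy. The intercept, coefficients, and input values are made up for illustration and do not come from any fitted model.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real number z to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical intercept and coefficients for two predictors (illustrative only)
intercept = -1.0
coefs = np.array([0.8, -0.5])

# A single hypothetical observation with two predictor values
x = np.array([2.0, 1.0])

# z is the linear combination: -1.0 + 0.8*2.0 + (-0.5)*1.0 = 0.1
z = intercept + coefs @ x

# p(y=1|x) is the logistic function applied to z
p = sigmoid(z)
print(f"z = {z:.2f}, p(y=1|x) = {p:.3f}")
```

Because z is slightly above 0, the resulting probability is slightly above 0.5. A z of exactly 0 gives a probability of exactly 0.5.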
Logistic regression is a type of generalized linear model (GLM) that uses a logit link function to model the relationship between the predictor variables and the binary outcome variable. The logit function, logit(p) = log(p / (1 - p)), is the inverse of the logistic function: it transforms probability values into log-odds values. In other words, the model assumes that the log-odds of the positive outcome are a linear function of the predictors.
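The inverse relationship between the logit and the logistic function can be checked numerically. The sketch below is only an illustration; the helper names sigmoid and logit are chosen here and are not part of any library used in this article.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: log-odds -> probability."""
    return 1.0 / (1.0 + np.exp(-z))

def logit(p):
    """Logit function: probability -> log-odds, defined for 0 < p < 1."""
    return np.log(p / (1.0 - p))

# A few arbitrary log-odds values
z = np.array([-2.0, 0.0, 3.0])

# Map to probabilities, then back to log-odds
p = sigmoid(z)
z_recovered = logit(p)

print("probabilities:", np.round(p, 3))
print("round trip recovers z:", np.allclose(z_recovered, z))
```

A probability of 0.5 corresponds to log-odds of 0, probabilities above 0.5 to positive log-odds, and probabilities below 0.5 to negative log-odds.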
Implementation
Scikit-learn is a popular Python library for machine learning. It provides a wide range of tools for various machine learning tasks, including logistic regression. Here is how to implement logistic regression using scikit-learn:
The first step is to import the necessary libraries and load the data. We will use the breast cancer dataset from scikit-learn’s dataset module.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
# load data
data = load_breast_cancer()
X = data.data
y = data.target
The next step is to split the data into training and testing sets. We will use 80% of the data for training and 20% for testing.
# split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
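Before fitting, it can help to confirm that the split has the expected sizes and that both sets contain a similar mix of classes. The following is a small self-contained sketch of such a check. It repeats the loading and splitting steps so that it runs on its own, and it uses return_X_y=True only as a shortcut for the data.data and data.target attributes used above.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

# Load features and labels directly as arrays
X, y = load_breast_cancer(return_X_y=True)

# Same split as in the article: 80% training, 20% testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Number of rows and columns in each set
print("train shape:", X_train.shape)
print("test shape:", X_test.shape)

# Fraction of each class (0 and 1) in each set
print("train class fractions:", np.bincount(y_train) / len(y_train))
print("test class fractions:", np.bincount(y_test) / len(y_test))
```

If the class fractions differ noticeably between the two sets, passing stratify=y to train_test_split keeps the class proportions the same in both.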
The third step is to fit the logistic regression model on the training data and make predictions on the testing data.
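# fit logistic regression model
# max_iter is raised from the default of 100 because the default lbfgs solver
# typically does not converge on these unscaled features within 100 iterations
lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)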
# make predictions on testing data
y_pred = lr.predict(X_test)
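The predicted labels come from the probabilities described earlier. As a rough sketch, the block below recomputes the probability of class 1 from the fitted model's linear score and compares it with scikit-learn's own predict_proba output. It repeats the earlier steps so that it runs on its own, and the max_iter=10000 setting is an assumption made here to give the solver enough iterations on unscaled data.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# max_iter=10000 is an assumption to give the solver room to converge
lr = LogisticRegression(max_iter=10000)
lr.fit(X_train, y_train)

# z: the linear combination of predictors and fitted coefficients
z = lr.decision_function(X_test)

# Apply the logistic function by hand
p_manual = 1.0 / (1.0 + np.exp(-z))

# Probability of class 1 as reported by scikit-learn
p_model = lr.predict_proba(X_test)[:, 1]
print("manual probabilities match:", np.allclose(p_manual, p_model))

# predict() corresponds to thresholding the class-1 probability at 0.5
y_pred_manual = (p_model > 0.5).astype(int)
print("thresholded labels match:", np.array_equal(y_pred_manual, lr.predict(X_test)))
```

Working with predict_proba instead of predict is useful when a threshold other than 0.5 suits the application better, for example when one type of error is more costly than the other.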
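The final step is to evaluate the performance of the model on the testing data. We can use various metrics such as accuracy, precision, recall, and F1-score to evaluate the model’s performance. By default, scikit-learn computes precision, recall, and F1-score for the class labeled 1, which in this dataset is the benign class (0 is malignant).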
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# evaluate model performance
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1-score: {f1:.2f}")
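To see where these four numbers come from, it can help to derive them from a confusion matrix. The sketch below uses a small set of made-up labels (not the breast cancer data) and compares the hand-computed values with scikit-learn's metric functions.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Made-up true and predicted labels, for illustration only
y_true = [1, 1, 1, 1, 0, 0, 0, 1, 0, 1]
y_hat = [1, 1, 0, 1, 0, 1, 0, 1, 0, 1]

# For binary labels 0/1, ravel() returns counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)                  # share of predicted positives that are correct
recall = tp / (tp + fn)                     # share of actual positives that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
print(f"Accuracy: {accuracy:.2f} (sklearn: {accuracy_score(y_true, y_hat):.2f})")
print(f"Precision: {precision:.2f} (sklearn: {precision_score(y_true, y_hat):.2f})")
print(f"Recall: {recall:.2f} (sklearn: {recall_score(y_true, y_hat):.2f})")
print(f"F1-score: {f1:.2f} (sklearn: {f1_score(y_true, y_hat):.2f})")
```

Accuracy alone can be misleading when one class is much more common than the other, which is why precision and recall are usually reported alongside it.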
Conclusion
In this article, we have discussed the basics of logistic regression: it models the probability of a positive outcome by applying the logistic function to a linear combination of the predictor variables. We have also shown how to implement it using scikit-learn, from loading and splitting the data to fitting the model, and how to evaluate its performance using accuracy, precision, recall, and F1-score. Logistic regression is a solid choice for binary and multi-class classification tasks, and scikit-learn makes it easy to implement and use.