Logistic Regression Binary Classification

Logistic regression is a fundamental machine learning method that originated in statistics. It is a strong baseline for any binary classification problem, that is, one with exactly two possible outcomes. This template trains and evaluates a logistic regression model for such a problem. If you would like to learn more about logistic regression, take a look at DataCamp's Linear Classifiers in Python course.

To use your own dataset with this template, it must meet the following requirements:

  • There's at least one feature column and a column with a binary categorical target variable you would like to predict.
  • The features have been cleaned and preprocessed, including categorical encoding.
  • There are no NaN/NA values. You can use this template to impute missing values if needed.
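
The requirements above can be checked programmatically before training. The following is a minimal sketch and not part of the original template: the toy DataFrame and its column names (monthly_minutes, support_calls, churned) are hypothetical, and median imputation with scikit-learn's SimpleImputer is only one of several reasonable ways to fill missing values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data standing in for your own dataset
toy = pd.DataFrame(
    {
        "monthly_minutes": [120.0, np.nan, 310.0, 95.0],
        "support_calls": [1, 4, 0, 2],
        "churned": [0, 1, 0, 1],
    }
)
target = "churned"
features = toy.drop(columns=target)

# Requirement 1: the target has exactly two distinct values
is_binary = toy[target].nunique() == 2

# Requirement 2: every feature is numeric (i.e., already encoded)
non_numeric = features.select_dtypes(exclude="number").columns.tolist()

# Requirement 3: no missing values; here the gap is filled with the column median
missing_before = int(features.isna().sum().sum())
imputer = SimpleImputer(strategy="median")
features_imputed = pd.DataFrame(
    imputer.fit_transform(features),
    columns=features.columns,
    index=features.index,
)
missing_after = int(features_imputed.isna().sum().sum())

print("Binary target:", is_binary)
print("Non-numeric feature columns:", non_numeric)
print("Missing values before/after imputation:", missing_before, missing_after)
```

In a real workflow, the imputer would be fit on the training subset only, for the same reason the scaler is in section 2.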

The placeholder dataset in this template consists of churn data from a telecom company. Each row represents a customer over a year and whether the customer churned (the target variable; 1 = yes, 0 = no). You can find more information on this dataset's source and dictionary here.

1. Loading packages and data

# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    RocCurveDisplay,
)
from sklearn.model_selection import RandomizedSearchCV
from sklearn.preprocessing import StandardScaler

# Load the data (replace the path with your own CSV file)
df = pd.read_csv("data/customer_churn.csv")
df

# Check if there are any null values
print(df.isnull().sum())

# Check columns to make sure you have feature(s) and a target variable
df.info()

2. Splitting and standardizing the data

To split the data, we'll use the train_test_split() function. Then, we'll standardize the features using StandardScaler(). The scaler is fit on the training set only and then applied to both subsets, so that no information from the test set leaks into training. To learn more about standardizing data and preprocessing techniques, visit DataCamp's Preprocessing for Machine Learning in Python.

# Separate the data into X (a DataFrame of features) and y (a Series with the target variable)
X = df.iloc[:, 0:8]  # Specify at least one column as a feature (here, the first eight columns)
y = df["Churn"]  # Specify one column as the target variable

# Split the data into train and test subsets
# You can adjust the test size and random state
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=123
)

# Standardize X data based on X_train
sc = StandardScaler().fit(X_train)
X_train_scaled = sc.transform(X_train)
X_test_scaled = sc.transform(X_test)
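
To see what fitting the scaler on the training set alone does, the sketch below repeats the same steps on randomly generated data. The data and the variable names (X_demo, X_tr, X_te, and so on) are assumptions made for illustration. After scaling, the training features have mean 0 and standard deviation 1 by construction, while the test features are only approximately standardized because they were transformed with the training statistics.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the churn dataset)
rng = np.random.default_rng(123)
X_demo = rng.normal(loc=50.0, scale=10.0, size=(200, 3))
y_demo = rng.integers(0, 2, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123
)

# Fit on the training subset only, then transform both subsets
scaler = StandardScaler().fit(X_tr)
X_tr_s = scaler.transform(X_tr)
X_te_s = scaler.transform(X_te)

print("Train means:", X_tr_s.mean(axis=0).round(3))
print("Train stds: ", X_tr_s.std(axis=0).round(3))
print("Test means: ", X_te_s.mean(axis=0).round(3))
```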

3. Building a logistic regression classifier

The following code builds a scikit-learn logistic regression classifier (linear_model.LogisticRegression) using the most fundamental parameters. As a reminder, you can learn more about these parameters in DataCamp's Linear Classifiers in Python course and scikit-learn's documentation.

from sklearn import preprocessing

# Define parameters: these will need to be tuned to prevent overfitting and underfitting
params = {
    # Norm of the penalty: "l1", "l2", "elasticnet", or None ("none" before scikit-learn 1.2);
    # the default lbfgs solver supports only "l2" and None
    "penalty": "l2",
    "C": 1,  # Inverse of regularization strength, a positive float
    "random_state": 123,
}

# Create a logistic regression classifier object with the parameters above
clf = LogisticRegression(**params)

# Train the classifier on the train set
clf = clf.fit(X_train_scaled, y_train)

# Predict the outcomes on the test set
y_pred = clf.predict(X_test_scaled)
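
Because C is the inverse of the regularization strength, smaller values shrink the coefficients more. The sketch below illustrates this on synthetic data from make_classification; the data, the chosen C values, and the variable names are assumptions for illustration, not tuned settings for the churn dataset.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (an assumption; not the churn dataset)
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=123)

# No train/test split here: we only inspect the fitted coefficients
X_syn = StandardScaler().fit_transform(X_syn)

coef_norms = {}
for c_value in (0.01, 1, 100):
    # The default L2 penalty is used; only C changes between fits
    model = LogisticRegression(C=c_value, max_iter=1000, random_state=123)
    model.fit(X_syn, y_syn)
    coef_norms[c_value] = float(np.linalg.norm(model.coef_))
    print(f"C={c_value}: coefficient norm = {coef_norms[c_value]:.3f}")
```

A very small C can underfit and a very large C can overfit, which is why the template notes that these parameters need tuning.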

To evaluate this classifier, we can calculate the accuracy, precision, and recall scores. You'll have to decide which performance metric is best suited for your problem and goal.

# Calculate the accuracy, precision, and recall scores
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
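
To make the three scores concrete, the sketch below derives them by hand from the confusion matrix counts and compares the results with scikit-learn's functions. The label lists are made up for illustration and do not come from the churn data.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Hand-made labels (hypothetical): 1 = churned, 0 = stayed
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true_demo, y_pred_demo).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)  # share of all predictions that are correct
precision = tp / (tp + fp)  # share of predicted positives that are truly positive
recall = tp / (tp + fn)  # share of actual positives that were found

print("Accuracy:", accuracy, accuracy_score(y_true_demo, y_pred_demo))
print("Precision:", precision, precision_score(y_true_demo, y_pred_demo))
print("Recall:", recall, recall_score(y_true_demo, y_pred_demo))
```

In a churn setting, recall matters more when missing a churner is costly, and precision matters more when acting on a false alarm is costly.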

4. Other evaluation methods: confusion matrix and ROC curve

We can use a confusion matrix and a receiver operating characteristic (ROC) curve to get a fuller picture of the model's performance. These are available from sklearn's metrics module.

# Calculate confusion matrix
cnf_matrix = confusion_matrix(y_test, y_pred)

# Plot a labeled confusion matrix with Seaborn
sns.heatmap(cnf_matrix, annot=True, fmt="g")
plt.title("Confusion matrix")
plt.ylabel("Actual label")
plt.xlabel("Predicted label")