Logistic Regression Binary Classification

    Logistic regression is a fundamental machine learning method originally from the field of statistics. It's a great choice for generating a baseline for any binary classification problem (meaning there are only two outcomes). This template trains and evaluates a logistic regression model for a binary classification problem. If you would like to learn more about logistic regression, take a look at DataCamp's Linear Classifiers in Python course.

    To use your own dataset in this template, make sure it meets the following requirements:

    • There's at least one feature column and a column with a binary categorical target variable you would like to predict.
    • The features have been cleaned and preprocessed, including categorical encoding.
    • There are no NaN/NA values. You can use this template to impute missing values if needed.
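
The requirements above can be checked programmatically before running the rest of the template. The following is a minimal sketch, not part of the original template: the helper name `check_template_requirements` and the toy DataFrame are hypothetical, and the check treats any non-numeric, non-boolean feature column as "not yet encoded".

```python
import pandas as pd


def check_template_requirements(df: pd.DataFrame, target: str) -> list[str]:
    """Return a list of problems; an empty list means every check passed."""
    if target not in df.columns:
        return [f"target column {target!r} not found"]

    problems = []
    features = df.drop(columns=[target])

    # At least one feature column besides the target
    if features.shape[1] == 0:
        problems.append("no feature columns")

    # The target must take exactly two distinct values
    if df[target].nunique(dropna=True) != 2:
        problems.append("target is not binary")

    # Categorical features should already be encoded as numbers
    non_numeric = features.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric feature columns: {non_numeric}")

    # No missing values anywhere
    n_missing = int(df.isna().sum().sum())
    if n_missing:
        problems.append(f"{n_missing} missing values")

    return problems


# Toy data (hypothetical): the "plan" column has not been encoded yet
toy = pd.DataFrame(
    {"tenure": [1, 5, 3, 8], "plan": ["a", "b", "a", "b"], "Churn": [0, 1, 0, 1]}
)
print(check_template_requirements(toy, "Churn"))
```

An empty list suggests the dataset is ready for the steps that follow. Any message points to preprocessing that still needs to happen first.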

    The placeholder dataset in this template consists of churn data from a telecom company. Each row describes one customer over a year and records whether that customer churned (the target variable; 1 = yes, 0 = no). You can find more information on this dataset's source and dictionary here.

    1. Loading packages and data

    # Load packages
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import (
        accuracy_score,
        confusion_matrix,
        precision_score,
        recall_score,
        RocCurveDisplay,
    )
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn.preprocessing import StandardScaler
    
    # Load the data (replace the path with your own CSV file)
    df = pd.read_csv("data/customer_churn.csv")
    df
    # Check if there are any null values
    print(df.isnull().sum())
    # Check columns to make sure you have feature(s) and a target variable
    df.info()

    2. Splitting and standardizing the data

    To split the data, we'll use the train_test_split() function. Then, we'll standardize the input data using StandardScaler() (note: this should be done after splitting the data to avoid data leakage). To learn more about standardizing data and preprocessing techniques, visit DataCamp's Preprocessing for Machine Learning in Python.

    # Split the data into two DataFrames: X (features) and y (target variable)
    X = df.iloc[:, 0:8]  # First eight columns as features; adjust to your data (at least one column)
    y = df["Churn"]  # Specify one column as the target variable
    
    # Split the data into train and test subsets
    # You can adjust the test size and random state
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=123
    )
    
    # Fit the scaler on X_train only, then apply it to both subsets
    sc = StandardScaler().fit(X_train)
    X_train_scaled = sc.transform(X_train)
    X_test_scaled = sc.transform(X_test)
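
To make the leakage point concrete, here is a rough sketch of what fitting on the training subset and transforming both subsets amounts to, written with plain NumPy. The arrays `toy_train` and `toy_test` are made-up values, not the churn data. The sketch also skips details that `StandardScaler` handles for you, such as constant columns.

```python
import numpy as np

# Toy feature matrices (hypothetical values): rows are samples, columns are features
toy_train = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
toy_test = np.array([[2.0, 500.0]])

# Learn the statistics from the training rows only
mean = toy_train.mean(axis=0)
std = toy_train.std(axis=0)  # population standard deviation (ddof=0)

# Apply the same training statistics to both subsets
toy_train_scaled = (toy_train - mean) / std
toy_test_scaled = (toy_test - mean) / std

print(toy_train_scaled.mean(axis=0))  # close to 0 for every column
print(toy_test_scaled)  # the test row is expressed in training units
```

Because the test rows never contribute to `mean` or `std`, nothing about the test set influences how the training data is scaled.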

    3. Building a logistic regression classifier

    The following code builds a scikit-learn logistic regression classifier (linear_model.LogisticRegression) using the most fundamental parameters. As a reminder, you can learn more about these parameters in DataCamp's Linear Classifiers in Python course and scikit-learn's documentation.

    from sklearn import preprocessing
    
    # Define parameters: these will need to be tuned to prevent overfitting and underfitting
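    params = {
        # Norm of the penalty. The default "lbfgs" solver supports "l2" and None
        # (spelled "none" in older scikit-learn releases); "l1" and "elasticnet"
        # need a compatible solver such as "saga".
        "penalty": "l2",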
        "C": 1,  # Inverse of regularization strength, a positive float
        "random_state": 123,
    }
    
    # Create a logistic regression classifier object with the parameters above
    clf = LogisticRegression(**params)
    
    # Train the classifier on the train set
    clf = clf.fit(X_train_scaled, y_train)
    
    # Predict the outcomes on the test set
    y_pred = clf.predict(X_test_scaled)
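
As a side note, `predict()` returns hard 0/1 labels, while `predict_proba()` exposes the estimated probabilities behind them. The sketch below uses synthetic data from `make_classification` as a stand-in for the churn dataset. The names `X_demo`, `y_demo`, and `demo_clf` and the 0.3 cutoff are illustrative assumptions. The same calls should work on the template's `clf` and `X_test_scaled`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (not the churn dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=123)

demo_clf = LogisticRegression(C=1, random_state=123).fit(X_demo, y_demo)

# Column 1 holds the estimated probability of the positive class (label 1)
proba_pos = demo_clf.predict_proba(X_demo)[:, 1]

# For a binary problem, predict() corresponds to a 0.5 cutoff on that probability
default_pred = (proba_pos > 0.5).astype(int)

# A lower, hypothetical cutoff flags more rows as positive
lenient_pred = (proba_pos > 0.3).astype(int)

print("Positives at 0.5:", default_pred.sum())
print("Positives at 0.3:", lenient_pred.sum())
```

Lowering the cutoff generally trades precision for recall, which can matter when missing a churner costs more than a false alarm.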

    To evaluate this classifier, we can calculate the accuracy, precision, and recall scores. You'll have to decide which performance metric is best suited for your problem and goal.

    # Calculate the accuracy, precision, and recall scores
    print("Accuracy:", accuracy_score(y_test, y_pred))
    print("Precision:", precision_score(y_test, y_pred))
    print("Recall:", recall_score(y_test, y_pred))
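
To help decide between these metrics, it can be useful to see how each one is built from the four possible outcomes of a binary prediction. The following sketch uses a small set of made-up labels (not the churn data) and plain Python. The formulas match the default binary behaviour of `accuracy_score`, `precision_score`, and `recall_score` with 1 as the positive class.

```python
# Toy labels (hypothetical, not from the churn data): 1 = churned, 0 = stayed
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
y_hat = [1, 0, 0, 0, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_hat))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # churners correctly flagged
fp = sum(t == 0 and p == 1 for t, p in pairs)  # non-churners wrongly flagged
fn = sum(t == 1 and p == 0 for t, p in pairs)  # churners missed
tn = sum(t == 0 and p == 0 for t, p in pairs)  # non-churners correctly passed

accuracy = (tp + tn) / len(pairs)  # share of all predictions that are right
precision = tp / (tp + fp)  # share of predicted churners that really churned
recall = tp / (tp + fn)  # share of actual churners that were caught

print(f"Accuracy:  {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall:    {recall:.2f}")
```

In this toy case the accuracy looks reasonable even though half of the churners are missed, which is why accuracy alone can be misleading when one class is rarer than the other.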

    4. Other evaluation methods: confusion matrix and ROC curve

    We can use a confusion matrix and a receiver operating characteristic (ROC) curve to get a fuller picture of the model's performance. These are available from sklearn's metrics module.

    # Calculate confusion matrix
    cnf_matrix = confusion_matrix(y_test, y_pred)
    
    # Plot a labeled confusion matrix with Seaborn
    sns.heatmap(cnf_matrix, annot=True, fmt="g")
    plt.title("Confusion matrix")
    plt.ylabel("Actual label")
    plt.xlabel("Predicted label")