George Boorman















Supervised Learning

Predicting values of a target variable given a set of features

  • For example, predicting if a customer will buy a product (target) based on their location and last five purchases (features).

Regression

  • Predicting the value of a continuous variable, e.g., house price.

Classification

  • Predicting which category an observation belongs to, e.g., whether a customer churns (a binary outcome).

Data Dictionary

The data has the following fields:

| Column name | Description |
| --- | --- |
| loan_id | Unique loan id |
| gender | Gender - Male / Female |
| married | Marital status - Yes / No |
| dependents | Number of dependents |
| education | Education - Graduate / Not Graduate |
| self_employed | Self-employment status - Yes / No |
| applicant_income | Applicant's income |
| coapplicant_income | Coapplicant's income |
| loan_amount | Loan amount (thousands) |
| loan_amount_term | Term of loan (months) |
| credit_history | Credit history meets guidelines - 1 / 0 |
| property_area | Area of the property - Urban / Semi Urban / Rural |
| loan_status | Loan approval status (target) - 1 / 0 |

# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
# Read in the dataset
loans = pd.read_csv("loans.csv")

# Preview the data
loans.head()

Exploratory Data Analysis

We can't just dive straight into machine learning! We need to understand and format our data for modeling. What are we looking for?

Cleanliness

  • Are columns set to the correct data type?
  • Do we have missing data?

Distributions

  • Some machine learning algorithms perform better when features are roughly normally distributed, so heavily skewed columns may need transforming.
  • Do we have outliers (extreme values)?
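
The distribution checks above can be sketched as follows. This is a minimal illustration on a small made-up income column, not the loans data; the 1.5 × IQR rule and the log transform are common conventions assumed here, not something the notebook prescribes.

```python
import numpy as np
import pandas as pd

# Hypothetical, made-up incomes (not taken from loans.csv)
income = pd.Series(
    [2500, 3000, 3200, 3500, 4000, 4200, 4500, 5000, 6000, 81000],
    name="applicant_income",
)

# Skewness: values well above 0 indicate a long right tail
skew_raw = income.skew()

# A log transform often reduces right skew (log1p also handles zeros)
skew_log = np.log1p(income).skew()

# Flag outliers using the 1.5 * IQR rule
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = income[(income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)]

print(f"Skew before: {skew_raw:.2f}, after log1p: {skew_log:.2f}")
print(outliers)
```

Whether to transform or remove such values depends on the model and on whether the extreme values are genuine.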

Relationships

  • If a column is strongly correlated with the target variable, it might be a good feature for predictions!

Feature Engineering

  • Do we need to modify any data, e.g., into different data types (ML models expect numeric data), or extract part of the data?
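
As a rough sketch of the cleanliness and feature engineering checks above, the example below counts missing values and converts text categories to numeric columns. The rows are made up (only the column names come from the data dictionary), and dropping incomplete rows is just one assumed option; imputing values is an alternative that keeps more data.

```python
import pandas as pd

# Hypothetical rows using column names from the data dictionary (values are made up)
sample = pd.DataFrame({
    "gender": ["Male", "Female", None, "Male", "Female"],
    "property_area": ["Urban", "Rural", "Semi Urban", "Urban", "Rural"],
    "loan_amount": [120.0, None, 95.0, 150.0, 110.0],
    "loan_status": [1, 0, 1, 1, 0],
})

# Count missing values per column
missing_counts = sample.isna().sum()
print(missing_counts)

# One simple option: drop rows that contain any missing value
cleaned = sample.dropna()

# Convert text categories into numeric indicator columns;
# drop_first=True removes one redundant indicator per original column
encoded = pd.get_dummies(cleaned, columns=["gender", "property_area"], drop_first=True)
print(encoded.dtypes)
```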
# Remove the loan_id to avoid accidentally using it as a feature
loans.drop(columns=["loan_id"], inplace=True)
# Counts and data types per column
loans.info()
# Distributions and relationships
sns.pairplot(data=loans, diag_kind="kde", hue="loan_status")
plt.show()
# Correlation between numeric variables (text columns are excluded)
sns.heatmap(loans.corr(numeric_only=True), annot=True)
plt.show()
# Target frequency
loans["loan_status"].value_counts(normalize=True)
# Class frequency by loan_status
for col in loans.columns[loans.dtypes == "object"]:
    sns.countplot(data=loans, x=col, hue="loan_status")
    plt.show()

Modeling

# First model using loan_amount as the only feature
X = loans[["loan_amount"]]  # 2D DataFrame of features
y = loans["loan_status"]    # 1D Series for the target

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.3,
                                                   random_state=42,
                                                   stratify=y)

# Previewing the training set
print(X_train[:5], "\n", y_train[:5])
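
The effect of the stratify argument used in the split can be illustrated with a small sketch. The arrays below are made up (a 70/30 class balance chosen for illustration), and the names X_demo and y_demo are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 70 ones and 30 zeros (not the real loans data)
y_demo = np.array([1] * 70 + [0] * 30)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42, stratify=y_demo
)

# With stratify, both splits keep the class balance of the full dataset
print("Share of positives in train:", y_tr.mean())
print("Share of positives in test:", y_te.mean())
```

Without stratification, a random split of a small or imbalanced dataset can leave the test set with a noticeably different share of each class.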
# Instantiate a logistic regression model
clf = LogisticRegression(random_state=42)

# Fit to the training data
clf.fit(X_train, y_train)

# Predict test set values
y_pred = clf.predict(X_test)

# Check the model's first five predictions
print(y_pred[:5])

Classification Metrics

 

Accuracy
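
Accuracy is the share of predictions that match the true labels, i.e. (TP + TN) / (TP + TN + FP + FN). The sketch below computes it on made-up labels and compares it with a hypothetical baseline that always predicts the majority class, which is a useful sanity check when the target is imbalanced.

```python
import numpy as np

# Made-up labels for illustration (1 = approved, 0 = not approved)
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
y_hat = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

# Accuracy = correct predictions / all predictions
accuracy = (y_true == y_hat).mean()

# Baseline: accuracy of always predicting the majority class (1 here)
baseline = (y_true == 1).mean()

print(f"Accuracy: {accuracy:.2f}, majority-class baseline: {baseline:.2f}")
```

For the fitted model in this notebook, clf.score(X_test, y_test) should return the same quantity for the test set.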

 

Confusion Matrix

True Positive (TP) = number of observations correctly predicted as positive

True Negative (TN) = number of observations correctly predicted as negative

False Positive (FP) = number of observations incorrectly predicted as positive (actually negative)

False Negative (FN) = number of observations incorrectly predicted as negative (actually positive)

 

| | Predicted: Negative | Predicted: Positive |
| --- | --- | --- |
| Actual: Negative | True Negative | False Positive |
| Actual: Positive | False Negative | True Positive |
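
A minimal sketch of building this matrix with scikit-learn's confusion_matrix, using made-up labels rather than the model's predictions. For binary 0/1 labels the returned array has rows for actual classes and columns for predicted classes, in the same order as the layout above.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = positive, 0 = negative)
y_true = np.array([1, 1, 1, 0, 1, 0, 1, 1, 0, 1])
y_hat = np.array([1, 1, 1, 1, 1, 0, 1, 0, 0, 1])

# Rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_true, y_hat)
print(cm)

# Flatten into the four counts: TN, FP, FN, TP
tn, fp, fn, tp = cm.ravel()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}")
```

With the model fitted earlier, the same call would be confusion_matrix(y_test, y_pred), and ConfusionMatrixDisplay(cm).plot() can draw the result.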

 

Confusion Matrix Metrics
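
The notebook does not show which metrics it covers under this heading. As an assumption, the sketch below computes four that are commonly derived from the confusion matrix counts (precision, recall, specificity, and F1), using made-up counts.

```python
# Counts from a hypothetical confusion matrix (made-up values)
tp, tn, fp, fn = 6, 2, 1, 1

# Precision: of everything predicted positive, how much was actually positive?
precision = tp / (tp + fp)

# Recall (sensitivity): of all actual positives, how many were found?
recall = tp / (tp + fn)

# Specificity: of all actual negatives, how many were correctly identified?
specificity = tn / (tn + fp)

# F1 score: harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"Specificity: {specificity:.3f}")
print(f"F1: {f1:.3f}")
```

Which metric matters most depends on the cost of each error type, for example approving a loan that should have been rejected versus rejecting one that should have been approved.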



