Getting Started with Machine Learning in Python

Supervised Learning

Predicting the value of a target variable given a set of features.

For example, predicting if a customer will buy a product (target) based on their location and last five purchases (features).

Regression

Predicting the value of a continuous variable, e.g., house price.

Classification

Predicting a category, e.g., whether a customer will churn (a binary outcome).

Data Dictionary

The data has the following fields:

Column name	Description
`loan_id`	Unique loan id
`gender`	Gender - `Male` / `Female`
`married`	Marital status - `Yes` / `No`
`dependents`	Number of dependents
`education`	Education - `Graduate` / `Not Graduate`
`self_employed`	Self-employment status - `Yes` / `No`
`applicant_income`	Applicant's income
`coapplicant_income`	Coapplicant's income
`loan_amount`	Loan amount (thousands)
`loan_amount_term`	Term of loan (months)
`credit_history`	Credit history meets guidelines - `1` / `0`
`property_area`	Area of the property - `Urban` / `Semi Urban` / `Rural`
`loan_status`	Loan approval status (target) - `1` / `0`

# Import required libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Read in the dataset
loans = pd.read_csv("loans.csv")

# Preview the data
loans.head()

Exploratory Data Analysis

We can't just dive straight into machine learning! We need to understand and format our data for modeling. What are we looking for?

Cleanliness

Are columns set to the correct data type?
Do we have missing data?
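
Both cleanliness questions can be answered with a couple of pandas calls. The sketch below uses a small made-up frame (`toy_loans`, with hypothetical values) instead of the real `loans.csv`, so the numbers are for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for loans.csv (values are made up)
toy_loans = pd.DataFrame({
    "gender": ["Male", "Female", None, "Male"],
    "applicant_income": [5000, 3000, 4200, 6100],
    "loan_amount": [120.0, np.nan, 95.0, np.nan],
})

# Data type of each column
print(toy_loans.dtypes)

# Number of missing values per column
missing_counts = toy_loans.isna().sum()
print(missing_counts)
```

Columns with a non-zero count need attention (for example, imputing or dropping rows) before modeling, since many scikit-learn estimators, including `LogisticRegression`, do not accept missing values.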

Distributions

Some machine learning algorithms assume, or perform better on, data that is roughly normally distributed.
Do we have outliers (extreme values)?
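
One way to check these two points numerically is to look at skewness and to flag extreme values with the 1.5 x IQR rule of thumb. This is a sketch on made-up incomes (the `income` series is hypothetical, not taken from the dataset), and the IQR rule is only one of several possible outlier definitions.

```python
import pandas as pd

# Hypothetical incomes (made up); one value is far larger than the rest
income = pd.Series([2500, 3000, 3200, 3500, 4000, 4200, 4500, 50000],
                   name="applicant_income")

# Skewness: roughly 0 for symmetric data, positive for a long right tail
print(income.skew())

# Flag outliers with the common 1.5 * IQR rule of thumb
q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = income[(income < lower) | (income > upper)]
print(outliers)
```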

Relationships

If a feature is strongly correlated with the target variable, it might be a good feature for predictions!

Feature Engineering

Do we need to modify any data, e.g., convert it to a different data type (most ML models expect numeric data) or extract part of it?
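
As an illustration of converting text columns to numeric ones, the sketch below one-hot encodes two categorical columns with `pd.get_dummies`. The `toy_loans` frame and its values are made up; the column names follow the data dictionary. One-hot encoding is one option among several, not necessarily the approach used later in this tutorial.

```python
import pandas as pd

# Hypothetical toy data (values are made up)
toy_loans = pd.DataFrame({
    "education": ["Graduate", "Not Graduate", "Graduate"],
    "property_area": ["Urban", "Rural", "Semi Urban"],
    "loan_amount": [120, 95, 150],
})

# One-hot encode the text columns; drop_first removes one redundant
# indicator column per original column
encoded = pd.get_dummies(
    toy_loans,
    columns=["education", "property_area"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```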

# Remove the loan_id to avoid accidentally using it as a feature
loans.drop(columns=["loan_id"], inplace=True)

# Counts and data types per column
loans.info()

# Distributions and relationships
sns.pairplot(data=loans, diag_kind="kde", hue="loan_status")
plt.show()

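# Correlation between numeric variables
# (numeric_only=True skips text columns, which would otherwise raise an error in pandas 2.x)
sns.heatmap(loans.corr(numeric_only=True), annot=True)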
plt.show()

# Target frequency
loans["loan_status"].value_counts(normalize=True)

# Class frequency by loan_status
for col in loans.columns[loans.dtypes == "object"]:
    sns.countplot(data=loans, x=col, hue="loan_status")
    plt.show()

Modeling

# First model using loan_amount
X = loans[["loan_amount"]]
# Select the target as a 1-D Series; scikit-learn expects y with shape (n_samples,)
y = loans["loan_status"]

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X,
                                                   y,
                                                   test_size=0.3,
                                                   random_state=42,
                                                   stratify=y)

# Previewing the training set
print(X_train[:5], "\n", y_train[:5])

# Instantiate a logistic regression model
clf = LogisticRegression(random_state=42)

# Fit to the training data
clf.fit(X_train, y_train)

# Predict test set values
y_pred = clf.predict(X_test)

# Check the model's first five predictions
print(y_pred[:5])

Classification Metrics

Accuracy
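
Accuracy is the share of predictions that match the true labels. A minimal sketch using scikit-learn's `accuracy_score` on made-up labels (`y_true_toy` and `y_pred_toy` are hypothetical, not the loan model's output):

```python
from sklearn.metrics import accuracy_score

# Hypothetical true labels and predictions (made up for illustration)
y_true_toy = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 1]

# Fraction of positions where prediction equals the true label
acc = accuracy_score(y_true_toy, y_pred_toy)
print(acc)  # 6 of 8 predictions match -> 0.75
```

For the fitted classifier above, `clf.score(X_test, y_test)` should report the same kind of figure on the test set.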

Confusion Matrix

True Positive (TP) = number of observations correctly predicted as positive

True Negative (TN) = number of observations correctly predicted as negative

False Positive (FP) = number of observations incorrectly predicted as positive (actually negative)

False Negative (FN) = number of observations incorrectly predicted as negative (actually positive)

	Predicted: Negative	Predicted: Positive
Actual: Negative	True Negative	False Positive
Actual: Positive	False Negative	True Positive
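
The table above matches the layout that scikit-learn's `confusion_matrix` returns for labels `0` and `1`: rows are actual classes, columns are predicted classes. A sketch on made-up labels (hypothetical values, not the loan model's predictions):

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions (made up for illustration)
y_true_toy = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 1]

# Rows: actual 0, actual 1; columns: predicted 0, predicted 1
cm = confusion_matrix(y_true_toy, y_pred_toy)
print(cm)

# Unpack the four cells in row-major order
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)  # 2 1 1 4
```

To draw the matrix, `ConfusionMatrixDisplay(confusion_matrix=cm).plot()` followed by `plt.show()` should work with the imports at the top of this tutorial.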

Confusion Matrix Metrics
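
Metrics commonly derived from the four confusion matrix counts include precision, TP / (TP + FP), and recall, TP / (TP + FN). Which metrics this section was meant to cover is an assumption; the sketch below computes these two by hand and with scikit-learn on made-up labels (hypothetical values, not results from the loan model).

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical true labels and predictions (made up for illustration)
y_true_toy = [1, 0, 1, 1, 0, 1, 0, 1]
y_pred_toy = [1, 0, 0, 1, 0, 1, 1, 1]

tn, fp, fn, tp = confusion_matrix(y_true_toy, y_pred_toy).ravel()

# Precision: of everything predicted positive, how much was actually positive?
precision_manual = tp / (tp + fp)

# Recall: of everything actually positive, how much did we find?
recall_manual = tp / (tp + fn)

# The same metrics via scikit-learn helpers
precision = precision_score(y_true_toy, y_pred_toy)
recall = recall_score(y_true_toy, y_pred_toy)
print(precision, recall)
```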
