Getting Started with Machine Learning in Python

    Supervised Learning

    Predicting values of a target variable given a set of features

    • For example, predicting if a customer will buy a product (target) based on their location and last five purchases (features).

    Regression

    • Predicting the value of a continuous variable, e.g., house price.

    Classification

    • Predicting a discrete class label, often a binary outcome, e.g., customer churn.
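
    To make the distinction concrete, here is a minimal sketch on made-up data (the house sizes, prices, and "sold" labels are invented for illustration and are not part of the loans dataset). The same feature is used to predict a continuous value with a regression model and a class label with a classification model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Made-up feature: house size in square metres (illustrative only)
X = np.array([[50], [60], [80], [100], [120], [150]])

# Regression: the target is a continuous value (hypothetical price in thousands)
price = np.array([150.0, 180.0, 240.0, 300.0, 360.0, 450.0])
reg = LinearRegression().fit(X, price)
predicted_price = reg.predict([[90]])

# Classification: the target is a discrete label (hypothetical: 1 = sold, 0 = not sold)
sold = np.array([0, 0, 0, 1, 1, 1])
clf = LogisticRegression().fit(X, sold)
predicted_label = clf.predict([[90]])

print(predicted_price, predicted_label)
```

    The regression model returns a number on a continuous scale, while the classifier returns one of the labels it saw during training.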

    Data Dictionary

    The data has the following fields:

    Column name           Description
    loan_id               Unique loan id
    gender                Gender - Male / Female
    married               Marital status - Yes / No
    dependents            Number of dependents
    education             Education - Graduate / Not Graduate
    self_employed         Self-employment status - Yes / No
    applicant_income      Applicant's income
    coapplicant_income    Coapplicant's income
    loan_amount           Loan amount (thousands)
    loan_amount_term      Term of loan (months)
    credit_history        Credit history meets guidelines - 1 / 0
    property_area         Area of the property - Urban / Semi Urban / Rural
    loan_status           Loan approval status (target) - 1 / 0
    # Import required libraries
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
    # Read in the dataset
    loans = pd.read_csv("loans.csv")
    
    # Preview the data
    loans.head()

    Exploratory Data Analysis

    We can't just dive straight into machine learning! We need to understand and format our data for modeling. What are we looking for?

    Cleanliness

    • Are columns set to the correct data type?
    • Do we have missing data?
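
    Both cleanliness questions can be checked with a few pandas calls. The sketch below uses a tiny made-up frame that borrows some column names from the data dictionary; the values, and the idea that dependents is stored as text, are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Tiny made-up sample (values are illustrative, not from loans.csv)
sample = pd.DataFrame({
    "gender": ["Male", "Female", None, "Male"],
    "dependents": ["0", "1", "2", "0"],  # numbers stored as text
    "loan_amount": [120.0, np.nan, 95.0, 150.0],
    "credit_history": [1.0, 0.0, 1.0, 1.0],
})

# Data type of each column
print(sample.dtypes)

# Number of missing values per column
missing_counts = sample.isna().sum()
print(missing_counts)

# Convert a column that holds numbers as text into a numeric type
sample["dependents"] = pd.to_numeric(sample["dependents"])
print(sample["dependents"].dtype)
```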

    Distributions

    • Some machine learning algorithms work best when features are roughly symmetric (close to normally distributed) rather than heavily skewed.
    • Do we have outliers (extreme values)?
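
    One common way to check both points numerically is to look at skewness and apply the 1.5 × IQR rule. This is a sketch on made-up incomes; the IQR rule is one convention among several, not necessarily the approach used later in this notebook.

```python
import pandas as pd

# Made-up incomes with one extreme value (illustrative only)
income = pd.Series(
    [2500, 3000, 3200, 3500, 4000, 4200, 4500, 81000],
    name="applicant_income",
)

# Skewness: values well above 0 indicate a long right tail
skewness = income.skew()

# IQR rule: flag points more than 1.5 * IQR beyond the quartiles
q1 = income.quantile(0.25)
q3 = income.quantile(0.75)
iqr = q3 - q1
is_outlier = (income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)
outliers = income[is_outlier]

print(skewness)
print(outliers)
```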

    Relationships

    • A feature that is strongly correlated with the target variable might be a good predictor!
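
    A quick way to screen for such features is to rank the numeric columns by the absolute value of their correlation with the target. The frame below is made up for illustration, and correlation only captures linear relationships, so treat the ranking as a first look rather than a feature-selection rule.

```python
import pandas as pd

# Made-up sample (values are illustrative, not from loans.csv)
sample = pd.DataFrame({
    "credit_history": [1, 1, 0, 1, 0, 1, 0, 1],
    "loan_amount": [120, 95, 150, 110, 130, 100, 160, 105],
    "loan_status": [1, 1, 0, 1, 0, 1, 0, 0],
})

# Correlation of every numeric feature with the target
target_corr = sample.corr(numeric_only=True)["loan_status"].drop("loan_status")

# Rank by strength of the relationship, ignoring its sign
ranked = target_corr.abs().sort_values(ascending=False)
print(ranked)
```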

    Feature Engineering

    • Do we need to modify any data, e.g., convert it to a different data type (most ML models expect numeric input), or extract part of it?
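
    As one example of such a conversion, text categories can be turned into numeric indicator columns with pd.get_dummies. This is a sketch on made-up rows using category values from the data dictionary; whether one-hot encoding with drop_first=True is the right choice for the actual model is an assumption.

```python
import pandas as pd

# Made-up rows using category values from the data dictionary
sample = pd.DataFrame({
    "education": ["Graduate", "Not Graduate", "Graduate"],
    "property_area": ["Urban", "Rural", "Semi Urban"],
    "loan_amount": [120, 95, 150],
})

# One-hot encode the text columns; drop_first avoids a redundant column per feature
encoded = pd.get_dummies(
    sample,
    columns=["education", "property_area"],
    drop_first=True,
    dtype=int,
)
print(encoded)
```
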
    # Remove the loan_id to avoid accidentally using it as a feature
    loans.drop(columns=["loan_id"], inplace=True)
    # Counts and data types per column
    loans.info()
    # Distributions and relationships
    sns.pairplot(data=loans, diag_kind="kde", hue="loan_status")
    plt.show()
    # Correlation between numeric variables (text columns are excluded)
    sns.heatmap(loans.corr(numeric_only=True), annot=True)
    plt.show()
    # Target frequency
    loans["loan_status"].value_counts(normalize=True)
    # Class frequency by loan_status
    for col in loans.columns[loans.dtypes == "object"]:
        sns.countplot(data=loans, x=col, hue="loan_status")
        plt.show()

    Modeling

    # First model using loan_amount
    X = loans[["loan_amount"]]
    y = loans["loan_status"]  # 1-D Series, the shape scikit-learn expects for a target
    
    # Split into training and test sets
    X_train, X_test, y_train, y_test = train_test_split(X,
                                                       y,
                                                       test_size=0.3,
                                                       random_state=42,
                                                       stratify=y)
    
    # Previewing the training set
    print(X_train[:5], "\n", y_train[:5])
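
    The stratify argument in the split above keeps the share of each class the same in the training and test sets. A minimal sketch on a made-up, imbalanced target (70 approvals, 30 rejections, invented for illustration) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up imbalanced target: 70 approvals (1) and 30 rejections (0)
y = np.array([1] * 70 + [0] * 30)
X = np.arange(100).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)

# Share of approvals in each split
train_share = y_train.mean()
test_share = y_test.mean()
print(train_share, test_share)
```

    Without stratify, the two shares can drift apart by chance, which matters most when one class is rare.
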
    # Instantiate a logistic regression model
    clf = LogisticRegression(random_state=42)
    
    # Fit to the training data
    clf.fit(X_train, y_train)
    
    # Predict test set values
    y_pred = clf.predict(X_test)
    
    # Check the model's first five predictions
    print(y_pred[:5])

    Classification Metrics

    Accuracy
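
    Accuracy is the share of predictions that match the true labels. A minimal sketch with made-up labels, using scikit-learn's accuracy_score (a function not imported in the code above, so its use here is an assumption about how the metric would be computed):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions (illustrative only)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_hat = np.array([1, 0, 1, 0, 0, 1, 1, 1])

# Library call
accuracy = accuracy_score(y_true, y_hat)

# Same value computed by hand: fraction of positions where the labels agree
manual_accuracy = (y_true == y_hat).mean()

print(accuracy, manual_accuracy)
```

    Accuracy can look good on imbalanced data even when the model ignores the rare class, which is why it is usually read alongside a confusion matrix.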

    Confusion Matrix

    True Positive (TP) = number of observations correctly predicted as positive

    True Negative (TN) = number of observations correctly predicted as negative

    False Positive (FP) = number of observations incorrectly predicted as positive (actually negative)

    False Negative (FN) = number of observations incorrectly predicted as negative (actually positive)

                        Predicted: Negative    Predicted: Positive
    Actual: Negative    True Negative          False Positive
    Actual: Positive    False Negative         True Positive
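
    The four counts can be read directly from scikit-learn's confusion_matrix, which for 0/1 labels uses the same layout as the table above (rows are actual classes, columns are predicted classes). The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true labels and predictions (illustrative only)
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_hat = np.array([1, 0, 1, 0, 0, 1, 1, 1])

# Rows: actual class (0, 1); columns: predicted class (0, 1)
cm = confusion_matrix(y_true, y_hat)

# Flatten row by row to get the four counts
tn, fp, fn, tp = cm.ravel()

print(cm)
print(tn, fp, fn, tp)
```

    With the ConfusionMatrixDisplay class imported at the top of this notebook, ConfusionMatrixDisplay(confusion_matrix=cm).plot() followed by plt.show() should draw the same matrix as a chart.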

    Confusion Matrix Metrics