Ensemble Modeling - Loan Data
This dataset consists of more than 9,500 loans with information on the loan structure, the borrower, and whether the loan was paid back in full. This data was extracted from LendingClub.com, which is a company that connects borrowers with investors.
Not sure where to begin? Scroll to the bottom to find challenges!
Data dictionary
Variable | Explanation
---|---
credit_policy | 1 if the customer meets the credit underwriting criteria; 0 otherwise.
purpose | The purpose of the loan.
int_rate | The interest rate of the loan (riskier borrowers are assigned higher rates).
installment | The monthly installments owed by the borrower if the loan is funded.
log_annual_inc | The natural log of the borrower's self-reported annual income.
dti | The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico | The FICO credit score of the borrower.
days_with_cr_line | The number of days the borrower has had a credit line.
revol_bal | The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
revol_util | The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq_last_6mths | The borrower's number of creditor inquiries in the last 6 months.
delinq_2yrs | The number of times the borrower was 30+ days past due on a payment in the past 2 years.
pub_rec | The borrower's number of derogatory public records.
not_fully_paid | 1 if the loan is not fully paid; 0 otherwise.
Source of dataset.
Don't know where to start?
Challenges are brief tasks designed to help you practice specific skills:
- Explore: Generate a correlation matrix between the numeric columns. What columns are positively and negatively correlated with each other? Does it change if you segment it by the purpose of the loan?
- Visualize: Plot histograms for every numeric column, with a color element to segment the bars by `not_fully_paid`.
- Analyze: Do loans with the same purpose have similar qualities not shared by loans with differing purposes? You can consider only fully paid loans.
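The first challenge can be sketched with a small helper. This is an illustrative sketch, not part of the original analysis: `correlations` is a hypothetical function name, and it assumes the data has a `purpose` column alongside numeric features, as described in the data dictionary.

```python
import pandas as pd

def correlations(df: pd.DataFrame):
    """Correlation matrix of the numeric columns, overall and per loan purpose."""
    overall = df.select_dtypes("number").corr()
    by_purpose = {purpose: group.select_dtypes("number").corr()
                  for purpose, group in df.groupby("purpose")}
    return overall, by_purpose
```

Usage would look like `overall, by_purpose = correlations(loan_data)`, after which the per-purpose matrices can be compared against the overall one.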
Scenarios are broader questions to help you develop an end-to-end project for your portfolio:
You recently got a job as a machine learning scientist at a startup that wants to automate loan approvals. As your first project, your manager would like you to build a classifier to predict whether a loan will be paid back based on this data. There are two things to note. First, there is class imbalance: there are far fewer examples of loans that were not fully paid. Second, it is more important to accurately predict that a loan will not be paid back than that it will. Your manager will want to know how you accounted for both in training and evaluating your model.
You will need to prepare a report that is accessible to a broad audience. It will need to outline your motivation, analysis steps, findings, and conclusions.
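One hedged sketch of how the two caveats above could be handled: cost-sensitive class weighting during training (an alternative to the undersampling used later) and recall on the minority class during evaluation. The data here is synthetic, standing in for the loan features purely for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# Toy imbalanced data standing in for the loan features (illustration only)
rng = np.random.default_rng(2023)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=300) > 1.2).astype(int)

# class_weight="balanced" penalizes mistakes on the rare class more heavily,
# a cost-sensitive alternative to undersampling
clf = RandomForestClassifier(class_weight="balanced", random_state=2023)
clf.fit(X, y)

# Recall on class 1 ("not fully paid") answers the question that matters here:
# of the loans that were not paid back, how many did the model catch?
minority_recall = recall_score(y, clf.predict(X), pos_label=1)
```

In a real report this recall would of course be computed on a held-out test set, not on the training data as in this toy sketch.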
Load & Explore the data
import pandas as pd
loan_data = pd.read_csv("loan_data.csv")
loan_data.head()
print(loan_data['not.fully.paid'].value_counts())
loan_data['not.fully.paid'].value_counts().plot(kind='barh')
We notice an imbalanced data scenario: there are many more observations of class 0 than of class 1. One way of addressing this is undersampling, meaning that we reduce the number of class 0 observations so that there are as many of them as there are of class 1.
- First, count the observations of class 1.
- Then, sample the same number of observations from class 0 into a new dataframe.
- Finally, concatenate the two dataframes.
loan_data_class_1 = loan_data[loan_data['not.fully.paid'] == 1]
number_class_1 = len(loan_data_class_1)
# Sample the same number of class 0 rows (random_state makes the sample reproducible)
loan_data_class_0 = loan_data[loan_data['not.fully.paid'] == 0].sample(number_class_1, random_state=2023)
final_loan_data = pd.concat([loan_data_class_1,
loan_data_class_0])
print(final_loan_data.shape)
final_loan_data['not.fully.paid'].value_counts().plot(kind='barh')
Data preparation
By looking at the dataset, we notice that not all features are on the same scale, and some machine learning models, such as KNN, are sensitive to this. We can address it by normalizing the ranges of all features to the same scale, here between 0 and 1.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, 1))
# Drop the categorical 'purpose' column before scaling, then get the features
final_loan_data.drop('purpose', axis=1, inplace=True)
X = final_loan_data.drop('not.fully.paid', axis=1)
normalized_X = scaler.fit_transform(X)
from sklearn.model_selection import train_test_split
y = final_loan_data['not.fully.paid']
r_state = 2023
t_size = 0.33
# Split the data into train and test
X_train, X_test, y_train, y_test = train_test_split(normalized_X, y,
test_size=t_size,
random_state=r_state,
stratify=y)
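With the split in place, a random forest is one natural ensemble model to try for the title's "Ensemble Modeling". The helper below is a sketch: `fit_and_report` is a hypothetical function name, and it assumes the `X_train`/`y_train`/`X_test`/`y_test` arrays produced above.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

def fit_and_report(X_train, y_train, X_test, y_test, random_state=2023):
    """Fit a random forest on the balanced training data and return the model
    together with per-class precision/recall on the held-out test set."""
    model = RandomForestClassifier(n_estimators=200, random_state=random_state)
    model.fit(X_train, y_train)
    return model, classification_report(y_test, model.predict(X_test))
```

Usage: `model, report = fit_and_report(X_train, y_train, X_test, y_test)` followed by `print(report)`; the recall row for class 1 is the number to watch, given the manager's emphasis on catching loans that will not be paid back.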