Beta
Loan Data
Ready to put your coding skills to the test? Join us for our Workspace Competition!
For more information, visit datacamp.com/workspacecompetition
Context
This dataset (source) consists of data from almost 10,000 borrowers that took loans - with some paid back and others still in progress. It was extracted from lendingclub.com which is an organization that connects borrowers with investors. We've included a few suggested questions at the end of this template to help you get started.
Load packages
library(skimr)
library(tidyverse)
library(FinCal)
library(parsnip)
library(rsample)
library(yardstick)
Load your Data
loans <- readr::read_csv('data/loans.csv.gz')
skim(loans) %>%
select(-(numeric.p0:numeric.p100)) %>%
select(-(complete_rate))
Understand your data
Variable | class | description |
---|---|---|
credit_policy | numeric | 1 if the customer meets the credit underwriting criteria; 0 otherwise. |
purpose | character | The purpose of the loan. |
int_rate | numeric | The interest rate of the loan (more risky borrowers are assigned higher interest rates). |
installment | numeric | The monthly installments owed by the borrower if the loan is funded. |
log_annual_inc | numeric | The natural log of the self-reported annual income of the borrower. |
dti | numeric | The debt-to-income ratio of the borrower (amount of debt divided by annual income). |
fico | numeric | The FICO credit score of the borrower. |
days_with_cr_line | numeric | The number of days the borrower has had a credit line. |
revol_bal | numeric | The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle). |
revol_util | numeric | The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available). |
inq_last_6mths | numeric | The borrower's number of inquiries by creditors in the last 6 months. |
delinq_2yrs | numeric | The number of times the borrower had been 30+ days past due on a payment in the past 2 years. |
pub_rec | numeric | The borrower's number of derogatory public records. |
not_fully_paid | numeric | 1 if the loan is not fully paid; 0 otherwise. |
Now you can start to explore this dataset with the chance to win incredible prices! Can't think of where to start? Try your hand at these suggestions:
- Extract useful insights and visualize them in the most interesting way possible.
- Find out how long it takes for users to pay back their loan.
- Build a model that can predict the probability a user will be able to pay back their loan within a certain period.
- Find out what kind of people take a loan for what purposes.
Judging Criteria
CATEGORY | WEIGHTAGE | DETAILS |
---|---|---|
Analysis | 30% |
|
Results | 30% |
|
Creativity | 40% |
|
# Adding columns based on existing ones
# Fico score ratings calculated based on information from https://www.myfico.com/credit-education/what-is-a-fico-score
loans_updated <- loans %>%
mutate(duration_in_months =
round(case_when(
is.nan(n.period(int_rate/12, -revol_bal, 0, installment , type = 0) ) ~ 0,
TRUE ~ n.period(int_rate/12, -revol_bal, 0, installment , type = 0)
),
0),
fico_rating = case_when( fico < 580 ~ "Poor",
fico > 580 & fico <= 679 ~ "Fair",
fico > 679 & fico <= 739 ~ "Good",
fico > 739 & fico <= 799 ~ "Very Good",
TRUE ~ "Exceptional"
),
customer_marked_delinq = case_when(delinq_2yrs > 0 ~ 1,
TRUE ~ 0
),
fully_paid = case_when(not_fully_paid == 0 ~ "Yes",
TRUE ~ "No"
) %>% as_factor()
)
# loans summary by type, fico rating, and paid status
loans_summary <- loans_updated %>%
group_by(purpose, fico_rating)%>%
summarize(purpose_counts = n(),
not_paid = sum(not_fully_paid)
) %>%
ungroup() %>%
group_by(purpose)%>%
mutate(purpose_total = sum(purpose_counts),
not_paid_total = sum(not_paid)) %>%
ungroup()
#Loans purpose and type of borrowers based on fico rating
loans_summary %>%
distinct(purpose, purpose_counts, fico_rating) %>%
ggplot(aes(x = reorder(purpose,-purpose_counts), y=purpose_counts, fill=fico_rating)) +
geom_col( ) +
theme_bw() +
labs(title = "Loan counts by purpose and fico rating",
caption = "As shown in the chart, majority of borrowers have either a Good or Very Good Fico Rating within each loan purpose." ,
fill = "Fico rating") +
theme(axis.text.x = element_text(angle = 45, hjust = 1, color="black"),
axis.text.y = element_text(color="black"),
axis.title = element_blank(),
plot.caption = element_text(hjust = 0, face="italic", colour = "red")
)
# Median time to repay the loan
paste("Median time to pay off the revolving credit facility is", loans_updated %>%
summarize(median_time_to_repay = median(duration_in_months)) %>%
pull(median_time_to_repay), "months"
)
# Loan delinquencies
loans_updated %>%
select(purpose, customer_marked_delinq) %>%
group_by(purpose) %>%
summarise(loan_counts = n(),
customers_marked_delinq = sum(customer_marked_delinq)
) %>%
ungroup() %>%
mutate(share_delinq_cust = scales::percent(customers_marked_delinq/loan_counts)) %>%
arrange(desc(share_delinq_cust))
# Model to predict probability of paying off the credit facility
set.seed(30002)
split_obj <- rsample::initial_split(loans_updated, prop = 3/4, strata = "days_with_cr_line")
train <- rsample::training(split_obj)
test <- rsample::testing(split_obj)
log_model <- logistic_reg("classification") %>%
set_engine("glm") %>%
fit(fully_paid ~ ., data = train %>%
select(-not_fully_paid,
-fico_rating,
-purpose,
-inq_last_6mths,
-int_rate,
-pub_rec,
-days_with_cr_line,
-revol_bal)
)
log_model$fit %>%
broom::tidy()%>%
arrange(p.value) %>%
mutate(term = as_factor(term) %>% fct_rev()) %>%
ggplot(aes(x = estimate, y = term)) +
geom_point() +
ggrepel::geom_label_repel(aes(label = estimate),
size = 3
) +
labs(title = "Model estimates in order of importance with 9 variables")
#predictions and confusion matrix
prediction_glm <- log_model %>%
predict(new_data = test) %>%
bind_cols(test %>% select(fully_paid))
prediction_glm %>%
conf_mat(fully_paid, .pred_class) %>% # confusion matrix function from yardstick package
pluck(1) %>% # it allows us to retrieve values from lists, vectors, matrix
as_tibble() %>%
ggplot(aes(Truth, Prediction, alpha = n)) +
geom_tile(show.legend = FALSE) +
geom_text(aes(label = n), color = "steelblue", alpha = 1, size = 8) +
theme_bw() +
labs(title = "Confusion matrix plot")
# Checking model accuracy
prediction_glm %>%
metrics(fully_paid, .pred_class) %>%
select(-.estimator) %>%
filter(.metric == "accuracy")