Competition - Loan Data

(Invalid URL)

Loan Data

Ready to put your coding skills to the test? Join us for our Workspace Competition!
For more information, visit datacamp.com/workspacecompetition

Context

This dataset (source) consists of data from almost 10,000 borrowers that took loans - with some paid back and others still in progress. It was extracted from lendingclub.com which is an organization that connects borrowers with investors. We've included a few suggested questions at the end of this template to help you get started.

Load packages

library(skimr)
library(tidyverse)
library(FinCal) 
library(parsnip) 
library(rsample)
library(yardstick)

Load your Data

loans <- readr::read_csv('data/loans.csv.gz')
skim(loans) %>% 
  select(-(numeric.p0:numeric.p100)) %>%
  select(-(complete_rate))

Understand your data

Variable	class	description
credit_policy	numeric	1 if the customer meets the credit underwriting criteria; 0 otherwise.
purpose	character	The purpose of the loan.
int_rate	numeric	The interest rate of the loan (more risky borrowers are assigned higher interest rates).
installment	numeric	The monthly installments owed by the borrower if the loan is funded.
log_annual_inc	numeric	The natural log of the self-reported annual income of the borrower.
dti	numeric	The debt-to-income ratio of the borrower (amount of debt divided by annual income).
fico	numeric	The FICO credit score of the borrower.
days_with_cr_line	numeric	The number of days the borrower has had a credit line.
revol_bal	numeric	The borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
revol_util	numeric	The borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq_last_6mths	numeric	The borrower's number of inquiries by creditors in the last 6 months.
delinq_2yrs	numeric	The number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub_rec	numeric	The borrower's number of derogatory public records.
not_fully_paid	numeric	1 if the loan is not fully paid; 0 otherwise.

Now you can start to explore this dataset with the chance to win incredible prices! Can't think of where to start? Try your hand at these suggestions:

Extract useful insights and visualize them in the most interesting way possible.
Find out how long it takes for users to pay back their loan.
Build a model that can predict the probability a user will be able to pay back their loan within a certain period.
Find out what kind of people take a loan for what purposes.

Judging Criteria

CATEGORY	WEIGHTAGE	DETAILS
Analysis	30%	Documentation on the goal and what was included in the analysis How the question was approached Visualisation tools and techniques utilized
Results	30%	How the results derived related to the problem chosen The ability to trigger potential further analysis
Creativity	40%	How "out of the box" the analysis conducted is Whether the publication is properly motivated and adds value

# Adding columns based on existing ones
# Fico score ratings calculated based on information from https://www.myfico.com/credit-education/what-is-a-fico-score

loans_updated <- loans %>%
                 mutate(duration_in_months = 
                              round(case_when(
                                       is.nan(n.period(int_rate/12, -revol_bal, 0, installment , type = 0) ) ~ 0,
                                       TRUE ~ n.period(int_rate/12, -revol_bal, 0, installment , type = 0)              
                                          ), 
                                    0),
                           
                             fico_rating = case_when( fico < 580 ~ "Poor",
                                                      fico > 580 & fico <= 679 ~ "Fair",
                                                      fico > 679 & fico <= 739 ~ "Good",
                                                      fico > 739 & fico <= 799 ~ "Very Good",
                                                      TRUE ~ "Exceptional"
                                                   ),
                        
                           customer_marked_delinq = case_when(delinq_2yrs > 0 ~ 1,
                                                              TRUE ~ 0
                                                              ),
                        
                          fully_paid = case_when(not_fully_paid == 0 ~ "Yes",
                                                 TRUE ~ "No"
                            
                                                  ) %>% as_factor()
                          )

# loans summary by type, fico rating, and paid status

loans_summary <- loans_updated %>%
                                  group_by(purpose, fico_rating)%>%
                                  summarize(purpose_counts = n(),
                                            not_paid = sum(not_fully_paid)
                                            ) %>%
                                                 ungroup() %>%
              
                                  group_by(purpose)%>%
                                  mutate(purpose_total = sum(purpose_counts),
                                         not_paid_total = sum(not_paid)) %>%
                                                                          ungroup()

#Loans purpose and type of borrowers based on fico rating

loans_summary %>%
         distinct(purpose, purpose_counts, fico_rating) %>%
          
                             ggplot(aes(x = reorder(purpose,-purpose_counts), y=purpose_counts, fill=fico_rating)) +
                                         geom_col(  ) +
  
                                                                              
                                         theme_bw() +
                                          
                                         labs(title   = "Loan counts by purpose and fico rating",
                                              caption = "As shown in the chart, majority of borrowers have either a Good or Very Good Fico Rating within each loan purpose." ,
                                                         
                                              
                                              fill    = "Fico rating") +
                                                                       
           
                                         theme(axis.text.x = element_text(angle = 45, hjust = 1, color="black"),
                                               axis.text.y = element_text(color="black"),
                                               axis.title = element_blank(),
                                               plot.caption = element_text(hjust = 0, face="italic", colour = "red")
                                               )

# Median time to repay the loan
  
paste("Median time to pay off the revolving credit facility is", loans_updated %>%
        summarize(median_time_to_repay = median(duration_in_months)) %>%
        pull(median_time_to_repay), "months"
       )

# Loan delinquencies

 loans_updated %>%
         select(purpose, customer_marked_delinq) %>%
         group_by(purpose) %>%
         summarise(loan_counts       = n(),
                   customers_marked_delinq   = sum(customer_marked_delinq)
                  ) %>%
                      ungroup() %>%
         mutate(share_delinq_cust =  scales::percent(customers_marked_delinq/loan_counts)) %>%
         
         arrange(desc(share_delinq_cust))

# Model to predict probability of paying off the credit facility
set.seed(30002)

split_obj <- rsample::initial_split(loans_updated, prop = 3/4, strata = "days_with_cr_line")

train <-  rsample::training(split_obj)
test  <-  rsample::testing(split_obj)

log_model <- logistic_reg("classification") %>%
                     set_engine("glm") %>%
                     fit(fully_paid ~ ., data = train %>% 
                                                        select(-not_fully_paid, 
                                                                -fico_rating,
                                                                 -purpose,
                                                                 -inq_last_6mths,
                                                                 -int_rate,
                                                                 -pub_rec,
                                                                 -days_with_cr_line,
                                                                 -revol_bal)
                        )

log_model$fit %>%
    broom::tidy()%>%
    arrange(p.value) %>%
    mutate(term = as_factor(term) %>% fct_rev()) %>%
    
    ggplot(aes(x = estimate, y = term)) + 
    geom_point() +
    
    ggrepel::geom_label_repel(aes(label = estimate),
        size = 3
    ) + 
    labs(title = "Model estimates in order of importance with 9 variables")

#predictions and confusion matrix

prediction_glm <- log_model %>%
                             predict(new_data = test) %>%
                             bind_cols(test %>% select(fully_paid))

prediction_glm %>%
    conf_mat(fully_paid, .pred_class) %>%        # confusion matrix function from yardstick package
    pluck(1) %>%                                  # it allows us to retrieve values from lists, vectors, matrix  
    as_tibble() %>%
    ggplot(aes(Truth, Prediction, alpha = n)) +
    geom_tile(show.legend = FALSE) + 
    geom_text(aes(label = n), color = "steelblue", alpha = 1, size = 8) +
    theme_bw() +
    labs(title = "Confusion matrix plot")

# Checking model accuracy

prediction_glm %>%
            metrics(fully_paid, .pred_class) %>%
            select(-.estimator) %>%
            filter(.metric == "accuracy")