Skip to content

(Invalid URL)

Loan Data

Ready to put your coding skills to the test? Join us for our Workspace Competition!
For more information, visit datacamp.com/workspacecompetition

Context

This dataset (source) consists of data from almost 10,000 borrowers that took loans - with some paid back and others still in progress. It was extracted from lendingclub.com which is an organization that connects borrowers with investors. We've included a few suggested questions at the end of this template to help you get started.

Load packages

library(skimr)
library(tidyverse)
library(FinCal) 
library(parsnip) 
library(rsample)
library(yardstick)

Load your Data

loans <- readr::read_csv('data/loans.csv.gz')
skim(loans) %>% 
  select(-(numeric.p0:numeric.p100)) %>%
  select(-(complete_rate))

Understand your data

Variableclassdescription
credit_policynumeric1 if the customer meets the credit underwriting criteria; 0 otherwise.
purposecharacterThe purpose of the loan.
int_ratenumericThe interest rate of the loan (more risky borrowers are assigned higher interest rates).
installmentnumericThe monthly installments owed by the borrower if the loan is funded.
log_annual_incnumericThe natural log of the self-reported annual income of the borrower.
dtinumericThe debt-to-income ratio of the borrower (amount of debt divided by annual income).
ficonumericThe FICO credit score of the borrower.
days_with_cr_linenumericThe number of days the borrower has had a credit line.
revol_balnumericThe borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
revol_utilnumericThe borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
inq_last_6mthsnumericThe borrower's number of inquiries by creditors in the last 6 months.
delinq_2yrsnumericThe number of times the borrower had been 30+ days past due on a payment in the past 2 years.
pub_recnumericThe borrower's number of derogatory public records.
not_fully_paidnumeric1 if the loan is not fully paid; 0 otherwise.

Now you can start to explore this dataset with the chance to win incredible prices! Can't think of where to start? Try your hand at these suggestions:

  • Extract useful insights and visualize them in the most interesting way possible.
  • Find out how long it takes for users to pay back their loan.
  • Build a model that can predict the probability a user will be able to pay back their loan within a certain period.
  • Find out what kind of people take a loan for what purposes.

Judging Criteria

CATEGORYWEIGHTAGEDETAILS
Analysis30%
  • Documentation on the goal and what was included in the analysis
  • How the question was approached
  • Visualisation tools and techniques utilized
Results30%
  • How the results derived related to the problem chosen
  • The ability to trigger potential further analysis
Creativity40%
  • How "out of the box" the analysis conducted is
  • Whether the publication is properly motivated and adds value
# Adding columns based on existing ones
# Fico score ratings calculated based on information from https://www.myfico.com/credit-education/what-is-a-fico-score

loans_updated <- loans %>%
                 mutate(duration_in_months = 
                              round(case_when(
                                       is.nan(n.period(int_rate/12, -revol_bal, 0, installment , type = 0) ) ~ 0,
                                       TRUE ~ n.period(int_rate/12, -revol_bal, 0, installment , type = 0)              
                                          ), 
                                    0),
                           
                             fico_rating = case_when( fico < 580 ~ "Poor",
                                                      fico > 580 & fico <= 679 ~ "Fair",
                                                      fico > 679 & fico <= 739 ~ "Good",
                                                      fico > 739 & fico <= 799 ~ "Very Good",
                                                      TRUE ~ "Exceptional"
                                                   ),
                        
                           customer_marked_delinq = case_when(delinq_2yrs > 0 ~ 1,
                                                              TRUE ~ 0
                                                              ),
                        
                          fully_paid = case_when(not_fully_paid == 0 ~ "Yes",
                                                 TRUE ~ "No"
                            
                                                  ) %>% as_factor()
                          )
# loans summary by type, fico rating, and paid status

loans_summary <- loans_updated %>%
                                  group_by(purpose, fico_rating)%>%
                                  summarize(purpose_counts = n(),
                                            not_paid = sum(not_fully_paid)
                                            ) %>%
                                                 ungroup() %>%
              
                                  group_by(purpose)%>%
                                  mutate(purpose_total = sum(purpose_counts),
                                         not_paid_total = sum(not_paid)) %>%
                                                                          ungroup()
#Loans purpose and type of borrowers based on fico rating

loans_summary %>%
         distinct(purpose, purpose_counts, fico_rating) %>%
          
                             ggplot(aes(x = reorder(purpose,-purpose_counts), y=purpose_counts, fill=fico_rating)) +
                                         geom_col(  ) +
  
                                                                              
                                         theme_bw() +
                                          
                                         labs(title   = "Loan counts by purpose and fico rating",
                                              caption = "As shown in the chart, majority of borrowers have either a Good or Very Good Fico Rating within each loan purpose." ,
                                                         
                                              
                                              fill    = "Fico rating") +
                                                                       
           
                                         theme(axis.text.x = element_text(angle = 45, hjust = 1, color="black"),
                                               axis.text.y = element_text(color="black"),
                                               axis.title = element_blank(),
                                               plot.caption = element_text(hjust = 0, face="italic", colour = "red")
                                               ) 
  
                                      
                                         
                                         
                                       
                                              
 
# Median time to repay the loan
  
paste("Median time to pay off the revolving credit facility is", loans_updated %>%
        summarize(median_time_to_repay = median(duration_in_months)) %>%
        pull(median_time_to_repay), "months"
       )
# Loan delinquencies

 loans_updated %>%
         select(purpose, customer_marked_delinq) %>%
         group_by(purpose) %>%
         summarise(loan_counts       = n(),
                   customers_marked_delinq   = sum(customer_marked_delinq)
                  ) %>%
                      ungroup() %>%
         mutate(share_delinq_cust =  scales::percent(customers_marked_delinq/loan_counts)) %>%
         
         arrange(desc(share_delinq_cust))
                   
         
# Model to predict probability of paying off the credit facility
set.seed(30002)

split_obj <- rsample::initial_split(loans_updated, prop = 3/4, strata = "days_with_cr_line")

train <-  rsample::training(split_obj)
test  <-  rsample::testing(split_obj)

log_model <- logistic_reg("classification") %>%
                     set_engine("glm") %>%
                     fit(fully_paid ~ ., data = train %>% 
                                                        select(-not_fully_paid, 
                                                                -fico_rating,
                                                                 -purpose,
                                                                 -inq_last_6mths,
                                                                 -int_rate,
                                                                 -pub_rec,
                                                                 -days_with_cr_line,
                                                                 -revol_bal)
                        )

log_model$fit %>%
    broom::tidy()%>%
    arrange(p.value) %>%
    mutate(term = as_factor(term) %>% fct_rev()) %>%
    
    ggplot(aes(x = estimate, y = term)) + 
    geom_point() +
    
    ggrepel::geom_label_repel(aes(label = estimate),
        size = 3
    ) + 
    labs(title = "Model estimates in order of importance with 9 variables")
    

#predictions and confusion matrix

prediction_glm <- log_model %>%
                             predict(new_data = test) %>%
                             bind_cols(test %>% select(fully_paid))

prediction_glm %>%
    conf_mat(fully_paid, .pred_class) %>%        # confusion matrix function from yardstick package
    pluck(1) %>%                                  # it allows us to retrieve values from lists, vectors, matrix  
    as_tibble() %>%
    ggplot(aes(Truth, Prediction, alpha = n)) +
    geom_tile(show.legend = FALSE) + 
    geom_text(aes(label = n), color = "steelblue", alpha = 1, size = 8) +
    theme_bw() +
    labs(title = "Confusion matrix plot")
# Checking model accuracy

prediction_glm %>%
            metrics(fully_paid, .pred_class) %>%
            select(-.estimator) %>%
            filter(.metric == "accuracy")