Competition - Loan Data
  • AI Chat
  • Code
  • Report
  • Beta
    Spinner

    Loan Data

    Ready to put your coding skills to the test? Join us for our Workspace Competition!
    For more information, visit datacamp.com/workspacecompetition

    Context

    This dataset (source) consists of data from almost 10,000 borrowers that took loans - with some paid back and others still in progress. It was extracted from lendingclub.com which is an organization that connects borrowers with investors. We've included a few suggested questions at the end of this template to help you get started.

    Load packages

    library(skimr)
    library(tidyverse)
    library(FinCal) 
    library(parsnip) 
    library(rsample)
    library(yardstick)
    
    

    Load your Data

    loans <- readr::read_csv('data/loans.csv.gz')
    skim(loans) %>% 
      select(-(numeric.p0:numeric.p100)) %>%
      select(-(complete_rate))

    Understand your data

    Variableclassdescription
    credit_policynumeric1 if the customer meets the credit underwriting criteria; 0 otherwise.
    purposecharacterThe purpose of the loan.
    int_ratenumericThe interest rate of the loan (more risky borrowers are assigned higher interest rates).
    installmentnumericThe monthly installments owed by the borrower if the loan is funded.
    log_annual_incnumericThe natural log of the self-reported annual income of the borrower.
    dtinumericThe debt-to-income ratio of the borrower (amount of debt divided by annual income).
    ficonumericThe FICO credit score of the borrower.
    days_with_cr_linenumericThe number of days the borrower has had a credit line.
    revol_balnumericThe borrower's revolving balance (amount unpaid at the end of the credit card billing cycle).
    revol_utilnumericThe borrower's revolving line utilization rate (the amount of the credit line used relative to total credit available).
    inq_last_6mthsnumericThe borrower's number of inquiries by creditors in the last 6 months.
    delinq_2yrsnumericThe number of times the borrower had been 30+ days past due on a payment in the past 2 years.
    pub_recnumericThe borrower's number of derogatory public records.
    not_fully_paidnumeric1 if the loan is not fully paid; 0 otherwise.

    Now you can start to explore this dataset with the chance to win incredible prices! Can't think of where to start? Try your hand at these suggestions:

    • Extract useful insights and visualize them in the most interesting way possible.
    • Find out how long it takes for users to pay back their loan.
    • Build a model that can predict the probability a user will be able to pay back their loan within a certain period.
    • Find out what kind of people take a loan for what purposes.

    Judging Criteria

    CATEGORYWEIGHTAGEDETAILS
    Analysis30%
    • Documentation on the goal and what was included in the analysis
    • How the question was approached
    • Visualisation tools and techniques utilized
    Results30%
    • How the results derived related to the problem chosen
    • The ability to trigger potential further analysis
    Creativity40%
    • How "out of the box" the analysis conducted is
    • Whether the publication is properly motivated and adds value
    # Adding columns based on existing ones
    # Fico score ratings calculated based on information from https://www.myfico.com/credit-education/what-is-a-fico-score
    
    loans_updated <- loans %>%
                     mutate(duration_in_months = 
                                  round(case_when(
                                           is.nan(n.period(int_rate/12, -revol_bal, 0, installment , type = 0) ) ~ 0,
                                           TRUE ~ n.period(int_rate/12, -revol_bal, 0, installment , type = 0)              
                                              ), 
                                        0),
                               
                                 fico_rating = case_when( fico < 580 ~ "Poor",
                                                          fico > 580 & fico <= 679 ~ "Fair",
                                                          fico > 679 & fico <= 739 ~ "Good",
                                                          fico > 739 & fico <= 799 ~ "Very Good",
                                                          TRUE ~ "Exceptional"
                                                       ),
                            
                               customer_marked_delinq = case_when(delinq_2yrs > 0 ~ 1,
                                                                  TRUE ~ 0
                                                                  ),
                            
                              fully_paid = case_when(not_fully_paid == 0 ~ "Yes",
                                                     TRUE ~ "No"
                                
                                                      ) %>% as_factor()
                              )
    # loans summary by type, fico rating, and paid status
    
    loans_summary <- loans_updated %>%
                                      group_by(purpose, fico_rating)%>%
                                      summarize(purpose_counts = n(),
                                                not_paid = sum(not_fully_paid)
                                                ) %>%
                                                     ungroup() %>%
                  
                                      group_by(purpose)%>%
                                      mutate(purpose_total = sum(purpose_counts),
                                             not_paid_total = sum(not_paid)) %>%
                                                                              ungroup()
    #Loans purpose and type of borrowers based on fico rating
    
    loans_summary %>%
             distinct(purpose, purpose_counts, fico_rating) %>%
              
                                 ggplot(aes(x = reorder(purpose,-purpose_counts), y=purpose_counts, fill=fico_rating)) +
                                             geom_col(  ) +
      
                                                                                  
                                             theme_bw() +
                                              
                                             labs(title   = "Loan counts by purpose and fico rating",
                                                  caption = "As shown in the chart, majority of borrowers have either a Good or Very Good Fico Rating within each loan purpose." ,
                                                             
                                                  
                                                  fill    = "Fico rating") +
                                                                           
               
                                             theme(axis.text.x = element_text(angle = 45, hjust = 1, color="black"),
                                                   axis.text.y = element_text(color="black"),
                                                   axis.title = element_blank(),
                                                   plot.caption = element_text(hjust = 0, face="italic", colour = "red")
                                                   ) 
      
                                          
                                             
                                             
                                           
                                                  
     
    # Median time to repay the loan
      
    paste("Median time to pay off the revolving credit facility is", loans_updated %>%
            summarize(median_time_to_repay = median(duration_in_months)) %>%
            pull(median_time_to_repay), "months"
           )
    # Loan delinquencies
    
     loans_updated %>%
             select(purpose, customer_marked_delinq) %>%
             group_by(purpose) %>%
             summarise(loan_counts       = n(),
                       customers_marked_delinq   = sum(customer_marked_delinq)
                      ) %>%
                          ungroup() %>%
             mutate(share_delinq_cust =  scales::percent(customers_marked_delinq/loan_counts)) %>%
             
             arrange(desc(share_delinq_cust))
                       
             
    
    # Model to predict probability of paying off the credit facility
    set.seed(30002)
    
    split_obj <- rsample::initial_split(loans_updated, prop = 3/4, strata = "days_with_cr_line")
    
    train <-  rsample::training(split_obj)
    test  <-  rsample::testing(split_obj)
    
    log_model <- logistic_reg("classification") %>%
                         set_engine("glm") %>%
                         fit(fully_paid ~ ., data = train %>% 
                                                            select(-not_fully_paid, 
                                                                    -fico_rating,
                                                                     -purpose,
                                                                     -inq_last_6mths,
                                                                     -int_rate,
                                                                     -pub_rec,
                                                                     -days_with_cr_line,
                                                                     -revol_bal)
                            )
    
    log_model$fit %>%
        broom::tidy()%>%
        arrange(p.value) %>%
        mutate(term = as_factor(term) %>% fct_rev()) %>%
        
        ggplot(aes(x = estimate, y = term)) + 
        geom_point() +
        
        ggrepel::geom_label_repel(aes(label = estimate),
            size = 3
        ) + 
        labs(title = "Model estimates in order of importance with 9 variables")
        
    
    
    #predictions and confusion matrix
    
    prediction_glm <- log_model %>%
                                 predict(new_data = test) %>%
                                 bind_cols(test %>% select(fully_paid))
    
    prediction_glm %>%
        conf_mat(fully_paid, .pred_class) %>%        # confusion matrix function from yardstick package
        pluck(1) %>%                                  # it allows us to retrieve values from lists, vectors, matrix  
        as_tibble() %>%
        ggplot(aes(Truth, Prediction, alpha = n)) +
        geom_tile(show.legend = FALSE) + 
        geom_text(aes(label = n), color = "steelblue", alpha = 1, size = 8) +
        theme_bw() +
        labs(title = "Confusion matrix plot")
    
    # Checking model accuracy
    
    prediction_glm %>%
                metrics(fully_paid, .pred_class) %>%
                select(-.estimator) %>%
                filter(.metric == "accuracy")