Nathan Muschinske/

Analyzing Credit Scores with tidymodels in R

Welcome to Analyzing Credit Scores with tidymodels in R!

In this live training, we'll explore what differentiates consumer credit score levels and demonstrate how dimensionality reduction can retain much of the information in a dataset while reducing its size. We'll use the embed and tidymodels packages to build UMAP and decision tree models. We will demonstrate the concept of information by comparing the performance of decision tree models before and after applying UMAP dimensionality reduction.

If you want to learn more about dimensionality reduction and the tidymodels framework, check out the new Dimensionality Reduction in R course.

Let's get started!

Setup Environment

First, we'll load the necessary packages -- tidyverse, tidymodels, and embed (note that we will need to install embed first).

I'm assuming you've used the tidyverse before. If you have not used the tidymodels or embed packages before, here's a quick summary.

  • tidymodels -- next generation of packages that incorporate tidyverse principles into machine learning and modeling.
  • embed -- contains extra recipes steps to create "embeddings" (i.e., encoding predictors).

# install the 'embed' package
install.packages("embed")

# load the needed packages
library(tidyverse)
library(tidymodels)
library(embed)

# set options to enlarge our plots
options(repr.plot.width=12, repr.plot.height=16)
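
To make the package summary above concrete, here is a minimal sketch of how an embed step plugs into a tidymodels recipe. It is an illustration only, not the workflow used later in this training: it runs on R's built-in iris data rather than the credit data, and the choice of two components and the seed value are assumptions made for the example.

```r
library(tidymodels)
library(embed)

# fix the random seed so the embedding is repeatable (arbitrary value)
set.seed(123)

# a recipe that scales the numeric predictors and then replaces them
# with two UMAP components; iris is a stand-in for the credit data
umap_recipe <- recipe(Species ~ ., data = iris) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_umap(all_numeric_predictors(), num_comp = 2)

# estimate the recipe and return the transformed training data
umap_df <- umap_recipe %>%
  prep() %>%
  bake(new_data = NULL)

# four predictors have been reduced to two, plus the outcome column
dim(umap_df)
```

The recipe only describes the steps; nothing is computed until prep() estimates them and bake() applies them.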

Load the Credit Data

The data was adapted from Kaggle's "Credit score classification" data (thanks Rohan Paris!).

We'll load it using read_csv() and take a glimpse of it.

# the credit score data is available here
data_url <- ""

# use read_csv to load the data
credit_df <- read_csv(data_url)

# reorder the credit_score factor levels
credit_df <- credit_df %>% 
  mutate(credit_score = factor(credit_score, levels = c("Poor", "Standard", "Good")))

# look at the available features
glimpse(credit_df)

The data's dimensionality is just its number of columns. credit_df has 23 dimensions, or features -- one target variable (credit_score) and 22 predictor variables.

The target variable -- credit_score -- is categorical and has three levels: Poor, Standard, and Good. So, from a machine learning perspective we'll be dealing with a classification problem.
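
As a quick check of these two ideas, the sketch below counts dimensions and target levels. Because the data URL is not filled in here, it uses a tiny made-up stand-in called toy_df (a hypothetical name with invented values); the same calls should work on credit_df once it is loaded.

```r
library(dplyr)

# toy stand-in for credit_df; the values are made up for illustration
toy_df <- tibble(
  annual_income = c(20000, 35000, 52000, 61000, 88000, 120000),
  age = c(22, 31, 28, 45, 39, 53),
  credit_score = factor(
    c("Poor", "Poor", "Standard", "Standard", "Standard", "Good"),
    levels = c("Poor", "Standard", "Good")
  )
)

# dimensionality = number of columns (predictors plus the target)
n_dims <- ncol(toy_df)
n_predictors <- n_dims - 1

# number of rows in each credit score level, in factor-level order
level_counts <- toy_df %>% count(credit_score)
level_counts
```

Setting the factor levels explicitly is what makes the counts (and later the plot legends) appear in the order Poor, Standard, Good rather than alphabetically.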

Our core objective is to understand what differentiates consumers with poor, standard, and good credit scores. In short, we want to explain why consumers' credit scores differ. Along the way, we'll learn about UMAP (a feature extraction algorithm) and the tidymodels framework.


Let's visually explore credit_df a little and see if we can understand why consumers have different credit scores.

NOTE: As humans we can't visualize high-dimensional data -- we are limited to about three dimensions (maybe four, if you add animation to capture time).

What differentiates consumer credit scores?

Let's generate a few plots to see if we can discover a few predictors that do a good job of separating the credit scores.

Annual income density plot

Let's start by plotting the distribution of annual income for each of the three credit score levels.

# plot annual_income distribution for each credit score level
credit_df %>%  
  ggplot(aes(x = annual_income, color = credit_score)) +
  geom_density() +
  xlim(0, 200000)

Takeaway: Those with lower annual income tend to have poorer credit scores. That means that annual income contains information that helps us determine credit score.
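
A density plot gives a visual impression; a per-level summary puts a number on it. The sketch below is one way to do that, assuming a small invented data frame named toy_income (hypothetical name and values, not the real credit data). Swapping in credit_df should give the corresponding figures for the actual data.

```r
library(dplyr)

# made-up incomes for illustration only; not taken from the credit data
toy_income <- tibble(
  credit_score = factor(
    c("Poor", "Poor", "Standard", "Standard", "Good", "Good"),
    levels = c("Poor", "Standard", "Good")
  ),
  annual_income = c(18000, 26000, 40000, 60000, 75000, 95000)
)

# median annual income and row count for each credit score level
income_by_score <- toy_income %>%
  group_by(credit_score) %>%
  summarise(
    median_income = median(annual_income),
    n = n()
  )

income_by_score
```

The median is used rather than the mean because income distributions are typically right-skewed, which is also why the plot above limits the x-axis.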

Age density plot

Let's explore the age of consumers by creating a density plot of age for each of the credit score levels.

# plot age distribution for each credit_score level
credit_df %>%  
  ggplot(aes(x = age, color = credit_score)) +
  geom_density()

Takeaway: Older consumers tend to have better credit scores. In other words, age also contains some information that is useful for determining credit_score.

Delay from due date vs. credit history months

  • Delay from due date = the average number of days late on payment
  • Credit history months = the number of months of credit history the consumer has on record

Let's explore both of these features using a scatterplot that separates the credit score levels by color.

# plot delay_from_due_date vs credit_history_months 
credit_df %>%  
  ggplot(aes(x = delay_from_due_date, y = credit_history_months , color = credit_score)) +
  geom_jitter(alpha = 0.4)
