
Course Notes: Dimensionality Reduction in R


Feature selection

# Packages used throughout these notes
library(tidyverse)   # dplyr, tidyr, ggplot2
library(tidymodels)  # recipes, rsample
library(corrr)       # correlate(), shave(), rplot()

Unsupervised feature selection methods

  • drop features with many missing values
  • drop features with low variance
  • drop highly correlated features


When training a machine learning model, you want a sample that includes each combination of feature values several times, so that every combination appears at least once in both the training and the testing set. The bare minimum is one observation per possible combination, i.e. the product of each feature's number of unique values. In this example, healthcare_cat_df had eight dimensions and needed a bare minimum of 6,480 observations.

# Calculate the minimum number of value combinations
healthcare_cat_df %>% 
  summarise(across(everything(), ~ length(unique(.)))) %>%  # unique values per feature
  prod()                                                    # product = number of combinations

# Create zero-variance filter
zero_var_filter <- house_sales_df %>% 
  summarise(across(everything(), ~ var(., na.rm = TRUE))) %>% 
  pivot_longer(everything(), names_to = "feature", values_to = "variance") %>% 
  filter(variance == 0) %>% 
  pull(feature)


# Create a missing values filter
n <- nrow(house_sales_df)
na_filter <- house_sales_df %>% 
  summarise(across(everything(), ~ sum(is.na(.)))) %>% 
  pivot_longer(everything(), names_to = "feature", values_to = "NA_count") %>% 
  filter(NA_count / n > 0.8) %>%  # drop features that are more than 80% missing
  pull(feature)

# Combine the two filters
low_info_filter <- c(zero_var_filter, na_filter)

# Apply the filter
house_sales_filtered_df <- house_sales_df %>% 
  select(-all_of(low_info_filter))
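
To confirm which features the filter dropped, you can compare the column names of the original and filtered data frames (a quick check added here, not part of the course code):

# Features removed by the low-information filter
setdiff(names(house_sales_df), names(house_sales_filtered_df))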

# Tidymodels approach
# Create missing values recipe (drop predictors more than 50% missing)
missing_vals_recipe <- 
  recipe(price ~ ., data = house_sales_df) %>% 
  step_filter_missing(all_predictors(), threshold = 0.5) %>% 
  prep()
  
# Apply recipe to data
filtered_house_sales_df <- 
  bake(missing_vals_recipe, new_data = NULL)


# Prepare recipe
low_variance_recipe <- recipe(price ~ ., data = house_sales_df) %>% 
  step_zv(all_predictors()) %>%             # remove zero-variance predictors
  step_scale(all_numeric_predictors()) %>%  # scale so variances are comparable
  step_nzv(all_predictors()) %>% 
  prep()

# Apply recipe
filtered_house_sales_df <- bake(low_variance_recipe, new_data = NULL)

Selecting based on correlation with other features

# Create a correlation plot
credit_df %>% 
  select(where(is.numeric)) %>% 
  correlate() %>% 
  shave() %>% 
  rplot(print_cor = TRUE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Create a recipe using step_corr to remove numeric predictors correlated > 0.7
corr_recipe <-  
  recipe(price ~ ., data = house_sales_df) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.7) %>% 
  prep() 

# Apply the recipe to the data
filtered_house_sales_df <- 
  corr_recipe %>% 
  bake(new_data = NULL)

# Identify the features that were removed
tidy(corr_recipe, number = 1)

Notice that step_corr() removes the minimum number of features needed to bring every pairwise correlation below the threshold, not every feature involved in a high correlation.
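
As a hedged illustration (simulated data; sim_df, x1, x2, x3, and y are hypothetical names, not from the course), a single predictor correlated with two otherwise unrelated features is the only one removed:

# x2 is correlated with both x1 and x3, but x1 and x3 are uncorrelated,
# so dropping x2 alone breaks every high-correlation pair
set.seed(42)
n_sim <- 500
x1 <- rnorm(n_sim)
x3 <- rnorm(n_sim)
x2 <- x1 + x3 + rnorm(n_sim, sd = 0.1)  # cor(x1, x2) and cor(x2, x3) ~ 0.7
y  <- x1 + rnorm(n_sim)
sim_df <- tibble(x1, x2, x3, y)

corr_demo_recipe <- recipe(y ~ ., data = sim_df) %>% 
  step_corr(all_numeric_predictors(), threshold = 0.5) %>% 
  prep()

# x2 should be the only feature listed as removed
tidy(corr_demo_recipe, number = 1)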

Supervised feature selection

  • Entropy (information gain)
  • Recursive feature elimination
  • Lasso regression (sketched after the train/test split below)
  • Random forest models

# Initialize the split
split <- initial_split(attrition_df, prop = 0.8, strata = Attrition)

# Extract training set
train <- split %>% training()

# Extract testing set
test <- split %>% testing()
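
As one worked example from the list above, here is a hedged sketch of lasso-based selection on the training set. The penalty of 0.01 is an arbitrary assumption (in practice you would tune it), the glmnet engine must be installed, and lasso_recipe, lasso_spec, and lasso_fit are names introduced here:

# Lasso (L1-penalized) logistic regression: coefficients shrunk to exactly
# zero flag features that can be dropped
lasso_recipe <- recipe(Attrition ~ ., data = train) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_normalize(all_numeric_predictors())

lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>% 
  set_engine("glmnet")

lasso_fit <- workflow() %>% 
  add_recipe(lasso_recipe) %>% 
  add_model(lasso_spec) %>% 
  fit(data = train)

# Features whose coefficients are non-zero survive the selection
lasso_fit %>% 
  extract_fit_parsnip() %>% 
  tidy() %>% 
  filter(estimate != 0)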