# Course Notes: Dimensionality Reduction in R

```r
# Import any packages you want to use here
library(tidyverse)   # dplyr, tidyr, ggplot2
library(tidymodels)  # recipes, rsample
library(corrr)       # correlate(), shave(), rplot()
```

### Unsupervised feature selection methods

• Drop features with many missing values
• Drop features with low variance
• Drop features that are highly correlated with other features

When training a machine learning model, you want a sample large enough that each combination of feature values appears several times, so that every combination can show up at least once in both the training and the testing set. In this example, healthcare_cat_df had eight dimensions whose distinct values multiply out to 6,480 combinations, so it needed a bare minimum of 6,480 observations.

```r
# Calculate the minimum number of value combinations
healthcare_cat_df %>%
  summarise(across(everything(), ~ length(unique(.)))) %>%
  prod()
```
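
To make the arithmetic concrete, here is a minimal base R sketch on a toy data frame. The name toy_cat_df and its values are made up for illustration and are not part of the course data.

```r
# Toy categorical data (hypothetical; stands in for healthcare_cat_df)
toy_cat_df <- data.frame(
  sex    = c("f", "m", "f", "m", "f", "m"),
  smoker = c("yes", "no", "no", "yes", "no", "no"),
  region = c("north", "south", "east", "north", "south", "east")
)

# Number of distinct values in each feature: 2, 2 and 3
levels_per_feature <- vapply(
  toy_cat_df,
  function(x) length(unique(x)),
  integer(1)
)

# The product is the number of possible value combinations: 2 * 2 * 3 = 12
min_combinations <- prod(levels_per_feature)

# TRUE here means the sample cannot contain every combination even once
too_small <- nrow(toy_cat_df) < min_combinations
```

With only six rows against twelve possible combinations, this toy sample is too small, which is the same check the 6,480 figure expresses for healthcare_cat_df.
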
```r
# Create zero-variance filter
zero_var_filter <- house_sales_df %>%
  summarise(across(everything(), ~ var(., na.rm = TRUE))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "variance") %>%
  filter(variance == 0) %>%
  pull(feature)

# Create a missing values filter
# Flag features where more than 80% of the values are missing
n <- nrow(house_sales_df)
na_filter <- house_sales_df %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "feature", values_to = "NA_count") %>%
  filter(NA_count / n > 0.8) %>%
  pull(feature)

# Combine the two filters
low_info_filter <- c(zero_var_filter, na_filter)

# Apply the filter
house_sales_filtered_df <- house_sales_df %>%
  select(-all_of(low_info_filter))
```
```r
# Tidymodels approach
# Create missing values recipe:
# drop predictors with more than 50% missing values
missing_vals_recipe <-
  recipe(price ~ ., data = house_sales_df) %>%
  step_filter_missing(all_predictors(), threshold = 0.5) %>%
  prep()

# Apply recipe to data
filtered_house_sales_df <-
  bake(missing_vals_recipe, new_data = NULL)

# Prepare recipe: remove zero-variance predictors, scale the numeric ones,
# then remove near-zero-variance predictors
low_variance_recipe <-
  recipe(price ~ ., data = house_sales_df) %>%
  step_zv(all_predictors()) %>%
  step_scale(all_numeric_predictors()) %>%
  step_nzv(all_predictors()) %>%
  prep()

# Apply recipe
filtered_house_sales_df <- bake(low_variance_recipe, new_data = NULL)
```

### Selecting based on correlation with other features

```r
# Create a correlation plot
credit_df %>%
  select(where(is.numeric)) %>%
  correlate() %>%
  shave() %>%
  rplot(print_cor = TRUE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Create a recipe using step_corr to remove numeric predictors correlated > 0.7
corr_recipe <-
  recipe(price ~ ., data = house_sales_df) %>%
  step_corr(all_numeric_predictors(), threshold = 0.7) %>%
  prep()

# Apply the recipe to the data
filtered_house_sales_df <-
  corr_recipe %>%
  bake(new_data = NULL)

# Identify the features that were removed
tidy(corr_recipe, number = 1)
```

Notice that step_corr() does not drop every feature involved in a correlation above the threshold. It tries to remove as few features as possible, just enough that the remaining pairwise correlations fall below the threshold.
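
The behaviour can be illustrated with a simplified greedy filter in base R. This is a sketch of the general idea only, not the actual step_corr() implementation; the function drop_correlated() and the toy data are hypothetical.

```r
# Toy data (hypothetical): b is nearly a copy of a, c is unrelated noise
set.seed(42)
a <- rnorm(200)
toy_df <- data.frame(
  a = a,
  b = a + rnorm(200, sd = 0.1),
  c = rnorm(200)
)

# Greedy filter: repeatedly drop one member of the most correlated pair
# until no absolute pairwise correlation exceeds the threshold
drop_correlated <- function(df, threshold = 0.7) {
  removed <- character(0)
  repeat {
    if (ncol(df) < 2) break
    cors <- abs(cor(df))
    diag(cors) <- 0
    if (max(cors) <= threshold) break
    # Locate the most correlated pair of features
    pair <- which(cors == max(cors), arr.ind = TRUE)[1, ]
    candidates <- colnames(cors)[pair]
    # Drop the member with the higher mean correlation to all features
    worst <- candidates[which.max(rowMeans(cors)[candidates])]
    removed <- c(removed, worst)
    df <- df[, setdiff(colnames(df), worst), drop = FALSE]
  }
  list(kept = colnames(df), removed = removed)
}

result <- drop_correlated(toy_df)
```

Although both a and b exceed the 0.7 threshold with each other, only one of them ends up in result$removed; the other stays in result$kept alongside c.
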

### Supervised feature selection

• Entropy (information gain)
• Recursive feature elimination
• Lasso regression
• Random forest models

```r
# Initialize the split
split <- initial_split(attrition_df, prop = 0.8, strata = Attrition)

# Extract training set
train <- split %>% training()

# Extract testing set
test <- split %>% testing()
```
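
As a rough illustration of the first method in the list above, entropy-based information gain measures how much knowing a feature reduces uncertainty about the target. The sketch below uses base R only; the helper functions and the toy values are hypothetical and do not come from attrition_df.

```r
# Shannon entropy (in bits) of a categorical vector
entropy <- function(x) {
  p <- table(x) / length(x)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain: entropy of the target minus the weighted entropy
# of the target within each level of the feature
information_gain <- function(feature, target) {
  weights <- table(feature) / length(feature)
  conditional <- sapply(split(target, feature), entropy)
  entropy(target) - sum(weights[names(conditional)] * conditional)
}

# Toy data (hypothetical values)
toy <- data.frame(
  overtime  = c("yes", "yes", "yes", "yes", "no", "no", "no", "no"),
  dept      = c("a", "b", "a", "b", "a", "b", "a", "b"),
  attrition = c("yes", "yes", "yes", "yes", "no", "no", "no", "no")
)

# overtime separates the target perfectly, dept tells us nothing
gain_overtime <- information_gain(toy$overtime, toy$attrition)
gain_dept     <- information_gain(toy$dept, toy$attrition)
```

In this toy case gain_overtime is 1 bit and gain_dept is 0, so a selection based on information gain would keep overtime and drop dept.
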