Machine Learning Scientist with R Course 1 - 7

    Course 1 Machine Learning in the Tidyverse

    Course 2 Cluster Analysis in R

    Course 3 Machine Learning with caret in R

    Course 4 Modeling with tidymodels in R

    Course 5 Machine Learning with Tree-Based Models in R

    Chapter 1. Classification Trees

    1.1 Tree-based models

    1.1.1 Pros

    1. Easy to explain and understand
    2. Can capture non-linear relationships
    3. Don't require preprocessing such as normalization or standardization of numeric features, or dummy indicator variables
    4. Robust to outliers
    5. Fast for large datasets

    1.1.2 Cons

    1. Hard to interpret and slow to train when large or ensembled
    2. Prone to overfitting: high-variance, complex trees

    1.2 Create a decision tree and commands

    1.2.1 Pick a model class with specifications of engine and mode

    • library(tidymodels)
    • TREE_SPEC <- decision_tree() %>%
      set_engine("rpart")%>% # engine that powers your model
      set_mode("classification") # Classification / regression 1.2.2 Fit the model using training data
    • TREE_MODEL<- TREE_SPEC %>% fit(formula = Y ~ X, data = TRAINING_DATA) 1.2.3 Split data into training and test datasets
    • set.seed(INT)
    • SPLIT_OBJ <- initial_split(FULL_DATA, prop = 0.9, strata = Y) # Specify proportion (default = 0.75); strata = outcome variable to be equally distributed across the splits
    • TRAINING_DATA <- training(SPLIT_OBJ)
    • TEST_DATA <- testing(SPLIT_OBJ)
    • nrow(TRAINING_DATA)/nrow(FULL_DATA) # Verify overall proportion
    • nrow(TEST_DATA)/nrow(FULL_DATA)
    • counts_train <- table(TRAINING_DATA$outcome) # Check imbalance
    • prop_yes_train <- counts_train["yes"]/sum(counts_train)
    • counts_test <- table(TEST_DATA$outcome)
    • prop_yes_test <- counts_test["yes"]/sum(counts_test) # Note: avoid class imbalance, esp. for diseases with rare outcomes

    1.2.4 Predict new data

    • PREDICTIONS <- predict(TREE_MODEL, new_data = TEST_DATA, type = "class")
    • predict(TREE_MODEL, new_data = TEST_DATA, type = "prob")

    1.2.5 Evaluate
    1. Confusion matrix (shows where a model gets confused: the diagonal holds the correct predictions)
    1.1) Values
    • True Positive (TP): prediction is yes, truth is yes
    • True Negative (TN): prediction is no, truth is no
    • False Positive (FP): prediction is yes, truth is no
    • False Negative (FN): prediction is no, truth is yes
    1.2) Metrics [ALL = TP + TN + FP + FN]
    1.2.1) Accuracy = (TP + TN) / ALL

      For further details, see "More model measures" in Chapter 3.

    1.2.2) Sensitivity
    1.2.3) Specificity
    1.2.4) ROC
    1.2.5) AUC

    2) Commands

    • PRED_COMBINED <- PREDICTIONS %>%
      mutate(true_class = TEST_DATA$outcome)
    • conf_mat(data = PRED_COMBINED, estimate = .pred_class, truth = true_class) # from yardstick package
    • accuracy(data = PRED_COMBINED, estimate = .pred_class, truth = true_class)

    Chapter 2. Regression Trees and Cross-Validation

    2.1 Construct a regression tree with hyperparameters

    • TREE_SPEC <- decision_tree(
        min_n = NUM_1,            # min no of data points in a node for a further split (default = 20)
        tree_depth = NUM_2,       # max depth of the tree (default = 30)
        cost_complexity = NUM_3   # penalty for complexity (default = 0.01)
      ) %>%
      set_engine("rpart") %>%
      set_mode("regression")
    • MODEL <- TREE_SPEC %>%
      fit(formula = Y ~ X, data = TRAINING_DATA)

    2.2 Predict using the regression tree

    • PREDICTIONS <- predict(MODEL, new_data = TEST_DATA)

    2.3 Evaluate the regression tree using metrics [how far from the truth]

    Out-of-sample performance (train on the training set, evaluate on the test set) vs in-sample performance (evaluate within the training set)

    3.1) Mean Absolute Error (MAE) = mean(abs(actual - predicted))

    • mae(PREDICTIONS, estimate = .pred, truth = Y)

    3.2) Root Mean Squared Error (RMSE) = sqrt(mean((actual - predicted)^2)); large errors get a higher weight

    • rmse(PREDICTIONS, estimate = .pred, truth = Y)
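    As a sanity check on what mae() and rmse() report, here is a minimal sketch of both calculations done by hand, assuming a tibble PREDICTIONS that holds the truth column Y and the prediction column .pred (placeholder names as above):

        errors <- PREDICTIONS$Y - PREDICTIONS$.pred
        mean(abs(errors))     # MAE: mean absolute error
        sqrt(mean(errors^2))  # RMSE: root mean squared error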

    2.4 Cross validation: give more robust out-of-sample performance

    4.1) Set seed and folds

    • set.seed(INT) # Set seed for reproducibility
    • DATA_FOLDS <- vfold_cv(TRAINING_DATA, v = INT) # Specify no of folds

    4.2) Fit the folds

    • FIT_CV <- fit_resamples(TREE_SPEC,
      Y ~ X,
      resamples = DATA_FOLDS,
      metrics = metric_set(mae, rmse))

    4.3) Collect all metrics (errors)

    • ALL_ERRORS <- collect_metrics(FIT_CV, summarize = FALSE) # If left at the default (TRUE), the mean of each metric is calculated

    4.4) Visualize the errors

    • ggplot(ALL_ERRORS, aes(.estimate, fill = .metric)) + geom_histogram()

    2.5 Note: Bias-variance tradeoff

    1. Simple vs complex model, e.g. more tree_depth -> more complex
    • simple_spec <- decision_tree(tree_depth = 2) %>%
      set_mode("regression")
    • complex_spec <- decision_tree(tree_depth = 15) %>%
      set_mode("regression")
    2. Overfitting = high variance (low training error but high test error) vs underfitting = high bias (high error on both training and test sets)
    • bind_rows(training = mae(train_results, estimate = .pred, truth = final_grade),
      test = mae(test_results, estimate = .pred, truth = final_grade),
      .id = "dataset")
      Note: The best model sits around the sweet spot between bias and variance
    3. Detecting overfitting and underfitting
    3.1) Overfitting -> reduce complexity (hyperparameter tuning)
    • collect_metrics(FIT_CV) # Out-of-sample/CV error (a high CV error indicates high variance)
    • mae(TRAINING_PRED, estimate = .pred, truth = final_grade) # In-sample error (small MAE)
    3.2) Underfitting -> increase complexity (hyperparameter tuning)
    • mae(TRAINING_PRED, estimate = .pred, truth = Y) # In-sample error (high MAE)

    Chapter 3. Hyperparameters and Ensemble Models

    3.1 Tuning hyperparameters

    3.1.1 Hyperparameters

    1. Characteristics = influence shape and complexity of trees
    2. Hyperparameters in the parsnip decision tree
    • min_n # minimum no of samples required to further split a node
    • tree_depth # max allowed depth of the tree
    • cost_complexity # penalizes the size of the tree

    3.1.2 Goal of tuning = finding the optimal set of hyperparameter values

    3.1.3 Steps of tuning
    1. Create placeholders: tune()
    • SPEC_UNTUNED <- decision_tree(
      min_n = tune(),
      tree_depth = tune() )%>%
      set_engine("rpart")%>%
      set_mode("classification")
    2. Create a tuning grid: grid_regular()
    • TREE_GRID <- grid_regular(
      parameters(SPEC_UNTUNED),
      levels = 3) # No of grid points for each parameter
    3. Tune the grid: tune_grid()
    3.1) Build a model for every grid point and evaluate each model out-of-sample (cross-validation, CV)
    • TUNE_RESULTS <- tune_grid(SPEC_UNTUNED,
      Y ~ X,
      resamples = DATA_FOLDS,
      grid = TREE_GRID,
      metrics = metric_set(accuracy))
    3.2) Visualize the tuning results
    • autoplot(TUNE_RESULTS)
    4. Use the best parameters: finalize_model()
    4.1) Select the best performing parameters
    • FINAL_PARAMS <- select_best(TUNE_RESULTS)
    4.2) Plug them into the specification
    • BEST_SPEC <- finalize_model(SPEC_UNTUNED, FINAL_PARAMS)
    5. Build the final model
    • FINAL_MODEL <- fit(BEST_SPEC, Y ~ X, data = FULL_DATA)

    3.2 More model measures

    3.2.1 Sensitivity = [TP / (TP+FN)]

    • sens(PREDICTIONS, estimate = .pred_class, truth = true_class) # PREDICTIONS = a tibble with .pred_class and true_class # Same arguments as accuracy() and conf_mat()

    3.2.2 Specificity = [TN / (TN+FP)]

    • spec(PREDICTIONS, estimate = .pred_class, truth = true_class)

    3.2.3 Receiver operating characteristic (ROC) curve # Sensitivity (true positive rate) vs 1 - specificity (false positive rate)

    1. Add probabilities on the test data # Classification uses a constant threshold, so all thresholds need to be examined
    • PREDICTIONS <- predict(MODEL, TEST_DATA, type = "prob") %>% bind_cols(TEST_DATA)
    2. Calculate the ROC curve for all thresholds
    • ROC <- roc_curve(PREDICTIONS, estimate = .pred_yes, truth = Y)
    3. Plot the ROC curve
    • autoplot(ROC)

    3.2.4 Area under (the ROC) curve (AUC)

    1. Calculate the AUC # only the function changes
    • roc_auc(PREDICTIONS, estimate = .pred_yes, truth = Y)
    2. Interpretation: 0 = every example incorrectly classified; 0.5 = no better than random chance; 1 = all examples correct for every threshold (perfect model), i.e. 100% accuracy with 0% false positive rate

    3.3 Bagged trees

    3.3.1 Background

    1. High variance among trees -> many heads are better than one (bagging)
    2. Bagging = short for Bootstrap Aggregation (a minimal manual sketch of the idea follows at the end of this section)

    3.3.2 Steps

    1. Bootstrap and train (random sampling with replacement; build K models on K samples in parallel; larger K = more time needed)
    1.1) Specify the bagged trees
    • library(baguette)
    • SPEC_BAGGED <- bag_tree() %>% set_mode("classification") %>% set_engine("rpart", times = INT) # Set K via times

    1.2) Train all trees

    • MODEL_BAGGED <- fit(SPEC_BAGGED, formula = Y ~ X, data = TRAINING_DATA) # Note: the higher the std.error for a variable, the more important it is
    2. Aggregate (generate predictions from each tree and combine them into an ensemble prediction; reduces variance among trees, i.e. higher performance; combined by averaging (regression) or majority vote (classification))
    • PREDICTIONS <- predict(MODEL_BAGGED, TEST_DATA, type = "prob") %>% bind_cols(TEST_DATA)
    • ROC <- roc_curve(PREDICTIONS, estimate = .pred_yes, truth = Y)
    • roc_auc(PREDICTIONS, estimate = .pred_yes, truth = Y)
    3. Check for overfitting using out-of-sample evaluation
    3.1) Estimate AUC using cross-validation (esp. for a model with high in-sample AUC)
    • set.seed(INT)
    • CV_RESULTS <- fit_resamples(SPEC_BAGGED, Y ~ X, resamples = vfold_cv(TRAINING_DATA, v = INT), metrics = metric_set(roc_auc))
    3.2) Collect metrics
    • collect_metrics(CV_RESULTS)
    3.3) Interpret: if the out-of-sample (CV) mean AUC is lower than the in-sample AUC, the model may be overfitting -> add more penalty (tune cost_complexity)
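    A minimal manual sketch of the bagging idea referenced above (bootstrap resamples, one rpart tree per resample, majority vote), assuming a training set TRAIN_DATA with a factor outcome Y and a test set TEST_DATA; baguette does all of this internally:

        library(rpart)

        set.seed(1)
        K <- 25  # number of bootstrap samples / trees

        # Bootstrap and train: one tree per resample drawn with replacement
        models <- lapply(seq_len(K), function(i) {
          idx <- sample(nrow(TRAIN_DATA), replace = TRUE)
          rpart(Y ~ ., data = TRAIN_DATA[idx, ], method = "class")
        })

        # Aggregate: majority vote across the K trees for each test row
        votes <- sapply(models, function(m) as.character(predict(m, newdata = TEST_DATA, type = "class")))
        majority_vote <- apply(votes, 1, function(v) names(which.max(table(v))))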

    3.4 Random forest (improved version of bagged trees)

    3.4.1 Characteristics

    1. Suitable for high-dimensional data
    2. Easy to use
    3. Out-of-the-box performance
    4. Available in many packages: ranger, randomForest; the tidymodels interface to these packages is rand_forest() in the parsnip package

    3.4.2 Concept

    1. Identical to bagging: train trees on bootstrap samples
    2. Difference = random subsets of predictors across trees -> each tree sees the dataset through a random selection of features -> majority vote (classification) -> final class

    3.4.3 Steps

    1. Specify a random forest model through hyperparameters
    • RAND_SPEC <- rand_forest(
        mtry = INT,     # Predictors seen at each split (default = sqrt(no. of predictors))
        trees = INT,    # No of trees in the forest
        min_n = 10) %>% # smallest node size allowed
      set_mode("classification") %>% # Set the mode
      set_engine("ranger") # Use engine ranger or randomForest
    2. Train the forest
    • RAND_SPEC %>% fit(Y ~ X, data = TRAINING_DATA)
    3. To see variable importance, use:
    • rand_forest(mode = "classification") %>%
      set_engine("ranger", importance = "impurity") %>% # Add impurity-based importance
      fit(still_customer ~ ., data = customers_train) %>%
      vip::vip() # Plot variable importance

    Chapter 4. Boosted Trees

    4.1 Introduction to boosting

    4.1.1 Definition = Improving model based on the previous model(s)

    4.1.2 Adaptive boosting (Adaboost)

    1. Concept: increase the weight of wrongly classified training examples in subsequent training rounds
    2. How it works:
    2.1) Start by training a decision tree with equal weight for each observation
    2.2) Increase the weight of observations that are hard to classify and lower the weight of easily classified observations
    3. Further improved by adding gradient descent
    4. Overall steps and commands
    4.1) Specify the spec
    • BOOST_SPEC <- boost_tree() %>% # Specify the model class
      set_mode("classification") %>% # Set the mode
      set_engine("xgboost") # Set the engine
    4.2) Fit the model
    • BOOST_MODEL <- fit(BOOST_SPEC, formula = Y ~ X, data = TRAINING_DATA)
    4.3) Evaluate the ensemble
    • set.seed(INT)
    • DATA_FOLDS <- vfold_cv(TRAINING_DATA, v = INT)
    • CV_RESULTS <- fit_resamples(BOOST_SPEC, Y ~ X, resamples = DATA_FOLDS, metrics = metric_set(roc_auc))

    4.2 Gradient boosting

    4.2.1 Adaboost

    1. Uses "decision stumps" as weak learners
    2. Attaches weights to observations: high weight for difficult observations, low weight for correct predictions (a minimal sketch of the weight update follows below)
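    A minimal sketch of the AdaBoost weight update step only (not a full implementation), assuming a logical vector missed marking which training observations the current weak learner misclassified; the values are illustrative:

        n <- 10
        w <- rep(1 / n, n)                           # start with equal weights
        missed <- c(TRUE, FALSE, TRUE, rep(FALSE, 7))

        err   <- sum(w[missed])                      # weighted error of the weak learner
        alpha <- 0.5 * log((1 - err) / err)          # the learner's say in the final vote
        w <- w * exp(ifelse(missed, alpha, -alpha))  # up-weight mistakes, down-weight correct ones
        w <- w / sum(w)                              # renormalize for the next round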

    4.2.2 Gradient boosting

    1. Uses "small decision trees" as weak learners (start w/ less complex trees)
    2. Lose function instead of weights (arrow function)
    3. Loss function optimization by gradient descent
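    A minimal sketch of gradient boosting for regression with squared-error loss: each shallow tree is fit to the current residuals (the negative gradient) and added with a small learning rate. rpart is used directly here, and chocolate_train / final_grade are illustrative names borrowed from the exercises:

        library(rpart)

        learn_rate <- 0.1
        n_trees    <- 50
        pred  <- rep(mean(chocolate_train$final_grade), nrow(chocolate_train)) # start from the mean
        trees <- vector("list", n_trees)
        boost_df <- chocolate_train

        for (b in seq_len(n_trees)) {
          boost_df$resid <- chocolate_train$final_grade - pred   # residuals = negative gradient of squared-error loss
          fit  <- rpart(resid ~ . - final_grade, data = boost_df,
                        control = rpart.control(maxdepth = 2))   # a small, weak tree
          pred <- pred + learn_rate * predict(fit, boost_df)     # take a small step towards the targets
          trees[[b]] <- fit
        }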

    4.2.3 Pros of boosting

    1. Among the best-performing machine learning models
    2. Good option for unbalanced data

    4.2.4 Cons of boosting

    1. Prone to overfitting
    2. Training can be slow (depending on learning rate hyperparameters)
    3. Many tuning hyperparameters

    4.2.5 Hyperparameters for gradient boosting

    1. min_n: (from the simple decision tree) min no of observations required for a further node split
    2. tree_depth: (from the simple decision tree) max depth of the tree / no of splits
    3. sample_size: (from random forests) amount of data exposed to the fitting routine
    4. trees: (from random forests) no of trees in the ensemble
    5. mtry: (from random forests) no of predictors randomly sampled at each split
    6. learn_rate: (from boosted trees) rate at which the boosting algorithm adapts from iteration to iteration
    7. loss_reduction: (from boosted trees) reduction in the loss function required to split further
    8. stop_iter: (from boosted trees) no of iterations without improvement before stopping (a spec setting all of these follows below)
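    A sketch of a boost_tree() specification that sets each hyperparameter above explicitly; the values are illustrative, not recommendations:

        library(tidymodels)

        boost_spec_full <- boost_tree(
          min_n          = 10,    # min observations in a node for a further split
          tree_depth     = 4,     # max depth of each tree
          sample_size    = 0.8,   # share of the training data exposed to each iteration
          trees          = 500,   # number of trees in the ensemble
          mtry           = 5,     # predictors randomly sampled at each split
          learn_rate     = 0.05,  # how strongly each iteration adapts to the previous one
          loss_reduction = 0.01,  # loss reduction required to split further
          stop_iter      = 10     # iterations without improvement before stopping early
        ) %>%
          set_mode("classification") %>%
          set_engine("xgboost")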

    4.3 Optimization of the boosted ensemble

    4.3.1 Tuning workflow and their commands

    1. Create the tuning spec: tune()
    • BOOST_SPEC <- boost_tree(
        trees = INT,          # No of trees
        learn_rate = tune(),
        tree_depth = tune(),
        sample_size = tune()) %>%
      set_mode("classification") %>%
      set_engine("xgboost")
    2. Create the tuning grid: grid_regular() OR grid_random()
    2.1) With grid_regular()
    • TUNEGRID_BOOST <- grid_regular(parameters(BOOST_SPEC), levels = INT) # Specify levels; no of combinations = levels ^ (no of tuned hyperparameters)
    2.2) With grid_random() # not an evenly spread grid, but random combinations
    • TUNEGRID_BOOST <- grid_random(parameters(BOOST_SPEC), size = INT) # Specify size (no of combinations)
    3. Split data: vfold_cv()
    • DATA_FOLDS <- vfold_cv(TRAINING_DATA, v = INT)
    4. Tune along the grid: tune_grid()
    • TUNE_RESULTS <- tune_grid(BOOST_SPEC, Y ~ X, resamples = DATA_FOLDS, grid = TUNEGRID_BOOST, metrics = metric_set(roc_auc))

    • autoplot(TUNE_RESULTS)

    5. Finalize the model: select_best() and finalize_model()
    5.1) Select the final hyperparameters
    • BEST_PARAMS <- select_best(TUNE_RESULTS)
    5.2) Finalize the specification
    • FINAL_SPEC <- finalize_model(BOOST_SPEC, BEST_PARAMS)
    6. Train the final model: fit()
    • FINAL_MODEL <- FINAL_SPEC %>% fit(formula = Y ~ X, data = TRAINING_DATA)

    4.4 Model comparison

    4.4.1 Steps of comparing models and their commands

    1. Combine predictions
    • PRED_COMBINED <- bind_cols(decision_tree, bagged_trees, random_forest, boosted_trees, TEST_DATA %>% select(Y))
    2. Calculate AUC
    2.1) For the decision tree
    • AUC_TREE <- roc_auc(PRED_COMBINED, truth = Y, estimate = PREDS_TREE)
    2.2) For the bagged trees
    • AUC_BAGGED <- roc_auc(PRED_COMBINED, truth = Y, estimate = PREDS_BAGGING)
    2.3) For the random forest
    • AUC_FOREST <- roc_auc(PRED_COMBINED, truth = Y, estimate = PREDS_FOREST)
    2.4) For the boosted trees
    • AUC_BOOST <- roc_auc(PRED_COMBINED, truth = Y, estimate = PREDS_BOOSTING)
    3. Combine all AUCs # Specify .id = "model"
    • bind_rows(AUC_TREE, AUC_BAGGED, AUC_FOREST, AUC_BOOST, .id = "model")
    4. Reshape the predictions into long format
    • PREDICTIONS_LONG <- tidyr::pivot_longer(PRED_COMBINED, cols = starts_with("preds_"), names_to = "model", values_to = "predictions")
    5. Calculate cutoff values
    • CUTOFFS <- PREDICTIONS_LONG %>% group_by(model) %>% roc_curve(truth = Y, estimate = predictions)
    6. Plot ROC curves
    • autoplot(CUTOFFS)

    Exercise

    1. Specify that tree

    #Load the package
    • library(tidymodels) #Pick a model class
    • tree_spec <- decision_tree() %>% set_engine("rpart") %>% #Set the engine set_mode("classification") # Set the mode

    2. Train that model

    #Create the specification
    • tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification") #Train the model
    • tree_model_bmi <- tree_spec %>% fit(formula = outcome ~ bmi, data = diabetes) # Outcome = yes/no

    3. Train/test split

    #Create the split
    • diabetes_split <- initial_split(diabetes, prop = 0.8) #Extract the training and test set
    • diabetes_train <- training(diabetes_split)
    • diabetes_test <- testing(diabetes_split) #Verify the proportions of both sets
    • round(nrow(diabetes_train) / nrow(diabetes), 2) == 0.80
    • round(nrow(diabetes_test) / nrow(diabetes), 2) == 0.20

    4. Avoiding class imbalances

    #Preparation
    • set.seed(9888)
    • diabetes_split <- initial_split(diabetes, prop = 0.75) #Proportion of 'yes' outcomes in the training data
    • counts_train <- table(training(diabetes_split)$outcome)
    • prop_yes_train <- counts_train["yes"] / sum(counts_train) #Proportion of 'yes' outcomes in the test data
    • counts_test <- table(testing(diabetes_split)$outcome)
    • prop_yes_test <- counts_test["yes"] / sum(counts_test)
    • paste("Proportion of positive outcomes in training set:", round(prop_yes_train, 2))
    • paste("Proportion of positive outcomes in test set:", round(prop_yes_test, 2)) #Create a split with a constant outcome distribution [specify strata]
    • diabetes_split <- initial_split(diabetes, prop = 0.75, strata = outcome)
    • counts_train <- table(training(diabetes_split)$outcome)
    • prop_yes_train <- counts_train["yes"] / sum(counts_train)
    • counts_test <- table(testing(diabetes_split)$outcome)
    • prop_yes_test <- counts_test["yes"] / sum(counts_test)
    • paste("Proportion of positive outcomes in training set:", round(prop_yes_train, 2))
    • paste("Proportion of positive outcomes in test set:", round(prop_yes_test, 2))

    5. From zero to hero

    • set.seed(9) #Create the balanced data split
    • diabetes_split <- initial_split(diabetes, prop = 0.75, strata = outcome) #Build the specification of the model
    • tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification") #Train the model
    • model_trained <- tree_spec %>%
    • fit(formula = outcome ~ bmi + skin_thickness, data = training(diabetes_split))

    6. Make predictions

    #Train your model

    • model <- tree_spec %>% fit(formula = outcome ~ ., data = diabetes_train) #Generate predictions

    • predictions <- predict(model, new_data = diabetes_test, type ="class") #Add the true outcomes
    • predictions_combined <- predictions %>% mutate(true_class = diabetes_test$outcome)

    7. Crack the matrix

    #The confusion matrix
    • diabetes_matrix <- conf_mat(data = predictions_combined, estimate = .pred_class, truth = true_class)

    8. Are you predicting correctly?

    #Calculate the number of correctly predicted classes
    • correct_predictions <- 75 + 34 #Calculate the number of all predicted classes

    • all_predictions <- 75 + 34 + 18 + 27 #Calculate and print the accuracy

    • acc_manual <- correct_predictions / all_predictions

      #The accuracy calculated by a function

    • acc_auto <- accuracy(predictions_combined, estimate = .pred_class, truth = true_class)

    • acc_auto$.estimate

    9. Train a regression tree

    • library(tidymodels) #Build the specification
    • model_spec <- decision_tree() %>% set_mode("regression") %>% set_engine("rpart") #Fit to the data
    • model_fit <- model_spec %>% fit(formula = final_grade ~ cocoa_percent + review_date , data = chocolate_train)

    10. Predict new values

    #Train the model
    • chocolate_model <- model_spec %>% fit(formula = final_grade ~., data = chocolate_train) #Predict new data
    • predictions <- predict(chocolate_model, new_data = chocolate_test) %>% #Add the test set bind_cols(chocolate_test)

    11. In-sample performance

    #Predict using the training set
    • in_sample_predictions <- predict(model, new_data = chocolate_train) #Calculate the vector of absolute differences
    • abs_diffs <- abs(chocolate_train$final_grade - in_sample_predictions$.pred) #Calculate the mean absolute error
    • 1 / length(abs_diffs) * sum(abs_diffs)

    12. Out-of-sample performance

    #Predict ratings on test set and add true grades
    • test_enriched <- predict(model, new_data = chocolate_test) %>% bind_cols(chocolate_test) #Compute the mean absolute error using one single function
    • mae(test_enriched, estimate = .pred, truth = final_grade)

    13. Bigger mistakes, bigger penalty

    #Calculate the squared differences
    • squared_diffs <- (test_enriched$final_grade - test_enriched$.pred)^2 #Compute the RMSE using the formula
    • rmse_manual <- sqrt(1 / length(squared_diffs) * sum(squared_diffs)) #Compute the RMSE using a function
    • rmse_auto <- rmse(test_enriched, estimate = .pred, truth = final_grade)

    14. Create the folds

    #Set seed for reproducibility
    • set.seed(20) #Build 10 folds
    • chocolate_folds <- vfold_cv(chocolate_train, v = 10)

    15. Fit the folds

    #Create a specification
    • tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("regression") #Fit all folds to the specification
    • fits_cv <- fit_resamples(tree_spec, final_grade ~ ., resamples = chocolate_folds, metrics = metric_set(mae, rmse))

    16. Evaluate the folds

    • library(ggplot2) #Collect the errors
    • all_errors <- collect_metrics(fits_cv, summarize = FALSE) #Plot an error histogram
    • ggplot(all_errors, aes(.estimate, fill = .metric)) + geom_histogram() #Collect and print error statistics
    • collect_metrics(fits_cv, summarize = TRUE)

    17. Adjust model complexity

    #Create a model that can grow arbitrarily complex
    • chocolate_model <- decision_tree(min_n = 2, cost_complexity = 0) %>% set_mode("regression") %>% set_engine("rpart") %>% fit(final_grade ~ ., data = chocolate_train)

    18. In-sample and out-of-sample performance

    #Predict on and combine with training data and calculate the error [In-of-sample]
    • predict(complex_model, new_data = chocolate_train) %>% bind_cols(chocolate_train) %>% mae(estimate = .pred, truth = final_grade) #Predict on and combine with test data and calculate the error [Out-of-sample]
    • predict(complex_model, new_data = chocolate_test) %>% bind_cols(chocolate_test) %>% mae(estimate = .pred, truth = final_grade)

    19. Generate a tuning grid

    #Create a specification with tuning placeholders
    • tune_spec <- decision_tree(tree_depth = tune(), cost_complexity = tune()) %>% set_mode("classification") %>% set_engine("rpart") #Create a regular grid
    • tree_grid <- grid_regular(parameters(tune_spec), levels = 2)

    20. Tune along the grid

    • set.seed(275) #Create CV folds of the customers tibble
    • folds <- vfold_cv(customers, v = 3) #Tune along the grid
    • tune_results <- tune_grid(tune_spec, still_customer ~ ., resamples = folds, grid = tree_grid, metrics = metric_set(accuracy)) #Plot the tuning results
    • autoplot(tune_results)

    21. Pick the winner

    #Select the parameters that perform best
    • final_params <- select_best(tune_results) #Finalize the specification
    • best_spec <- finalize_model(tune_spec, final_params) #Build the final model
    • final_model <- fit(best_spec, still_customer ~ ., data = customers)

    22. Calculate specificity

    #Create CV folds of the training data
    • folds <- vfold_cv(customers_train, v = 3) #Calculate CV specificity
    • specificities <- fit_resamples(tree_spec, still_customer ~ ., resamples = folds, metrics = metric_set(specificity)) #Collect the metrics
    • collect_metrics(specificities)

    23. Draw the ROC curve

    #Predict probabilities on test set
    • predictions <- predict(model, customers_test, type = "prob") %>% bind_cols(customers_test) #Calculate the ROC curve for all thresholds
    • roc <- roc_curve(predictions, estimate = .pred_yes, truth = still_customer) #Plot the ROC curve
    • autoplot(roc)

    24. Area under the ROC curve

    #Calculate area under the curve
    • auc_result <- roc_auc(predictions, estimate = .pred_yes, truth = still_customer)
    • print(paste("The area under the ROC curve is", round(auc_result$.estimate, 3)))

    25. Create bagged trees

    #Create the specification
    • library(baguette)
    • spec_bagged <- bag_tree() %>% set_mode("classification") %>% set_engine("rpart", times = 20) #Fit to the training data
    • model_bagged <- fit(spec_bagged, still_customer ~ total_trans_amt + customer_age + education_level, data = customers_train)

    26. In-sample ROC and AUC

    #Predict on training set and add to training set
    • predictions <- predict(model_bagged, new_data = customers_train, type = "prob") %>% bind_cols(customers_train) #Create and plot the ROC curve
    • roc_curve(predictions, estimate = .pred_yes, truth = still_customer) %>% autoplot() #Calculate the AUC
    • roc_auc(predictions, estimate = .pred_yes, truth = still_customer)

    27. Check for overfitting

    • set.seed(55) #Estimate AUC using cross-validation
    • cv_results <- fit_resamples(spec_bagged, still_customer ~ total_trans_amt + customer_age + education_level, resamples = vfold_cv(customers_train, v = 3), metrics = metric_set(roc_auc)) #Collect metrics
    • collect_metrics(cv_results)

    28. Variable importance

    #Specify a random forest
    • spec <- rand_forest() %>% set_mode("classification") %>% set_engine("ranger", importance = "impurity") #Train the forest
    • model <- spec %>% fit(still_customer ~ ., data = customers_train) #Plot the variable importance
    • vip::vip(model)

    29. Specify a boosted ensemble

    #Specify the model class
    • boost_spec <- boost_tree() %>% #Set the mode set_mode("classification") %>% #Set the engine set_engine("xgboost")

    30. Train a boosted ensemble

    • boost_model <- fit(boost_spec, still_customer ~ ., data = customers_train)

    31. Evaluate the ensemble

    • set.seed(99) #Create CV folds
    • folds <- vfold_cv(customers_train, v = 5) #Fit and evaluate models for all folds
    • cv_results <- fit_resamples(boost_spec, still_customer ~ ., resamples = folds, metrics = metric_set(roc_auc)) #Collect cross-validated metrics
    • collect_metrics(cv_results)

    32. Compare to a single classifier

    1. Boosted tree
    • set.seed(100)
    • predictions <- boost_tree() %>% set_mode("classification") %>% set_engine("xgboost") %>% fit(still_customer ~ ., data = customers_train) %>% predict(new_data = customers_train, type = "prob") %>% bind_cols(customers_train)
    • roc_auc(predictions , truth = still_customer, estimate = .pred_yes)
    1. Simple tree
    • set.seed(100)
    • predictions <- decision_tree() %>% # Change function set_mode("classification") %>% set_engine("rpart") %>% # Change engine fit(still_customer ~ ., data = customers_train) %>% predict(new_data = customers_train, type = "prob") %>% bind_cols(customers_train)
    • roc_auc(predictions, truth = still_customer, estimate = .pred_yes)

    33. Tuning preparation

    #Create the specification with placeholders

    • boost_spec <- boost_tree( trees = 500, learn_rate = tune(), tree_depth = tune(), sample_size = tune()) %>% set_mode("classification") %>% set_engine("xgboost") #Create the tuning grid
    • tunegrid_boost <- grid_regular(parameters(boost_spec), levels = 3)

    34. The actual tuning

    #Create CV folds of training data
    • folds <- vfold_cv(customers_train, v = 6) #Tune along the grid
    • tune_results <- tune_grid(boost_spec, still_customer ~ ., resamples = folds, grid = tunegrid_boost, metrics = metric_set(roc_auc)) #Plot the results
    • autoplot(tune_results)

    35. Finalize the model

    #Select the final hyperparameters

    • best_params <- select_best(tune_results) #Finalize the specification
    • final_spec <- finalize_model(boost_spec, best_params) #Train the final model on the full training data

    • final_model <- final_spec %>% fit( still_customer ~ ., data = customers_train)

    36. Compare AUC

    #Calculate the AUC for each model
    • auc_tree <- roc_auc(preds_combined, truth = still_customer, estimate = preds_tree)
    • auc_bagged <- roc_auc(preds_combined, truth = still_customer, estimate = preds_bagging)
    • auc_forest <- roc_auc(preds_combined, truth = still_customer, estimate = preds_forest)
    • auc_boost <- roc_auc(preds_combined, truth = still_customer, estimate = preds_boosting) #Combine AUCs into one tibble
    • combined <- bind_rows(decision_tree = auc_tree, bagged_tree = auc_bagged, random_forest = auc_forest, boosted_tree = auc_boost, .id = "model")

    37. Plot ROC curves

    #Reshape the predictions into long format
    • predictions_long <- tidyr::pivot_longer(preds_combined, cols = starts_with("preds_"), names_to = "model", values_to = "predictions")
    • predictions_long %>% group_by(model) %>% #Group by model, then calculate values for every cutoff
      roc_curve(truth = still_customer, estimate = predictions) %>%
      autoplot() #Create a plot from the calculated data

    Course 6 Support Vector Machines (SVMs) in R

    Chapter 1. Introduction

    1.1 Maximum margin separator

    Definition

    1. Separable = a dataset in which the classes do not overlap
    2. Decision boundary = a separating boundary between the classes
    3. Maximum margin separator = the best decision boundary, the one that maximizes the margin

    Concept

    Maximal margin separator lies halfway between the two clusters that you want to divide

    Command: Visualize the maximal margin separator

    1. Create data frame with maximal margin separator
    • MM_SEP <- data.frame(sep = c((SEP_CLUS_1 + SEP_CLUS_2) / 2)) # SEP_CLUS_1 = extreme point for lower cluster, SEP_CLUS_2 = extreme point for higher cluster

    2. Add the maximal margin separator to the previous plot

    • PLOT <- PLOT + geom_point(data = MM_SEP, aes(sep, 0),color = "blue",size = 4)

    1.2 Generating a linearly separable dataset

    Concept

    1. Create a dataset (two variables) that can be used to illustrate SVMs
    2. Create a linear decision boundary that seperates the two variable in the dataset

    Steps and commands

    1. Generate a two-dimensional dataset
    1.1) Set the required number of data points
    • N <- 200
    1.2) Set a seed to ensure reproducibility
    • set.seed(INT)
    1.3) Generate a dataframe with two predictors X1 and X2 in (0, 1)
    • DF <- data.frame(X1 = runif(N), X2 = runif(N))
    2. Create two classes (-1 or +1), separated by a straight-line decision boundary
    • DF$Y <- factor(ifelse(DF$X1 - DF$X2 > 0, -1, 1), levels = c(-1, 1))
    3. Visualize using ggplot
    • PLOT <- ggplot(data = DF, aes(x = X1, y = X2, color = Y)) + geom_point() + scale_color_manual(values = c("-1" = "red", "1" = "blue")) + # Distinguish by color
      geom_abline(slope = 1, intercept = 0) # Decision boundary is the line X1 = X2, passing through (0,0) with slope = 1
    4. Add a margin
    4.1) Remove points that lie close to the boundary
    • MARGIN <- 0.05 # Create a margin of 0.05 in the dataset
    • DF1 <- DF[abs(DF$X1 - DF$X2) > MARGIN, ] # Retain only those points that lie outside the margin
    • nrow(DF1) # Check the number of data points remaining
    4.2) Replot
    • PLOT <- ggplot(data = DF1, aes(x = X1, y = X2, color = Y)) + geom_point() + scale_color_manual(values = c("-1" = "red", "1" = "blue")) + # Distinguish by color
      geom_abline(slope = 1, intercept = 0)
    5. Plot the margin's upper and lower boundaries
    • PLOT <- PLOT + geom_abline(slope = 1, intercept = MARGIN, linetype = "dashed") + geom_abline(slope = 1, intercept = -MARGIN, linetype = "dashed")

    Chapter 2. Support Vector Classifiers - Linear Kernels

    2.1 Linear Support Vector Machines

    2.1.1 Concept

    1. Decision boundaries can have different shapes e.g. line (previous Chapter), polynomial, or more complex function
    2. Type of decision boundary = "kernel", it must be specified upfront

    2.1.2 SVM with linear kernel

    • library(e1071)
    • svm(RESP_VAR ~ EXP_VAR,        # formula (same as other models)
        data = DF,                   # dataframe containing the data
        type = "C-classification",   # set to C-classification
        kernel = "linear",           # form of the decision boundary
        cost = XXX,                  # tuning parameter
        gamma = YYY,                 # tuning parameter
        scale = TRUE/FALSE)          # in real work this should usually be TRUE

    2.1.3 Steps of building SVM with linear kernel and Commands

    1. Split data into training and test datasets
    1.1) Set a seed for reproducibility
    • set.seed(INT)
    1.2) Set the upper bound for the number of rows in the training set
    • SAMPLE_SIZE <- floor(0.8 * nrow(DF))
    1.3) Randomly assign rows to training/test sets in an 80/20 proportion
    • DATA <- sample(seq_len(nrow(DF)), size = SAMPLE_SIZE)
    1.4) Separate training and test sets
    • TRAIN_DATA <- DF[DATA, ]
    • TEST_DATA <- DF[-DATA, ]
    2. Build a linear SVM
    • library(e1071)
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, # Use the training data
        type = "C-classification", kernel = "linear", scale = FALSE)
    3. Look into the model
    3.1) See an overview of the model
    • SVM_MODEL
    3.2) List the components of the model
    • names(SVM_MODEL)
    3.3) Indices of the support vectors in the training dataset
    • SVM_MODEL$index
    3.4) Support vectors
    • SVM_MODEL$SV
    3.5) Negative intercept (unweighted)
    • SVM_MODEL$rho
    3.6) Weighting coefficients for the support vectors
    • SVM_MODEL$coefs
    4. Obtain class predictions for the training and test datasets
    • PRED_TRAIN <- predict(SVM_MODEL, TRAIN_DATA)
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    5. Evaluate the accuracy of the model
    • mean(PRED_TRAIN == TRAIN_DATA$Y)
    • mean(PRED_TEST == TEST_DATA$Y)

    2.2 Visualizing Linear SVMs

    Steps of visualizing linear SVMs (Manual method)

    1. Plot the training data
    • PLOT <- ggplot(data = TRAIN_DATA, aes(x = X1, y = X2, color = Y)) + geom_point() + scale_color_manual(values = c("red", "blue"))
    2. Mark out the support vectors using the index from the svm model
    2.1) Identify the support vectors
    • DF_SV <- TRAIN_DATA[SVM_MODEL$index, ]
    2.2) Mark out the support vectors in the plot
    • PLOT <- PLOT + geom_point(data = DF_SV, aes(x = X1, y = X2), color = "purple", size = 4, alpha = 0.5)
      Note: ggplot works in layers, so geom_point() adds another layer for the markers.
    3. Find the slope and intercept of the boundary
    3.1) Build the weight vector from the coefs and SV elements of the svm model
    • WEIGHT <- t(SVM_MODEL$coefs) %*% SVM_MODEL$SV # %*% = matrix multiplication
    3.2) Calculate the slope
    • SLOPE_1 <- -WEIGHT[1] / WEIGHT[2]
    3.3) Calculate the intercept
    • INTERCEPT_1 <- SVM_MODEL$rho / WEIGHT[2]
    4. Add the decision boundary using the calculated slope and intercept
    4.1) Plot the decision boundary
    • PLOT <- PLOT + geom_abline(slope = SLOPE_1, intercept = INTERCEPT_1)
    4.2) Add the soft margin parallel to the decision boundary, offset by 1 / WEIGHT[2] on either side
    • PLOT <- PLOT + geom_abline(slope = SLOPE_1, intercept = INTERCEPT_1 - 1 / WEIGHT[2], linetype = "dashed") + geom_abline(slope = SLOPE_1, intercept = INTERCEPT_1 + 1 / WEIGHT[2], linetype = "dashed")
      Note: Soft margin classifiers allow for uncertainty in the location/shape of the boundary, since the decision boundary is neither perfectly linear nor known exactly; if the boundary really is linear, the margin can be reduced.

    Steps of visualizing linear SVMs (Convenient method)

    • library(e1071)
    • plot(x = SVM_MODEL,data = TRAIN_DATA)

    2.3 Tuning linear SVMs

    2.3.1 Cost parameter for linear SVM

    • svm_model <- svm(y ~ ., data = TRAIN_DATA,
      type = "C-classification",
      kernel = "linear",
      cost = INT, # Specify cost; the higher it is, the narrower the margin (visible in the number of support vectors)
      scale = FALSE)
      Note: Tuning the cost to increase accuracy mainly works when the data are (close to) linearly separable; on a non-linear dataset it does not improve accuracy much.

    2.4 Multiclass problems

    2.4.1 Concept: How SVM algorithm deal with multiclass problems

    1. SVMs are inherently binary classifiers
    2. SVM applies the "one-against-one" classification strategy for multiclass problems as follows (a small illustration follows below):
    2.1) Partition the data into subsets containing two classes each
    2.2) Solve the binary classification problem for each subset
    2.3) Use majority vote to assign a class to each data point
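    A small illustration of the scale of the one-against-one strategy, using the built-in iris data as an example: with k classes, choose(k, 2) pairwise binary problems are solved (svm() does this internally):

        k <- length(levels(iris$Species))
        choose(k, 2)                     # 3 binary classification problems for 3 classes
        combn(levels(iris$Species), 2)   # the class pairs, one binary SVM per column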

    2.4.2 Commands [same as previous]

    • library(e1071)
    • svm_model <- svm(Species ~ ., data = TRAIN_DATA,
      type ="C-classification",
      kernel ="linear")

    Chapter 3. Support Vector Classifiers - Polynomial Kernels

    1. Generating a radially separable dataset

    1.1 Generate a 2D uniformly distributed set of points

    1. Set required number of datapoints
    • n <- INT
    1. Set seed to ensure reproducibility
    • set.seed(INT)
    1. Generate dataframe with 2 predictors x1 and x2 in (-1, 1)
    • DF <- data.frame(X1 = runif(n, min = -1, max = 1), X2 = runif(n, min = -1, max = 1))

    1.2 Create data within a circular boundary of decision

    1. Create a circular decision boundary of radius INT
    • RADIUS <- INT
    • RADIUS_SQUARED <- RADIUS ^ 2
    2. Categorize data points depending on their location with respect to the boundary
    • DF$Y <- factor(ifelse(DF$X1 ^ 2 + DF$X2 ^ 2 < RADIUS_SQUARED, -1, 1), levels = c(-1, 1))

    1.3 Plot the dataset

    • library(ggplot2)
    • PLOT <- ggplot(data = DF, aes(x = X1, y = X2, color = Y)) + geom_point() + scale_color_manual(values = c("-1" = "red","1" = "blue"))
    • PLOT

    1.4 Add a circular boundary

    1. Define a function that generates points lying on a circle of radius r
    • CIRCLE <- function(X1_CENTER, X2_CENTER, r, npoint = 100) {
        # Angular spacing of 2*pi/npoint between points
        THETA <- seq(0, 2 * pi, length.out = npoint)
        X1_CIRC <- X1_CENTER + r * cos(THETA)
        X2_CIRC <- X2_CENTER + r * sin(THETA)
        data.frame(X1C = X1_CIRC, X2C = X2_CIRC)
      }
    2. Generate the boundary
    • BOUNDARY <- CIRCLE(X1_CENTER = 0, X2_CENTER = 0, r = RADIUS)
    3. Add the boundary to the previous plot
    • PLOT <- PLOT + geom_path(data = BOUNDARY, aes(x = X1C, y = X2C), inherit.aes = FALSE)

    2. Creating linear SVMs model on radially separable data

    2.1 Build a linear SVM model, preliminarily evaluate and visualize it

    1. Build a linear SVM model
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "linear", cost = 1)
    2. Calculate the accuracy on the test data
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    3. Plot the model with the training data
    • plot(SVM_MODEL, TRAIN_DATA)

    2.2 Tune cost parameter

    1. Tune the cost parameter
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "linear", cost = 100) # Tune to 100, for example
    2. Evaluate the tuned model using accuracy
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    3. Plot the tuned model with the training data
    • plot(SVM_MODEL, TRAIN_DATA)

    2.3 Evaluate model

    1. Concept: a better estimate of accuracy
    1.1) Calculate the mean accuracy over a number of independent train/test splits
    1.2) Check the SD of the results to gauge variability
    2. Commands
    • ACCURACY <- rep(NA, 100)
    • set.seed(10)
    • for (i in 1:100) {
        SAMPLE_SIZE <- floor(0.8 * nrow(DF))
        TRAIN <- sample(seq_len(nrow(DF)), size = SAMPLE_SIZE)
        TRAIN_DATA <- DF[TRAIN, ]
        TEST_DATA <- DF[-TRAIN, ]
        SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "linear", cost = 100)
        PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
        ACCURACY[i] <- mean(PRED_TEST == TEST_DATA$Y)
      }
    • mean(ACCURACY)
    • sd(ACCURACY)

    3. Applying kernel trick on radially separable data

    3.1 Transform the problem

    1. Concept = devise a transformation that makes the problem linearly separable
    2. Equation: x1^2 + x2^2 = NUM. If X1 = x1^2 and X2 = x2^2, then X1 + X2 = NUM -> a linear equation
    • DF_1 <- data.frame(X1SQ = DF$X1^2, X2SQ = DF$X2^2, Y = DF$Y) # DF from the previous model
    3. Plot the transformed data
    • TRANSFORMED_PLOT <- ggplot(data = DF_1, aes(x = X1SQ, y = X2SQ, color = Y)) + geom_point() + scale_color_manual(values = c("red", "blue")) + geom_abline(slope = -1, intercept = RADIUS_SQUARED) # In the transformed space the boundary is a line with slope -1 and intercept RADIUS_SQUARED

    3.2 Concept: Polynomial kernel

    1. The polynomial kernel: (gamma * (u . v) + coef0) ^ degree (a minimal sketch follows below)
    1.1) degree = the degree of the polynomial
    1.2) gamma and coef0 = tuning parameters
    1.3) u, v = vectors (data points) belonging to the dataset
    2. Kernel functions = functions that satisfy specific mathematical properties, which let SVMs work in the transformed space
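    A minimal sketch of the polynomial kernel value for two data points u and v (plain numeric vectors), matching the formula above; gamma, coef0 and degree are the tuning parameters:

        poly_kernel <- function(u, v, gamma = 1, coef0 = 0, degree = 2) {
          (gamma * sum(u * v) + coef0) ^ degree
        }

        poly_kernel(c(0.3, 0.4), c(0.1, 0.9), gamma = 1, coef0 = 1, degree = 2)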

    3.3 Apply a quadratic (polynomial) kernel on radially separable data

    1. Split train/test data and build a model
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "polynomial", # Specify the kernel as polynomial
        degree = INT) # Set the degree of the polynomial
    2. Evaluate the model
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    3. Plot the model
    • plot(SVM_MODEL, TRAIN_DATA)

    4. Tuning polynomial SVMs

    4.1 Objectives of tuning

    Hard to find optimal values of parameters manually for complex kernels

    4.2 Concept: how it works

    1. Set a range of search values for each parameter
    2. For each possible combination, build an SVM model
    3. Evaluate the model accuracy
    4. Return the parameter with best accuracy

    4.3 Steps and commands

    1. Set a range of search values for each parameter (tune.svm() builds and evaluates a model for every combination and returns the best one)
    • TUNE_OUT <- tune.svm(x = TRAIN_DATA[, -3], y = TRAIN_DATA[, 3],
        type = "C-classification", kernel = "polynomial", degree = 2,
        cost = 10^(-1:2),        # Set range
        gamma = c(0.1, 1, 10),   # Set range
        coef0 = c(0.1, 1, 10))   # Set range
    • TUNE_OUT$best.parameters$cost
    • TUNE_OUT$best.parameters$gamma
    • TUNE_OUT$best.parameters$coef0
    2. Build an SVM model using the best parameter combination
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "polynomial", degree = 2,
        cost = TUNE_OUT$best.parameters$cost, gamma = TUNE_OUT$best.parameters$gamma, coef0 = TUNE_OUT$best.parameters$coef0)
    3. Evaluate the model accuracy on the train and test datasets
    • PRED_TRAIN <- predict(SVM_MODEL, TRAIN_DATA)
    • mean(PRED_TRAIN == TRAIN_DATA$Y)
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)

    Chapter 4. Support Vector Classifiers - Radial Basis Function (RBF) Kernels

    1. Generating a complex dataset

    1.1 Concept: RBF kernel

    1. Highly flexible kernel - can fit complex decision boundaries
    2. Commonly used in practice

    1.2 Steps of generating a complex dataset

    1. Generate a complex dataset whose two predictors follow different distributions
    • N <- INT # Set the number of data points
    • set.seed(INT) # Set a seed for reproducibility
    • DF <- data.frame(X1 = rnorm(N, mean = -0.5, sd = 1), X2 = runif(N, min = -1, max = 1))
    2. Classify the data points using a boundary made of two equi-radial circles with a single point in common
    2.1) Set the radius and centers
    • RADIUS <- 0.7
    • RADIUS_SQUARED <- RADIUS ^ 2
    • CENTER_1 <- c(-0.7, 0)
    • CENTER_2 <- c(0.7, 0)
    2.2) Classify points (inside either circle = -1, outside = 1)
    • DF$Y <- factor(ifelse((DF$X1 - CENTER_1[1])^2 + (DF$X2 - CENTER_1[2])^2 < RADIUS_SQUARED |
        (DF$X1 - CENTER_2[1])^2 + (DF$X2 - CENTER_2[2])^2 < RADIUS_SQUARED, -1, 1), levels = c(-1, 1))
    3. Visualize the dataset
    • PLOT <- ggplot(data = DF, aes(x = X1, y = X2, color = Y)) + geom_point() + guides(color = "none") + scale_color_manual(values = c("red", "blue"))
    4. Draw the boundary around the data points
    4.1) Generate points on a circle using a function
    • CIRCLE <- function(X1_CENTER, X2_CENTER, r, npoint = 100) {
        THETA <- seq(0, 2 * pi, length.out = npoint)
        X1_CIRC <- X1_CENTER + r * cos(THETA)
        X2_CIRC <- X2_CENTER + r * sin(THETA)
        data.frame(X1C = X1_CIRC, X2C = X2_CIRC)
      }
    4.2) Generate the boundaries using the points from 4.1)
    • BOUNDARY_1 <- CIRCLE(X1_CENTER = CENTER_1[1], X2_CENTER = CENTER_1[2], r = RADIUS)
    • PLOT <- PLOT + geom_path(data = BOUNDARY_1, aes(x = X1C, y = X2C), inherit.aes = FALSE)
    • BOUNDARY_2 <- CIRCLE(X1_CENTER = CENTER_2[1], X2_CENTER = CENTER_2[2], r = RADIUS)
    • PLOT <- PLOT + geom_path(data = BOUNDARY_2, aes(x = X1C, y = X2C), inherit.aes = FALSE)

    2. Applying linear and polynomial kernels to the complex dataset

    Quadratic kernel

    2.1 Apply the quadratic kernel with default parameters

    1. Split train/test data and build a model
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "polynomial", # Specify the kernel as polynomial
        degree = 2) # Quadratic: degree 2
    2. Evaluate the model
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    3. Plot the model
    • plot(SVM_MODEL, TRAIN_DATA)

    2.2 Apply a higher degree -> the polynomial kernel still doesn't fit much better

    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "polynomial", degree = 4) # Set a higher degree of the polynomial, e.g. 4

    2.3 Apply another approach

    1. Concept - a heuristic: points close to one another tend to have the same class (as in the k-Nearest Neighbors algorithm)
    2. How the heuristic works - for a given point in the dataset, e.g. X1 = (a, b):
    2.1) The kernel should have its maximum at (a, b) and should decay as one moves away from (a, b)
    2.2) In the absence of any other information, the rate of decay (gamma) should be the same in all directions
    2.3) The rate of decay should be tunable
    3. Function for the heuristic: exp(-gamma * r), where r is the distance between X1 and any other point X [the RBF concept]
    • RBF <- function(r, gamma) exp(-gamma * r)
    • ggplot(data.frame(r = c(-0, 10)), aes(r)) + stat_function(fun = RBF, args = list(gamma = 0.2), aes(color = "0.2")) + stat_function(fun = RBF, args = list(gamma = 0.4), aes(color = "0.4")) + stat_function(fun = RBF, args = list(gamma = 0.6), aes(color = "0.6")) + stat_function(fun = RBF, args = list(gamma = 0.8), aes(color = "0.8")) + stat_function(fun = RBF, args = list(gamma = 1), aes(color = "1")) + stat_function(fun = RBF, args = list(gamma = 2), aes(color = "2")) + scale_color_manual("gamma", values = c("red","orange","yellow","green","blue","violet")) + ggtitle("Radial basis function (gamma = 0.2 to 2)")

    3. Applying Radial Basis Function (RBF) kernels to the complex dataset

    3.1 Concept

    1. A decreasing function of the distance between two points in the dataset (see the sketch below)
    2. Simulates the k-NN algorithm (the closer two points are, the more similar they are treated)
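    A minimal sketch of the RBF kernel value for two points u and v: it decays with the squared distance between them at a rate set by gamma, which is the form used by e1071's radial kernel:

        rbf_kernel <- function(u, v, gamma = 1) {
          exp(-gamma * sum((u - v) ^ 2))
        }

        rbf_kernel(c(0.3, 0.4), c(0.1, 0.9), gamma = 0.5)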

    3.2 Steps of building an SVM using the RBF kernel and commands

    1. Build an RBF-kernel SVM for the complex dataset and evaluate it
    1.1) Build an SVM model with the RBF kernel
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "radial") # Specify "radial"
    1.2) Evaluate the model on both the train and test datasets
    • PRED_TRAIN <- predict(SVM_MODEL, TRAIN_DATA)
    • mean(PRED_TRAIN == TRAIN_DATA$Y)
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    1.3) Plot the model
    • plot(SVM_MODEL, TRAIN_DATA)
    2. Refine the decision boundary
    2.1) Tune gamma and cost using tune.svm()
    • TUNE_OUT <- tune.svm(x = TRAIN_DATA[, -3], y = TRAIN_DATA[, 3], gamma = 5 * 10 ^ (-2:2), cost = c(0.01, 0.1, 1, 10, 100), type = "C-classification", kernel = "radial")
    2.2) Print the best parameters
    • TUNE_OUT$best.parameters$cost
    • TUNE_OUT$best.parameters$gamma
    3. Build the tuned model, evaluate it on the test dataset, and plot it with the training dataset
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "radial", cost = TUNE_OUT$best.parameters$cost, gamma = TUNE_OUT$best.parameters$gamma)
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    • plot(SVM_MODEL, TRAIN_DATA)

    Exercise

    1. Visualizing a sugar content dataset

    #Load ggplot2
    • library(ggplot2) #Print variable names
    • colnames(df) #Plot sugar content along the x-axis
    • plot_df <- ggplot(data = df, aes(x = sugar_content, y = 0)) + geom_point() + geom_text(aes(label = sugar_content), size = 2.5, vjust = 2, hjust = 0.5)
    • plot_df

    2. Find the maximal margin separator

    #The maximal margin separator is at the midpoint of the two extreme points in each cluster.
    • mm_separator <- (8.9 + 10)/2

    3. Visualize the maximal margin separator

    #create data frame containing the maximum margin separator
    • separator <- data.frame(sep = mm_separator) #add ggplot layer
    • plot_sep <- plot_df + geom_point(data = separator, aes(x = sep, y = 0), color = "blue", size = 4)
    • plot_sep

    4. Generate a 2d uniformly distributed dataset

    #set seed
    • set.seed(42) #set number of data points.
    • n <- 600 #Generate data frame with two uniformly distributed predictors lying between 0 and 1.
    • df <- data.frame(x1 = runif(n), x2 = runif(n))

    5. Create a decision boundary

    #classify data points depending on location
    • df$y <- factor(ifelse(df$x2 - 1.4*df$x1 < 0, -1, 1), levels = c(-1, 1))

    6. Introduce a margin in the dataset

    #set margin
    • delta <- 0.07 #retain only those points that lie outside the margin
    • df1 <- df[abs(1.4*df$x1 - df$x2) > delta, ] #build plot
    • plot_margins <- ggplot(data = df1, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("red", "blue")) + geom_abline(slope = 1.4, intercept = 0)+ geom_abline(slope = 1.4, intercept = delta, linetype = "dashed") + geom_abline(slope = 1.4, intercept = -delta, linetype = "dashed")
    • plot_margins

    7. Creating training and test datasets

    #Set the upper bound for the length of the training set
    • sample_size <- floor(0.8 * nrow(df)) #Assign rows to training set randomly
    • train <- sample(seq_len(nrow(df)), size = sample_size) #Yield training and test sets
    • trainset <- df[train, ]
    • testset <- df[-train, ]

    8. Building a linear SVM classifier

    • library(e1071) #build svm model, setting required parameters
    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear", scale = FALSE)

    9. Exploring the model and calculating accuracy

    #list components of model
    • names(svm_model) #list values of the SV, index and rho
    • svm_model$SV
    • svm_model$index
    • svm_model$rho #compute training accuracy
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y) #compute test accuracy
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y)

    10. Visualizing support vectors using ggplot

    #load ggplot
    • library(ggplot2) #build scatter plot of training dataset
    • scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("red", "blue")) #add plot layer marking out the support vectors
    • layered_plot <- scatter_plot + geom_point(data = trainset[svm_model$index, ], aes(x = x1, y = x2), color = "purple", size = 4, alpha = 0.5)
    • layered_plot

    11. Visualizing decision & margin bounds using ggplot2

    #calculate slope and intercept of decision boundary from weight vector and svm model
    • slope_1 <- -w[1]/w[2]
    • intercept_1 <- svm_model$rho/w[2] #build scatter plot of training dataset
    • scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("red", "blue")) #add decision boundary
    • plot_decision <- scatter_plot + geom_abline(slope = slope_1, intercept = intercept_1) #add margin boundaries
    • plot_margins <- plot_decision + geom_abline(slope = slope_1, intercept = intercept_1 - 1/w[2], linetype = "dashed")+ geom_abline(slope = slope_1, intercept = intercept_1 + 1/w[2], linetype = "dashed")
    • plot_margins

    12. Visualizing decision & margin bounds using plot()

    #load required library
    • library(e1071) #build svm model
    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear", scale = FALSE) #plot decision boundaries and support vectors for the training data
    • plot(x = svm_model, data = trainset)

    13. Tuning a linear SVM

    #build svm model, cost = 1
    • svm_model_1 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 1, kernel = "linear", scale = FALSE)
    • svm_model_1 #build svm model, cost = 100
    • svm_model_100 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 100, kernel = "linear", scale = FALSE)
    • svm_model_100

    14. Visualizing decision boundaries and margins

    [Cost = 1] #add decision boundary and margins for cost = 1 to training data scatter plot

    • train_plot_with_margins <- train_plot + geom_abline(slope = slope_1, intercept = intercept_1) + geom_abline(slope = slope_1, intercept = intercept_1-1/w_1[2], linetype = "dashed")+ geom_abline(slope = slope_1, intercept = intercept_1+1/w_1[2], linetype = "dashed")
    • train_plot_with_margins

    [Cost = 100] #add decision boundary and margins for cost = 100 to training data scatter plot
    • train_plot_with_margins <- train_plot_100 + geom_abline(slope = slope_100, intercept = intercept_100, color = "goldenrod") + geom_abline(slope = slope_100, intercept = intercept_100-1/w_100[2], linetype = "dashed", color = "goldenrod")+ geom_abline(slope = slope_100, intercept = intercept_100+1/w_100[2], linetype = "dashed", color = "goldenrod")
    • train_plot_with_margins

    15. A multiclass classification problem

    #load library and build svm model
    • library(e1071)
    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear", scale = FALSE) #compute training accuracy
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y) #compute test accuracy
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #plot
    • plot(x = svm_model, trainset)

    16. Iris redux - a more robust accuracy

    • for (i in 1:100){ #assign 80% of the data to the training set sample_size <- floor(0.8 * nrow(iris)) train <- sample(seq_len(nrow(iris)), size = sample_size) trainset <- iris[train, ] testset <- iris[-train, ] #build model using training data svm_model <- svm(Species~ ., data = trainset, type = "C-classification", kernel = "linear") #calculate accuracy on test data pred_test <- predict(svm_model, testset) accuracy[i] <- mean(pred_test == testset$Species) }
    • mean(accuracy)
    • sd(accuracy)

    17. Generating a 2d radially separable dataset

    #set number of variables and seed
    • n <- 400
    • set.seed(1) #Generate data frame with two uniformly distributed predictors, x1 and x2
    • df <- data.frame(x1 = runif(n, min = -1, max = 1), x2 = runif(n, min = -1, max = 1)) #We want a circular boundary. Set boundary radius
    • radius <- 0.8
    • radius_squared <- radius^2 #create dependent categorical variable, y, with value -1 or 1 depending on whether point lies #within or outside the circle
    • df$y <- factor(ifelse(df$x1^2 + df$x2^2 < radius_squared, -1, 1), levels = c(-1, 1))

    18.Visualizing the dataset

    #load ggplot
    • library(ggplot2) #build scatter plot, distinguish class by color
    • scatter_plot <- ggplot(data = df, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("red", "blue"))
    • scatter_plot

    19. Linear SVM for a radially separable dataset

    [Cost = 1] #default cost model

    • svm_model_1 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 1, kernel = "linear") #training accuracy
    • pred_train <- predict(svm_model_1, trainset)
    • mean(pred_train == trainset$y) #test accuracy
    • pred_test <- predict(svm_model_1, testset)
    • mean(pred_test == testset$y)

    [Cost = 100] #cost = 100 model
    • svm_model_2 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 100, kernel = "linear") #accuracy
    • pred_train <- predict(svm_model_2, trainset)
    • mean(pred_train == trainset$y) #test accuracy
    • pred_test <- predict(svm_model_2, testset)
    • mean(pred_test == testset$y)

    20. Average accuracy for linear SVM

    #Print average accuracy and standard deviation
    • accuracy <- rep(NA, 100)
    • set.seed(2) #Calculate accuracies for 100 training/test partitions
    • for (i in 1:100){ sample_size <- floor(0.8 * nrow(df)) train <- sample(seq_len(nrow(df)), size = sample_size) trainset <- df[train, ] testset <- df[-train, ] svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear") pred_test <- predict(svm_model, testset) accuracy[i] <- mean(pred_test == testset$y) } #Print average accuracy and standard deviation
    • mean(accuracy)
    • sd(accuracy)

    21. Visualizing transformed radially separable data

    #transform data
    • df1 <- data.frame(x1sq = df$x1^2, x2sq = df$x2^2, y = df$y) #plot data points in the transformed space
    • plot_transformed <- ggplot(data = df1, aes(x = x1sq, y = x2sq, color = y)) + geom_point()+ guides(color = "none") + scale_color_manual(values = c("red", "blue")) #add decision boundary and visualize
    • plot_decision <- plot_transformed + geom_abline(slope = -1, intercept = 0.64)
    • plot_decision
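    Note on the decision boundary above: in the transformed (x1sq, x2sq) space, the circular boundary x1^2 + x2^2 = radius^2 = 0.8^2 = 0.64 becomes the straight line x2sq = 0.64 - x1sq, i.e. slope = -1 and intercept = 0.64, which is exactly what geom_abline() draws.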

    22.SVM with polynomial kernel

    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2) #measure training and test accuracy
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y)
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #plot
    • plot(svm_model, trainset)

    23. Using tune.svm()

    #tune model
    • tune_out <- tune.svm(x = trainset[, -3], y = trainset[, 3], type = "C-classification", kernel = "polynomial", degree = 2, cost = 10^(-1:2), gamma = c(0.1, 1, 10),
      coef0 = c(0.1, 1, 10)) #list optimal values
    • tune_out$best.parameters$cost
    • tune_out$best.parameters$gamma
    • tune_out$best.parameters$coef0

    24. Building and visualizing the tuned model

    #Build tuned model
    • svm_model <- svm(y~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2, cost = tune_outcost, gamma = tune_outgamma, coef0 = tune_outcoef0) #Calculate training and test accuracies
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y)
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #plot model
    • plot(svm_model, trainset)

    25. Generating a complex dataset - part 1

    #number of data points
    • n <- 1000 #set seed
    • set.seed(1) #create dataframe
    • df <- data.frame(x1 = rnorm(n, mean = -0.5, sd = 1), x2 = runif(n, min = -1, max = 1))

    26. Generating a complex dataset - part 2

    #set radius and centers
    • radius <- 0.8
    • center_1 <- c(-0.8, 0)
    • center_2 <- c(0.8, 0)
    • radius_squared <- radius^2 #create binary classification variable
    • df$y <- factor(ifelse((df$x1 - center_1[1])^2 + (df$x2 - center_1[2])^2 < radius_squared | (df$x1 - center_2[1])^2 + (df$x2 - center_2[2])^2 < radius_squared, -1, 1), levels = c(-1, 1))

    27. Visualizing the dataset

    #Load ggplot2
    • library(ggplot2) #Plot x2 vs. x1, colored by y
    • scatter_plot<- ggplot(data = df, aes(x = x1, y = x2, color = y)) + #Add a point layer geom_point() + scale_color_manual(values = c("red", "blue")) + #Specify equal coordinates coord_fixed()
    • scatter_plot

    28. Linear SVM for complex dataset

    #build model
    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear") #accuracy
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y)
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #plot model against testset
    • plot(svm_model, testset)

    29. Quadratic SVM for complex dataset

    #build model
    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2) #accuracy
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y)
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #plot model
    • plot(svm_model, trainset)

    30. Polynomial SVM on a complex dataset

    #create vector to store accuracies and set random number seed
    • accuracy <- rep(NA, 100)
    • set.seed(2) #calculate accuracies for 100 training/test partitions
    • for (i in 1:100){ sample_size <- floor(0.8 * nrow(df)) train <- sample(seq_len(nrow(df)), size = sample_size) trainset <- df[train, ] testset <- df[-train, ] svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2) pred_test <- predict(svm_model, testset) accuracy[i] <- mean(pred_test == testset$y) } #print average accuracy and standard deviation
    • mean(accuracy)
    • sd(accuracy)

    31. RBF SVM on a complex dataset

    #create vector to store accuracies and set random number seed
    • accuracy <- rep(NA, 100)
    • set.seed(2) #calculate accuracies for 100 training/test partitions
    • for (i in 1:100){ sample_size <- floor(0.8 * nrow(df)) train <- sample(seq_len(nrow(df)), size = sample_size) trainset <- df[train, ] testset <- df[-train, ] svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial") pred_test <- predict(svm_model, testset) accuracy[i] <- mean(pred_test == testset$y) } #print average accuracy and standard deviation
    • mean(accuracy)
    • sd(accuracy)

    32. Tuning an RBF kernel SVM

    #tune model
    • tune_out <- tune.svm(x = trainset[, -3], y = trainset[, 3], gamma = 5*10^(-2:2), cost = c(0.01, 0.1, 1, 10, 100), type = "C-classification", kernel = "radial") #build tuned model
    • svm_model <- svm(y~ ., data = trainset, type = "C-classification", kernel = "radial", cost = tune_outcost, gamma = tune_outgamma) #calculate test accuracy
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #Plot decision boundary against test data
    • plot(svm_model, testset)

    Course 7: Fundamentals of Bayesian Data Analysis in R

    Chapter 1. What is Bayesian Data Analysis?

    1. Introduction

    1.1 Historical background of Bayesian analysis

    During WW2, Alan Turing used Bayesian reasoning to break the codes produced by the Nazi Enigma encryption machine.

    1.2 Bayesian inference

    1. Definition = A method for figuring out unobservable quantities given known facts, using probability to describe the uncertainty over what the values of [the unknown quantities] could be.
    2. For the Enigma machine, [the unknown quantities] = which three wheels (selected from a pool of eight) were used and what positions they were set to - the problem is that neither the wheels nor their positions are known.
    3. Bayesian inference was applied backwards, from the intercepted codes to the wheel settings (which wheels and which positions).

    2. Bayesian Data Analysis

    2.1 Basics

    1. Definition = The use of Bayesian inference to learn from data
    2. Background = Thomas Bayes wrote the first article; at first, Bayesian inference was simply called probabilistic inference
    3. Characteristics 3.1) Can be used for hypothesis testing, linear regression, etc 3.2) Is flexible and allows you to construct problem-specific models
    4. Probability 4.1) A statement about certainty (1 = complete certainty) or uncertainty (0 = complete uncertainty) 4.2) Not only about yes/no events 4.3) Role of probability distributions vs Bayesian inference 4.3.1) Role of a probability distribution = represent uncertainty 4.3.2) Role of Bayesian inference = update probability distributions to reflect what has been learned from the data

    2.2 Simple Bayesian model [for proportion of success]

    1. Assumptions of the Bayesian model 1.1) Data is a vector of successes and failures represented by 1s and 0s 1.2) There is an unknown underlying proportion of success 1.3) Whether a data point is a success is affected only by this proportion 1.4) Prior to seeing any data, any underlying proportion of success is equally likely 1.5) The result is a probability distribution that represents what the model knows about the underlying proportion of success.
    2. Steps of building a simple Bayesian model and their commands 2.1) Specify a vector for success and failure represented by 0 or 1
    • DATA <- c()
      2.2) Build a simple Bayesian model
    • prop_model(DATA) # Returns a graph of the proportion of success (more data will update the probability distribution)

    3. Samples and posterior summaries

    3.1 Prior and Posteriors

    1. Prior = a probability distribution that represents what the model knows before seeing the data (n=0)
    2. Posterior = a probability distribution that represents what the model knows after having seen the data (n= X)
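    A minimal base-R sketch of the same prior-to-posterior update, using standard Beta-binomial conjugacy rather than the course's prop_model() helper (assumption: with a flat Beta(1, 1) prior, observing s successes out of n trials gives a Beta(1 + s, 1 + n - s) posterior):

    • prior <- rbeta(n = 10000, shape1 = 1, shape2 = 1) # Prior: flat over 0-1, what the model knows before any data
    • posterior <- rbeta(n = 10000, shape1 = 1 + 2, shape2 = 1 + 11) # Posterior after seeing 2 successes out of 13 trials
    • hist(prior); hist(posterior) # The posterior concentrates at low proportions of success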

    Chapter 2. How does Bayesian inference work?

    1. Components of Bayesian inference

    1.1 Basic components of inference

    1. Generative model
    2. Priors = what the model knows b/f seeing the data
    3. Data Note: 1) + 2) = Bayesian model

    1.2 Generative model

    1. Definition = computer program or mathematic expression or set of rules that you can use to generate data from the parameters
    2. How to build a basic generative model 2.1) Set parameters
    • PROP_SUCCESS <- 0.15 # 15% curing success
    • N_ZOMBIES <- 13 2.2) Simulate data: sample a probability in (0, 1); if it is < 15%, count the draw as a success
    • DATA <- c()
    • for(zombie in 1:N_ZOMBIES) { DATA[zombie] <- runif(1, min = 0, max = 1) < PROP_SUCCESS }
    • DATA <- as.numeric(DATA) # Turn TRUE/FALSE into 1/0

    2. Component 1: Generative model

    2.1 Concept of Generative model

    1. Works the same way as the binomial distribution function (success/fail)
    2. Other distributions (normal, Poisson) can also be seen as small generative models

    2.2 Steps of Binomial generative model

    1. Parameters: prob, size
    2. Generative model: rbinom to generate data
    • rbinom(n = INT, size = INT, prob = X) # Returns a vector of success counts, one per run; n = number of times to run the generative model, size = number of trials (e.g. number of zombies), prob = underlying proportion of success between 0 and 1
    3. Data (a vector of success counts)

    2.3 Application of generative model

    Concept = We know data, but want to know parameter(s) by using Bayesian inference

    3. Component 2: Priors

    3.1 Definition and concept

    1. Prior = represents uncertainty b/f seeing the data
    2. Concept 2.1) We know approximate values of the parameters and how they differ 2.2) So we want to include that uncertainty in the binomial model 2.3) Assume the uncertainty takes the form of a probability in (0, x) -> a uniform distribution

    3.2 Steps of Binomial generative model (2)

    1. Assign a large number of samples using uniform distribution: runif (=random uniform) returns probs
    • PROP_UPDATE <- runif(n = N_SAMPLES, min = 0.0, max = NUM) # Specify uncertainty to create uniform distribution
    2. Insert the probs into rbinom (combining the variability of rbinom and runif)
    • rbinom(n = INT, size = INT, prob = PROP_UPDATE) # Returns a vector of success counts, one per run

    3. Bayesian models and conditioning (Component 3: Data and Bayesian inference)

    3.1 Build a Bayesian model (Prior + generative model)

    • N_SAMPLES <- 100000
    • N_TIMES <- 100
    • PROP_UPDATE <- runif(N_SAMPLES, min = 0.0, max = NUM)
    • DATA <- rbinom(n = N_SAMPLES, size = N_TIMES, prob = PROP_UPDATE)
    • PRIOR <- data.frame(PROP_UPDATE, DATA) # Combine the sampled parameter values and the simulated data

    3.2 Condition the Bayesian model (Bayesian inference)

    1. Concept: if we knew an approximate value of the parameter, we could remove all samples not fulfilling that approximate X% (e.g. % of clicks on a website) to reduce uncertainty, and so on
    2. Problem 2.1) We don't know the underlying X%, so we can't condition on it 2.2) What we do know is the data (component 3)
    3. Essence of Bayesian inference Definition = Conditioning on data in order to learn about parameter values
    4. Example of Bayesian model updates 4.1) Define prior
    • prior <- data.frame(proportion_clicks, n_visitors) 4.2) Condition on the data (we observed that 13 people clicked and visited our site)
    • posterior <- prior[prior$n_visitors == 13, ]
    • prior <- posterior #Replace prior$n_visitors with a new sample and visualize the result
    • n_samples <- nrow(prior)
    • n_ads_shown <- 100
    • prior$n_visitors <- rbinom(n_samples, size = n_ads_shown, prob = prior$proportion_clicks)
    • hist(prior$n_visitors) 4.3) See the updated posterior (prob of 5 or more people visiting the site)
    • sum(prior$n_visitors >= 5) / length(prior$n_visitors)

    Chapter 3. Why use Bayesian Data Analysis?

    1. Benefits of Bayesian data analysis

    Flexible

    1. Can include information sources in addition to the data (Note: the data can utilize pre-defined distribution e.g. beta distribution by defining alpha and beta) -> See ### Beta distribution 1.1) Background information 1.2) Expert opinion 1.3) Common knowledge
    2. Can compare between groups or between data sets -> See ## Comparisons between groups and datasets
    3. Can use results to inform business decisions -> See ## Decision analysis
    4. Can change the underlying statistical model -> ## 4. Change underlying statistical model
    5. Is optimal (probably; not guaranteed) -> ## 5. Bayes is optimal, kind of...

    Beta distribution

    1. Basic of beta distribution
    • beta_sample <- rbeta(n = INT, shape1 = POS_INT, shape2 = POS_INT) # n = number of draws; the larger shape1, the closer the draws sit to 1.0; the larger shape2, the closer to 0.0 Note: If shape1 or shape2 < 0, the result is NaN Note2: Beta(1,1) and the uniform distribution on (0,1) are both non-informative priors, as they treat every value from 0 to 1 as equally likely
    1. Translation of data into beta distribution Example of data: Most ads get clicked on 5% of the time, but for some ads it is as low as 2% and for others as high as 8%.
    • rbeta(n = n_draws, shape1 = 5, shape2 = 95) # Returns a distribution matching the example (centered around 5%)
    1. Model update using beta distribution
    • n_draws <- 100000
    • n_ads_shown <- 100
    • proportion_clicks <- rbeta(n_draws, shape1 = 5, shape2 = 95) # Replace uniform distribution with beta distribution
    • n_visitors <- rbinom(n_draws, size = n_ads_shown, prob = proportion_clicks)
    • prior <- data.frame(proportion_clicks, n_visitors)
    • posterior <- prior[prior$n_visitors == 13, ]

    2. Comparisons between groups and datasets

    2.1 Contrasts

    1. Old prior -> data -> old posterior
    2. Informed prior -> data -> informed posterior Note: which one to choose depends on how much you believe in data.

    2.2 Comparisons (between groups or datasets)

    Example: compare between video or text ads

    1. Set informed posterior
    • posterior <- data.frame(video_prop = posterior_video$proportion_clicks[1:4000], text_prop = posterior_text$proportion_clicks[1:4000])
    1. Calculate the posterior difference: video_prop - text_prop
    • posterior$prop_diff <- posterior$video_prop - posterior$text_prop
    • hist(posterior$prop_diff) # Visualize prop_diff
    • median(posterior$prop_diff) #Calculate the median of prop_diff
    • sum(posterior$prop_diff > 0) / length(posterior$prop_diff) # Calculate the proportion (prob video better than text)

    3. Decision analysis

    3.1 Concept

    Summary statistics of the posterior are not the same as summary statistics of the data: a posterior summary describes a parameter, while a data summary describes the dataset.

    3.2 Steps of small decision analysis

    1. Set parameter of interest -> e.g. profit and cost (comparing between video and text ads)
    • visitor_spend <- 2.53
    • video_cost <- 0.25
    • text_cost <- 0.05
    1. Calculate the parameters
    • posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost
    • posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost
    1. Calculate difference between groups
    • posterior$profit_diff <- posterior$video_profit - posterior$text_profit
    • median(posterior$profit_diff)
    1. Make a decision based on the probabilistic difference (profit_diff < 0 means text is better than video; > 0 means video is better than text)
    • sum(posterior$profit_diff < 0) / length(posterior$profit_diff)

    4. Change underlying statistical model

    4.1 Alternative scenario: completely switch out the binomial model for a Poisson model

    1. Have new data, e.g. a trial showed 19 clicks per day
    2. Change in practice e.g. don't pay per view but pay per day

    4.2 Poisson model

    1. The Poisson distribution: two parameters 1.1) lambda = mean number of events per time unit 1.2) n = number of times to simulate
    2. Steps of applying poisson model 2.1) Replace proportion with means and change the parameters
    • n_draws <- 100000
    • n_ads_shown <- 100
    • mean_clicks <- runif(n_draws, min = 0.0, max = 80) # Specify mean 2.2) Change the model from binomial to poisson
    • n_visitors <- rpois(n_draws, lambda = mean_clicks) # Specify lambda
    • prior <- data.frame(mean_clicks, n_visitors)
    • posterior <- prior[prior$n_visitors == 19, ] # Update data if needed

    5. Bayes is optimal, kind of...

    Bayes is optimal in the small world of the model

    1. If the uniform distribution assumption is wrong, the model stays wrong no matter how much data you add (e.g. we assumed at most a 20% click proportion - what if that is not true?)
    2. Bayesian data analysis: there is a separation between the model and the computation.

    Chapter 4. Bayesian inference with Bayes' theorem

    1. Probability rule

    1.1 Benefits and drawbacks of the basic inference method (filtering out impossible probabilities)

    1. Benefits 1.1) Bayesian computation is a hot research topic. 1.2) There are many methods to fit Bayesian models more efficiently. 1.3) The result will be the same, you'll just get it faster
    2. Drawback: the computation method we've used so far scales horribly.

    1.2 Probability theory

    1. Probability
    2. Mathematical notation 2.1) Probability, e.g. P(n_visitors = 13) 2.2) Probability distribution covering all possible numbers, e.g. P(n_visitors) 2.3) Conditional probability, where "|" means "given", e.g. P(n_visitors = 13 | prop_clicks = 10%) 2.4) Conditional probability distribution, e.g. P(n_visitors | prop_clicks = 10%), the same probability distribution as simulated before

    1.3 Manipulating probability

    1. The sum rule: if two possible outcomes are mutually exclusive, can sum them. E.g. p(1 or 2 or 3) = 1/6 + 1/6 + 1/6
    2. The product rule: if two possible outcomes are independent, can multiply them. E.g p(6 and 6) = 1/6 * 1/6
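    A quick numeric check of both rules with a fair die (plain arithmetic, nothing course-specific):

    • prob_1_or_2_or_3 <- 1/6 + 1/6 + 1/6 # Sum rule: mutually exclusive outcomes; equals 0.5
    • prob_double_six <- 1/6 * 1/6 # Product rule: independent rolls; equals about 0.028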

    2. Calculating likelihoods

    2.1 Simulation (simulate samples -> calculate prob)

    Example: P(n_visitors = 13 | prob_success = 10%)

    1. Simulate
    • n_visitors <- rbinom(n = 100000, size = 100, prob = 0.1)
    1. Calculate
    • sum(n_visitors == 13) / length(n_visitors)

    2.2 Calculation (in R, you can directly calculate prob, not requiring simulation)

    Example: P(n_visitors = 13 | prob_success = 10%)

    • dbinom(13, size = 100, prob = 0.1)

    2.3 Applications of the calculation

    1. The sum rule Example: P(n_visitors = 13 | prob_success = 10%)
    • dbinom(13, size = 100, prob = 0.1) + dbinom(14, size = 100, prob = 0.1)
    1. Getting whole probability distribution
    • n_visitors = seq(0, 100, by = 1)
    • probability <- dbinom(n_visitors, size = 100, prob = 0.1)
    1. Getting uniform continuous distribution
    • x = seq(0, 0.2, by=0.01)
    • dunif(x, min = 0.0, max = 0.2) # Returns probability density (not actual prob as prob must be c(0,1); d stands for density)
    1. Maximum likelihood estimation 4.1) Vary n_visitors and fix prob (want to know the most probable number of visitors)
    • n_ads_shown <- 100
    • proportion_clicks <- 0.1
    • n_visitors <- seq(0, 100, by = 1)
    • prob <- dbinom(n_visitors, size = n_ads_shown, prob = proportion_clicks) #Plot the distribution
    • plot(x = n_visitors, y = prob, type ="h") 4.2) Vary prob and fix n_visitors (want to know proportion_click)
    • n_ads_shown <- 100
    • proportion_clicks <- seq(0, 1, by = 0.01)
    • n_visitors <- 13
    • prob <- dbinom(n_visitors, size = n_ads_shown, prob = proportion_clicks); prob
    • plot(proportion_clicks, prob, type = "h")

    3. Bayesian calculation

    Bayesian inference by calculation

    1. Set parameters and combinations of the unknown
    • n_ads_shown <- 100 # Set number of ads to shown (we know)
    • n_visitors <- seq(0, 100, by = 1) # Set the Unknown
    • proportion_clicks <- seq(0, 1, by = 0.01) # Set the Unknown
    • PARAMETERS <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors) # List of all combinations
    1. Sample prior
    • PARAMETERS$PRIOR <- dunif(PARAMETERS$proportion_clicks, min = 0, max = 0.2)
    1. Calculate the likelihood of the data
    • PARAMETERS$LIKELIHOOD <- dbinom(PARAMETERS$n_visitors, size = n_ads_shown, prob = PARAMETERS$proportion_clicks)
    1. Calculate probability of clicks
    • PARAMETERS$PROBABILITY <- PARAMETERS$LIKELIHOOD * PARAMETERS$PRIOR
    • PARAMETERS$PROBABILITY <- PARAMETERS$PROBABILITY / sum(PARAMETERS$PROBABILITY) # Normalize
    1. Condition parameter we want to know in the model (e.g. visitor = 13) and retrieve calculated posterior 5.1) Filter method
    • PARAMETERS <- PARAMETERS[PARAMETERS$n_visitors == 13, ]
    • PARAMETERS$PROBABILITY <- PARAMETERS$PROBABILITY / sum(PARAMETERS$PROBABILITY)
    • plot(PARAMETERS$PROBABILITY, type = "h") 5.2) Shortcut method
    • n_visitors <- 13 # Replace with a number of interest

    4. Bayes' theorem

    4.1 Bayes' theorem

    P(PARAMETER|DATA) = [P(DATA|PARAMETER) x P(PARAMETER)] / sum([P(DATA|PARAMETER) x P(PARAMETER)]) Note:

    1. P(PARAMETER|DATA) = [POSTERIOR] the prob of different parameter values given some data (what we want)
    2. P(DATA|PARAMETER) = [LIKELIHOOD] OR the relative prob of the data given different parameter values
    3. P(PARAMETER) = [PRIOR] OR the prob of different parameters b/f seeing the data
    4. PARAMETER = a parameter combination
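    A minimal numeric illustration of the theorem on a toy grid (three candidate values of proportion_clicks instead of the full grid used above; the prior weights are made up for illustration):

    • proportion_clicks <- c(0.05, 0.10, 0.15) # A tiny parameter grid
    • prior <- rep(1/3, 3) # P(PARAMETER): equal prior weight on each candidate value
    • likelihood <- dbinom(13, size = 100, prob = proportion_clicks) # P(DATA | PARAMETER) for 13 visitors out of 100 ads
    • posterior <- likelihood * prior / sum(likelihood * prior) # P(PARAMETER | DATA), normalized so it sums to 1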

    4.2 Grid approximation

    1. A method that defines a grid over all the parameter combinations you need to evaluate
    2. It is not feasible to evaluate every possible combination of continuous parameters, so a finite grid is used as an approximation

    4.3 Mathematical notion of the generative model used so far

    1. n_ads_shown = 100
    2. proportion_clicks = unif(0.0,0.2)
    3. n_visitors = binom(n_ads_shown, proportion_clicks)

    Chapter 5. More parameters, more data, and more Bayes

    1. Normal distribution

    1.1 rnorm() and dnorm()

    • rnorm(n = INT, mean = NUM, sd = NUM) # Returns a sample of data
    • LIKELIHOOD <- dnorm(DATA, mean = NUM, sd = NUM) # Returns a vector of likelihoods (densities), one per data point
    • prod(LIKELIHOOD) # Multiply the likelihoods of all members of the vector
    • log(LIKELIHOOD) # Work on the log scale, as the product of many small likelihoods quickly becomes too small to represent
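    A short sketch of why the log scale matters; the data here are arbitrary simulated values, meant only to show the underflow:

    • DATA <- rnorm(1000, mean = 18, sd = 5) # 1000 hypothetical observations
    • LIKELIHOOD <- dnorm(DATA, mean = 18, sd = 5) # One density value per observation
    • prod(LIKELIHOOD) # The product of 1000 small numbers underflows to 0
    • sum(log(LIKELIHOOD)) # The summed log-likelihood stays finite and can still be compared between models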

    2. Build a Bayesian model

    2.1 Define the model by starting with PRIOR

    1. MU = normal(mean = 18, sd = 5)
    2. SIGMA = uniform(min = 0, max = 10)
    3. VALUE_INTEREST = normal(MU, SIGMA)
    4. DATA = c(X1,X2,X3,...) # Data should be "what we know"

    2.2 Fit the model with normal distribution (replace codes from the last chapter)

    1. Set parameters and combinations of the unknown
    • DATA <- c(X1,X2,X3,...) # Set the number we know
    • MU <- seq(8, 30, by = 0.5) # Set the unknown mean
    • SIGMA <- seq(0.1, 10, by = 0.3) # Set the unknown SD (upper bound matches the uniform(0, 10) prior above)
    • PARAMETERS <- expand.grid(MU = MU, SIGMA = SIGMA) # List of all combinations (parameter space)
    1. Sample prior
    • PARAMETERS$MU_PRIOR <- dnorm(PARAMETERS$MU, mean = 18, sd = 5) # Change function to dnorm and arguments
    • PARAMETERS$SIGMA_PRIOR <- dunif(PARAMETERS$SIGMA, min = 0, max = 10)
    • PARAMETERS$PRIOR <- PARAMETERS$MU_PRIOR * PARAMETERS$SIGMA_PRIOR
    1. Calculate the likelihoods of the data with for loop
    • for (i in 1:nrow(PARAMETERS)) { LIKELIHOODS <- dnorm(DATA, mean = PARAMETERS$MU[i], sd = PARAMETERS$SIGMA[i]); PARAMETERS$LIKELIHOOD[i] <- prod(LIKELIHOODS) }
    1. Calculate probability (posterior)
    • PARAMETERS$PROBABILITY <- PARAMETERS$LIKELIHOOD * PARAMETERS$PRIOR
    • PARAMETERS$PROBABILITY <- PARAMETERS$PROBABILITY / sum(PARAMETERS$PROBABILITY) # Normalize
    1. Condition on the parameter(s) we want to know and retrieve the calculated posterior 5.1 Translate what we want to know into questions (e.g. water temperature, to decide whether to have a beach party) 5.1.1) Q1: What is the likely average water temperature on the 20th of July? 5.1.2) Q2: What is the probability that the water temperature will be 18 or more on the next 20th of July (threshold of interest = 18 or more)? 5.2 Sample from the posterior with a large number of draws (makes the summaries easier)
    • SAMPLE_INDICES <- sample(1:nrow(PARAMETERS), size = INT, replace = TRUE, prob = PARAMETERS$PROBABILITY) # Set sample size large enough 5.3 Select MU and SIGMA for the sampled rows
    • PARAMETERS_SAMPLE <- PARAMETERS[SAMPLE_INDICES, c("MU", "SIGMA")] 5.4 Overview the selected sample
    • hist(PARAMETERS_SAMPLE$MU, 30)
    • quantile(PARAMETERS_SAMPLE$MU, c(0.05, 0.95)) # Calc. 90% credible interval 5.5 Predict the posteriors and condition according to 5.1
    • PRED_VALUE_INTEREST <- rnorm(INT, mean = PARAMETERS_SAMPLE$MU, sd = PARAMETERS_SAMPLE$SIGMA) # Set number of values to simulate
    • sum(PRED_VALUE_INTEREST >= 18) / length(PRED_VALUE_INTEREST) # Returns the probability

    3. Practical tool for building Bayesian models (BEST)

    3.1 Background and concept

    1. BEST (Bayesian Estimation Supersedes the t-test) = a Bayesian model developed by John Kruschke
    2. Assumption = use of the t-distribution (like a normal distribution with an extra parameter, the degrees of freedom). If df > 30, the t-distribution is shaped like a normal distribution; with fewer df it has heavier tails, so outliers are expected (under a normal model an added outlier strongly shifts the estimated mean; the t-distribution is more robust). See the quick comparison after this list.
    3. Application: 3.1) Estimates the MEAN, SD, and DEGREES OF FREEDOM parameters 3.2) Uses Markov chain Monte Carlo (MCMC) to fit the model (returns a table of posterior samples) 3.3) Compares estimated parameters between two groups
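    A quick base-R comparison of the tails (the df value is chosen arbitrarily for illustration):

    • dnorm(4) # Standard normal density 4 SDs from the mean: about 0.0001
    • dt(4, df = 3) # t-distribution with 3 df at the same point: about 0.009, a much heavier tail

    Because the heavy tails give extreme values non-negligible probability, a single outlier pulls the estimated mean far less than it would under a normal model.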

    3.2 Steps and codes

    • library(BEST)
    • DATA_1 <- c(X1,X2,...) # Specify group 1 data
    • DATA_2 <- c(X1,X2,...) # Specify group 2 data
    • FIT <- BESTmcmc(DATA_1,DATA_2) # Returns mean (mu), df (nu) and sd (sigma)
    • plot(FIT)

    4. Summary

    Bayesian Inference

    1. Data
    2. Bayesian Model 2.1) Priors: the prior probability over unknown parameters 2.2) Generative model 2.2.1) rbinom: binomial distribution, Poisson distribution 2.2.2) rnorm: normal distribution, t-distribution
    3. Computational method 3.1) Rejection sampling: sampling from the generative model -> filter samples that don't match 3.2) Grid approximation: directly use Bayesian theorem by calculating probabilities over grid of parameter combinations 3.3) Markov chain Monte Carlo (MCMC)
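    As a compact recap of 3.1, a rejection-sampling sketch using the same ad-clicks example as the earlier chapters (numbers as in the course exercises):

    • n_draws <- 100000
    • proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2) # Prior
    • n_visitors <- rbinom(n_draws, size = 100, prob = proportion_clicks) # Generative model
    • posterior <- proportion_clicks[n_visitors == 13] # Keep only the samples that reproduce the observed data
    • quantile(posterior, c(0.05, 0.5, 0.95)) # Posterior summary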

    What was not covered?

    1. Other types of advanced machine learning: deep learning, time series
    2. How to decide what priors and models to use
    3. How Bayesian statistics relate to classical statistics
    4. More advanced computational methods (besides MCMC)

    Exercise (Course 7: Fundamentals of Bayesian Data Analysis in R)

    1. Coin flips with prop_model

    • data <- c(1, 0, 0, 1)
    • prop_model(data)

    2. Zombie drugs with prop_model

    #Update the data and rerun prop_model (2 cured zombies out of 13)
    • data = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    • prop_model(data)

    3. Looking at samples from prop_model

    • data = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0) #Extract and explore the posterior
    • posterior <- prop_model(data)
    • print(head(posterior)) #Extract and explore the posterior
    • posterior <- prop_model(data)
    • head(posterior) #Plot the histogram of the posterior
    • hist(posterior)
    • data = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0) #Extract and explore the posterior
    • posterior <- prop_model(data)
    • head(posterior) #Edit the histogram
    • hist(posterior, breaks = 30, xlim = c(0,1), col = "palegreen4")

    4. Summarizing the zombie drug experiment

    • data = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    • posterior <- prop_model(data)
    • hist(posterior, breaks = 30, xlim = c(0, 1), col = "palegreen4") #Calculate the median
    • median(posterior) #Calculate the credible interval
    • quantile(posterior, c(0.05,0.95)) #Calculate the sum (posterior > 7%)
    • sum(posterior > 0.07) #Calculate the probability
    • sum(posterior >0.07)/length(posterior)

    5. Take a generative model for a spin

    #The generative zombie drug model #Set parameters
    • prop_success <- 0.42
    • n_zombies <- 100 #Simulating data
    • data <- c()
    • for(zombie in 1:n_zombies) { data[zombie] <- runif(1, min = 0, max = 1) < prop_success }
    • data <- as.numeric(data) #Count cured
    • data <- sum(data)

    6. Take the binomial distribution for a spin

    • rbinom(n = 1, size = 100, prob = 0.42)
    • rbinom(n = 200, size = 100, prob = 0.42)

    7. How many visitors could your site get (1)?

    #Fill in the parameters
    • n_samples <- 100000
    • n_ads_shown <- 100
    • proportion_clicks <- 0.1
    • n_visitors <- rbinom(n_samples, size = n_ads_shown, prob = proportion_clicks) #Visualize n_visitors
    • hist(n_visitors)

    8. Adding a prior to the model

    • n_samples <- 100000
    • n_ads_shown <- 100 #Update proportion_clicks
    • proportion_clicks <- runif(n = n_samples, min = 0.0, max = 0.2)
    • n_visitors <- rbinom(n = n_samples, size = n_ads_shown, prob = proportion_clicks) #Visualize proportion clicks
    • hist(proportion_clicks) # See uniform distribution #Visualize n_visitors
    • hist(n_visitors)

    9. Update a Bayesian model with data

    #Create the prior data frame
    • prior <- data.frame(proportion_clicks, n_visitors) #Examine the prior data frame
    • head(prior) #Create the posterior data frame
    • posterior <- prior[prior$n_visitors == 13, ] #Visualize posterior proportion clicks
    • hist(posterior$proportion_clicks)

    10. How many visitors could your site get (3)?

    #Assign posterior to a new variable called prior
    • prior <- posterior #Take a look at the first rows in prior
    • head(prior) #Replace prior$n_visitors with a new sample and visualize the result
    • n_samples <- nrow(prior)
    • n_ads_shown <- 100
    • prior$n_visitors <- rbinom(n = n_samples, size = n_ads_shown, prob = prior$proportion_clicks)
    • hist(prior$n_visitors) #Calculate the probability that you will get 5 or more visitors
    • sum(prior$n_visitors >= 5) / length(prior$n_visitors)

    11. Explore using the Beta distribution as a prior

    #Explore using the rbeta function
    • beta_sample <- rbeta(n = 1000000, shape1 = 1, shape2 = 1) #Visualize the results
    • hist(beta_sample) #Modify the parameters
    • beta_sample <- rbeta(n = 1000000, shape1 = -1, shape2 = 1) #Explore the results
    • head(beta_sample) #Modify the parameters
    • beta_sample <- rbeta(n = 1000000, shape1 = 100, shape2 = 100) #Visualize the results
    • hist(beta_sample) #Modify the parameters
    • beta_sample <- rbeta(n = 1000000, shape1 = 100, shape2 = 20) #Visualize the results
    • hist(beta_sample)

    12. Change the model to use an informative prior

    • n_draws <- 100000
    • n_ads_shown <- 100 #Change the prior on proportion_clicks
    • proportion_clicks <- rbeta(n_draws, shape1 = 5, shape2 = 95)
    • n_visitors <- rbinom(n_draws, size = n_ads_shown, prob = proportion_clicks)
    • prior <- data.frame(proportion_clicks, n_visitors)
    • posterior <- prior[prior$n_visitors == 13, ] #This plots the prior and the posterior in the same plot
    • par(mfcol = c(2, 1))
    • hist(prior$proportion_clicks, xlim = c(0, 0.25))
    • hist(posterior$proportion_clicks, xlim = c(0, 0.25))

    13. Fit the model using another dataset

    #Define parameters
    • n_draws <- 100000
    • n_ads_shown <- 100
    • proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2)
    • n_visitors <- rbinom(n = n_draws, size = n_ads_shown, prob = proportion_clicks)
    • prior <- data.frame(proportion_clicks, n_visitors) #Create the posteriors for video and text ads
    • posterior_video <- prior[prior$n_visitors == 13, ]
    • posterior_text <- prior[prior$n_visitors == 6, ] #Visualize the posteriors
    • hist(posterior_video$proportion_clicks, xlim = c(0, 0.25))
    • hist(posterior_text$proportion_clicks, xlim = c(0, 0.25))

    14. Calculating the posterior difference

    • posterior <- data.frame(video_prop = posterior_video$proportion_clicks[1:4000], text_prop = posterior_text$proportion_clicks[1:4000]) #Calculate the posterior difference: video_prop - text_prop
    • posterior$prop_diff <- posterior$video_prop - posterior$text_prop #Visualize prop_diff
    • hist(posterior$prop_diff) #Calculate the median of prop_diff
    • median(posterior$prop_diff) #Calculate the proportion
    • sum(posterior$prop_diff > 0) / length(posterior$prop_diff)

    15. A small decision analysis 1

    • visitor_spend <- 2.53
    • video_cost <- 0.25
    • text_cost <- 0.05 #Add the column posterior$video_profit
    • posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost #Add the column posterior$text_profit
    • posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost #Visualize the video_profit and text_profit columns
    • hist(posterior$video_profit)
    • hist(posterior$text_profit)

    16. A small decision analysis 2

    #Add the column posterior$profit_diff
    • posterior$profit_diff <- posterior$video_profit - posterior$text_profit #Visualize posterior$profit_diff
    • hist(posterior$profit_diff) #Calculate a "best guess" for the difference in profits
    • median(posterior$profit_diff) #Calculate the probability that text ads are better than video ads
    • sum(posterior$profit_diff < 0) / length(posterior$profit_diff)

    17. The Poisson distribution

    #Simulate from a Poisson distribution and visualize the result
    • x <- rpois(n = 10000, lambda = 3)
    • hist(x) #Simulate from a Poisson distribution and visualize the result
    • x <- rpois(n = 10000, lambda = 11.5)
    • hist(x) #Calculate the probability of break-even assuming you won't break even unless you sell 15 or more ice creams
    • sum(x >=15)/length(x)

    18. Clicks per day instead of clicks per ad

    #The original clicks-per-ad model (binomial), before the change
    • n_draws <- 100000
    • n_ads_shown <- 100
    • proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2)
    • n_visitors <- rbinom(n_draws, size = n_ads_shown, prob = proportion_clicks)
    • prior <- data.frame(proportion_clicks, n_visitors)
    • posterior <- prior[prior$n_visitors == 13, ] #Replace proportion_clicks with mean_clicks, change the parameters, and change this model so that it uses a Poisson distribution
    • n_draws <- 100000
    • n_ads_shown <- 100
    • mean_clicks <- runif(n_draws, min = 0, max = 80)
    • n_visitors <- rpois(n_draws, lambda = mean_clicks)
    • prior <- data.frame(mean_clicks, n_visitors)
    • posterior <- prior[prior$n_visitors == 19, ] #Visualize mean_clicks
    • hist(prior$mean_clicks)
    • hist(posterior$mean_clicks)

    19. Cards and the sum rule

    #Calculate the probability of drawing any of the four aces
    • prob_to_draw_ace <- 1/52 + 1/52 + 1/52 + 1/52

    20. Cards and the product rule

    #Calculate the probability of picking four aces in a row
    • prob_to_draw_four_aces <- 4/52 * 3/51 * 2/50 * 1/49

    21. From rbinom to dbinom

    #rbinom version
    • n_ads_shown <- 100
    • proportion_clicks <- 0.1
    • n_visitors <- rbinom(n = 99999, size = n_ads_shown, prob = proportion_clicks)
    • prob_13_visitors <- sum(n_visitors == 13) / length(n_visitors) #dbinom version
    • n_ads_shown <- 100
    • proportion_clicks <- 0.1
    • prob_13_visitors <- dbinom(x = 13, size = n_ads_shown, prob = proportion_clicks)

    22. Calculating probabilities with dbinom

    #Change the code to calculate probability distribution of n_visitors
    • n_ads_shown <- 100
    • proportion_clicks <- 0.1
    • n_visitors <- seq(0, 100, by = 1)
    • prob <- dbinom(n_visitors, size = n_ads_shown, prob = proportion_clicks) #Plot the distribution
    • plot(x = n_visitors, y = prob, type ="h") #Flip the chart (fix n_visitor and vary proportion_clicks -> want to know proportion_click)
    • n_ads_shown <- 100
    • proportion_clicks <- seq(0, 1, by = 0.01)
    • n_visitors <- 13
    • prob <- dbinom(n_visitors, size = n_ads_shown, prob = proportion_clicks); prob
    • plot(proportion_clicks, prob, type = "h")

    23. Calculating a joint distribution

    • n_ads_shown <- 100
    • proportion_clicks <- seq(0, 1, by = 0.01)
    • n_visitors <- seq(0, 100, by = 1)
    • pars <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors)
    • pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
    • pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown, prob = pars$proportion_clicks) #Add the column pars$probability and normalize it
    • pars$probability <- pars$likelihood * pars$prior / sum(pars$likelihood * pars$prior)

    24. Conditioning on the data (again)

    • n_ads_shown <- 100
    • proportion_clicks <- seq(0, 1, by = 0.01)
    • n_visitors <- seq(0, 100, by = 1)
    • pars <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors)
    • pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
    • pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown, prob = pars$proportion_clicks)
    • pars$probability <- pars$likelihood * pars$prior
    • pars$probability <- pars$probability / sum(pars$probability) #Condition on the data
    • pars <- pars[pars$n_visitors == 6, ] #Normalize again
    • pars$probability <- pars$probability / sum(pars$probability) #Plot the posterior pars$probability
    • plot(pars$probability, type = "h")

    25. A conditional shortcut

    #Simplify the code below by directly conditioning on the data
    • n_ads_shown <- 100
    • proportion_clicks <- seq(0, 1, by = 0.01)
    • n_visitors <- 6
    • pars <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors)
    • pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
    • pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown, prob = pars$proportion_clicks)
    • pars$probability <- pars$likelihood * pars$prior
    • pars$probability <- pars$probability / sum(pars$probability)
    • pars <- pars[pars$n_visitors == 6, ] # Redundant now that n_visitors is fixed at 6
    • pars$probability <- pars$probability / sum(pars$probability)
    • plot(pars$probability, type = "h")

    26. rnorm, dnorm, and the weight of newborns

    #Assign mu and sigma
    • mu <- mean(c(3164, 3362, 4435, 3542, 3578, 4529))
    • sigma <- sd(c(3164, 3362, 4435, 3542, 3578, 4529))
    • weight_distr <- rnorm(n = 100000, mean = mu, sd = sigma)
    • hist(weight_distr, 60, xlim = c(0, 6000), col = "lightgreen") #Assign mu and sigma
    • mu <- 3500
    • sigma <- 600
    • weight_distr <- rnorm(n = 100000, mean = mu, sd = sigma)
    • hist(weight_distr, 60, xlim = c(0, 6000), col = "lightgreen") #Create weight
    • weight <- seq(0, 6000, by = 100) #Calculate likelihood
    • likelihood <- dnorm(weight, mu, sigma) #Plot the distribution of weight
    • plot(weight, likelihood, type = "h")

    27. A Bayesian model of Zombie IQ

    #The IQ of a bunch of zombies
    • iq <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46) #Define the parameter grid
    • pars <- expand.grid(mu = seq(0, 150, length.out = 100), sigma = seq(0.1, 50, length.out = 100)) #Define and calculate the prior density for each parameter combination
    • pars$mu_prior <- dnorm(pars$mu, mean = 100, sd = 100)
    • pars$sigma_prior <- dunif(pars$sigma, min = 0.1, max = 50)
    • pars$prior <- pars$mu_prior * pars$sigma_prior #Calculate the likelihood for each parameter combination
    • for(i in 1:nrow(pars)) { likelihoods <- dnorm(iq, mean = pars$mu[i], sd = pars$sigma[i]); pars$likelihood[i] <- prod(likelihoods) } #Calculate the probability of each parameter combination
    • pars$probability <- pars$likelihood * pars$prior / sum(pars$likelihood * pars$prior)

    28. Sampling from the zombie posterior

    • head(pars)
    • sample_indices <- sample( nrow(pars), size = 10000, replace = TRUE, prob = pars$probability)
    • head(sample_indices) #Sample from pars to calculate some new measures
    • pars_sample <- pars[sample_indices, c("mu", "sigma")] #Visualize the mean IQ
    • hist(pars_sample$mu, 100, col = 'blue') #Calculate quantiles
    • quantile(pars_sample$mu, c(0.025, 0.5, 0.975))

    29. But how smart will the next zombie be?

    • head(pars_sample)
    • pred_iq <- rnorm(10000, mean = pars_sample$mu, sd = pars_sample$sigma) #Visualize pred_iq
    • hist(pred_iq) #Calculate the probability of a zombie being "smart" (+60 IQ)
    • sum(pred_iq >= 60)/length(pred_iq)

    30. The BEST models and zombies on a diet

    #The IQ of zombies on a regular diet and a brain based diet.
    • iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
    • iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46) #Calculate the mean difference in IQ between the two groups
    • mean(iq_brains - iq_regular) #Fit the BEST model to the data from both groups
    • library(BEST)
    • best_posterior <- BESTmcmc(iq_brains, iq_regular) #Plot the model result
    • plot(best_posterior)

    31. BEST is robust

    #The IQ of zombies given a regular diet and a brain based diet.
    • iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
    • iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 150) #Modify the data above and calculate the difference in means
    • mean(iq_brains - iq_regular) #The IQ of zombies given a regular diet and a brain based diet.
    • iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
    • iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 150) # <- Mutant zombie #Modify the data above and calculate the difference in means
    • mean(iq_brains) - mean(iq_regular) #Fit the BEST model to the modified data and plot the result
    • library(BEST)
    • best_posterior <- BESTmcmc(iq_brains, iq_regular)
    • plot(best_posterior)

    Course 8: Hyperparameter Tuning in R

    Chapter 1. Introduction to hyperparameters

    1. Parameters vs hyperparameters

    1.1 Parameters (model parameters)

    1. Definition = model parameters are being fit DURING training, and they are the result of model fitting or training
    2. Examples in linear model: coefficients (found during fitting)
    3. Examples in machine learning model: weights, biases of neural nets that are optimized during training (model parameters)

    1.2 Hyperparameters

    1. Definition = parameters being set BEFORE training, and they specify HOW the training is supposed to happen. (most are optional)
    2. Examples in linear model: method (an option to set before fitting)
    3. Examples in machine learning model: learning rates, weight decay, number of trees in random forest
    4. Why need tuning? = Want to find the best combination of hyperparameters

    2. Recap of machine learning basics

    2.1 Splitting data (ML with caret)

    1. Load
    • library(caret)
    • set.seed(42)
    1. Set partition index (proportion for training dataset)
    • INDEX <- createDataPartition(FULL_DF$RESP_VAR, p = 0.70, list = FALSE) # Set partition proportion, e.g. 70:30
    1. Split data into training and testing dataset
    • TRAINING_DATA <- FULL_DF[INDEX,]
    • TESTING_DATA <- FULL_DF[-INDEX,] Note: training dataset should have enough power and test set should be representative

    2.2 Train ML with caret

    1. Set up cross-validation
    • library(caret)
    • library(tictoc) # Used to estimate how long the run takes
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5) # Set cross-validation (number of folds and number of repeats) Note: Valid resampling methods in caret
      • LGOCV (leave-group-out cross-validation)
    • boot (note: AdaBoost is a boosting algorithm, not a resampling scheme)
      • cv
    1. Train a random forest model with caret (automatic tuning)
    • tic()
    • set.seed(INT)
    • rf_model <- train(Y ~ X, data = TRAINING_DATA, method ="rf", trControl = fitControl, # Specify validation
      verbose =FALSE)
    • toc()

    2.3 List of mode methods in caret

    • "gbm" = Stochastic Gradient Boosting model
    • "nnet" = Regulat neural network
    • "rf" = Random forest
    • "svmPoly" = Support Vector Machines

    3. Hyperparameter tuning

    3.1 Specific hyperparameters for model algorithm

    • modelLookup("gbm") # Specify model
    • https://topepo.github.io/caret/available-models.html

    3.2 Hyperparameter tuning in caret (e.g. gbm)

    1. Define hyperparameter grid (Cartesian grid)
    • HYPERPARAMETERS <- expand.grid(n.trees = 200, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10) # n.trees = number of trees, interaction.depth = tree complexity, shrinkage = learning rate, n.minobsinnode = min observations in a node when splitting
    1. Apply hyperparameter grid to train().
    • set.seed(INT)
    • gbm_model <- train(Y ~ X, data = TRAINING_DATA, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3), verbose = FALSE, tuneGrid = HYPERPARAMETERS)

    3.3 Hyperparameter tuning in Support Vector Machines (SVM)

    1. tuneLength
    • tic()
    • set.seed(INT)
    • SVM <- train(Y ~ X,
      data = TRAINING_DATA,
      method ="svmPoly",
      trControl = fitControl,
      verbose =FALSE,
      tuneLength =5) # Specify maximum number of tuning parameter combinations
    • toc()
    1. tuneGrid and expand.grid (manual method)
    • HYPERPARAMETERS <- expand.grid(degree =4, scale =1, C =1) # Specify grid or hyperparameter(s) to tune
    • set.seed(INT)
    • SVM <- train(Y ~ X,
      data = TRAINING_DATA,
      method ="svmPoly",
      trControl = fitControl,
    verbose = FALSE, tuneGrid = HYPERPARAMETERS) # Specify tuneGrid
    • toc()

    Chapter 2. Hyperparameter tuning with caret

    1. Hyperparameter tuning in caret

    1.1 Method 1: Cartesian grid search with caret

    1. Concept: define a grid of possible tuning parameters and run them all; con = takes a long time to run
    2. Commands
    • MANUAL_GRID <- expand.grid(n.trees = c(100, 200, 250), interaction.depth = c(1, 4, 6), shrinkage = 0.1, n.minobsinnode = 10)

    1.2 Method 2: Plot hyperparameter models

    • plot(GBM_MODEL) # Plots a graph with one line per hyperparameter value
    • plot(GBM_MODEL, metric = "Kappa", plotType = "level") # Use a level plot Note: Kappa is used to evaluate the performance of a classification model (comparing expected vs. observed accuracy; higher Kappa = better)

    2. Grid Search (Method 1) vs. Random Search (Method 3)

    2.1 Grid search (Cartesian grid)

    1. Concept: you specify the candidate values to try; cons = gets slow and computationally expensive (especially with a big grid)
    2. Commands 2.1) Define a grid
    • BIG_GRID <- expand.grid(n.trees = seq(from = 10, to = 300, by = 50), interaction.depth = seq(from = 1, to = 10, length.out = 6), shrinkage = 0.1, n.minobsinnode = 10)
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "grid") # Specify search = "grid" 2.2) Set the "tuneGrid" argument
    • gbm_model <- train(Y ~ X, data = TRAINING_DATA, method = "gbm", trControl = fitControl, verbose = FALSE, tuneGrid = BIG_GRID)

    2.2 Method 3: Random search

    1. Concept: random values; pros = faster than grid search
    2. Note: in caret, random search cannot be combined with grid search
    3. Commands 3.1) Define random search in trainControl function
    • library(caret)
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "random") # Specify argument search "random" 3.2) Set "tuneLength" argument
    • tic()
    • set.seed(INT)
    • GBM_MODEL <- train(Y ~X, data = TRAINING_DATA, method = "gbm", trControl = fitControl, verbose = FALSE, tuneLength = 5) # Set tuneLength or maximum number of tuning parameter combinations
    • toc()

    3. Method 4: Adaptive resampling

    3.1 Definition and concept

    1. Grid search (method 1) = all hyperparameters are computed.
    2. Random search (method 3) = random subsets of hyperparameters are computed.
    3. Adaptive resampling = hyperparameter combinations are resampled with values near combinations that performed well (suboptimal is not tested). Pros = faster and more efficient, but not better than methods 1 or 3

    3.2 Steps and commands

    1. Set adaptive resampling in trainControl()
    • fitControl <- trainControl(method = "adaptive_cv", adaptive = list(min = 2, # min no of resamples per hyperparam alpha = 0.05, # confidence level for removing hyperparam method = "gls", # "gls" = linear; "BT = Bradley Terry complete = TRUE), # TRUE = generate full resampling set search = "random")
    1. Apply trainControl() and set tuneLength argument
    • tic()
    • set.seed(INT)
    • GBM_MODEL <- train(Y ~ X, data = TRAINING_DATA, method = "gbm", trControl = fitControl, verbose = FALSE, tuneLength = 7) # Set tuneLength or maximum number of tuning parameter combinations (in reality, maybe 100)(default = 3)
    • toc()

    Chapter 3. Hyperparameter tuning with mlr

    1. Machine learning with mlr

    1.1 Basics

    1. mlr = another framework for ML in R
    2. Model training follows 3 steps 2.1) Define the task 2.2) Define the learner 2.3) Fit the model

    1.2 Tasks in mlr for supervised learning

    1. RegrTask(): for regression
    2. ClassifTask(): for binary and multi-class classification
    3. MultilabelTask(): for multi-label classification problems
    4. CostSensTask(): for general cost-sensitive classification

    1.3 Learners in mlr

    • listLearners() # Return available learners: classif. = classification learner, regr. = regression learner, multilabel. = multi-label classification

    1.4 Steps and commands

    1. Define the task
    • TASK <- makeClassifTask(data = DF, target = "RESP_VAR")
    1. Define the learner
    • LRN <- makeLearner("classif.h2o.deeplearning", # Enter a learner fix.factors.prediction = TRUE, # Handle missing data predict.type = "prob")
    1. Fit the model
    • MODEL <- train(LRN, TASK)

    2. Hyperparameter tuning with mlr (grid and random search)

    2.1 Requirements for tuning in mlr

    1. Search space for every hyperparameter
    2. Tuning method (e.g. grid or random)
    3. Resampling strategy

    2.2 Steps and commands

    1. Define the search space
    • getParamSet("classif.h2o.deeplearning") # Returns hyperparameter in the model
    • PARAM_SET <- makeParamSet( makeDiscreteParam("hidden", values =list(one =10, two =c(10,5,10))), makeDiscreteParam("activation", values =c("Rectifier","Tanh")), makeNumericParam("l1", lower =0.0001, upper =1),
      makeNumericParam("l2", lower =0.0001, upper =1))
    1. Define the tuning method 2.1) Grid search: ONLY deal with discrete hyperparameter sets
    • CTRL_GRID <- makeTuneControlGrid() 2.2) Random search: CAN deal with all kinds of hyperparameter
    • CTRL_RANDOM <- makeTuneControlRandom(maxit = INT) # Specify max number
    1. Define resampling strategy
    • CROSS_VAL <- makeResampleDesc("RepCV", predict = "both", folds = 5 * 3) # Could also be "Holdout"; predict = "both" to predict on training AND validation data
    1. Apply 3 steps of model training
    • TASK <- makeClassifTask(data = DF, target = "RESP_VAR")
    • LRN <- makeLearner("classif.h2o.deeplearning", # Enter a learner fix.factors.prediction = TRUE, # Handle missing data predict.type = "prob")
    • MODEL <- train(LRN, TASK)
    1. Tune the model
    • LRN_TUNE <- tuneParams(LRN, TASK, resampling = CROSS_VAL, control = CTRL_GRID, par.set = PARAM_SET)

    3. Evaluating hyperparameters with mlr

    3.1 Benefits of hyperparameter evaluation

    1. How different hyperparameters affect the performance of our model
    2. Which hyperparameters have a strong or weak impact on our model performance
    3. Whether our hyperparameter search converged / whether we can be confident that we found the most optimal hyperparameter combination

    3.2 Steps and commands

    1. See tuning results
    • HYPERPAR_EFFECTS <- generateHyperParsEffectData(LRN_TUNE, partial.dep =TRUE) # Set to TRUE if tuning > 2 params
    1. Plot hyperparameter tuning results
    • plotHyperParsEffect(HYPERPAR_EFFECTS, partial.dep.learn ="regr.randomForest", x ="l1", y ="mmce.test.mean", z ="hidden",
      plot.type ="line")

    4. Advanced tuning with mlr

    4.1 List of advanced tuning controls

    1) makeTuneControlCMAES: CMA Evolution Strategy 2) makeTuneControlDesign: Predefined data frame of hyperparameters 3) makeTuneControlGenSA: Generalized simulated annealing (non-linear) 4) makeTuneControlIrace: Tuning with iterated F-Racing (config) 5) makeTuneControlMBO: Model-based / Bayesian optimization (uses Bayesian statistics to optimize)

    4.2 Steps and commands (examples with makeTuneControlGenSA)

    1. Generalized simulated annealing
    • CTRL_GENSA <- makeTuneControlGenSA()
    1. Create bootstrap sampling
    • BOOTSTRAP <- makeResampleDesc("Bootstrap", predict ="both")
    1. Perform tuning
    • LRN_TUNE <- tuneParams(learner = LRN, task = TASK, resampling = BOOTSTRAP,
      control = CTRL_GENSA,
      par.set = PARAM_SET,
    measures = list(acc, setAggregation(acc, train.mean), mmce, setAggregation(mmce, train.mean))) # Customize measures
    1. Nested cross-validation & nested resampling 4.1) Train directly
    • MODEL_NESTED <- train(LRN, TASK)
    • getTuneResult(MODEL_NESTED) 4.2) Add 2x nested validation
    • CV2 <- makeResampleDesc("CV", iters =2)
    • RES <- resample(LRN, TASK, resampling = CV2,
      extract = getTuneResult)
    • generateHyperParsEffectData(RES)
    1. Choose hyperparameters from a tuning set
    • LRN_BEST <- setHyperPars(LRN, par.vals =list(minsplit =4, minbucket =3, maxdepth =6))
    • MODEL_BEST <- train(LRN_BEST, TASK)
    • predict(MODEL_BEST, newdata = DF)

    Chapter 4. Hyperparameter tuning with h2o

    1. Machine learning with h2o

    1.1 Basics

    1. h2o = an open-source framework for machine learning in R
    2. Difference from caret and mlr = scalability

    1.2 Steps and commands

    1. Initiate the package
    • library(h2o)
    • h2o.init() # Local initiate
    1. Prepare the data 2.1) Define H2o frame
    • DF_HF <- as.h2o(DF) 2.2) Define features and target variable
    • TARGET <- "VAR"
    • FEATURE <- setdiff(colnames(DF_HF), TARGET) Note: for classification target variable
    • DF_HF[, TARGET] <- as.factor(DF_HF[, TARGET])
    1. Split training, validation and test sets 3.1) Split data
    • SFRAME <- h2o.splitFrame(data = DF_HF, ratios = c(0.7, 0.15), seed = INT) # Default ratios = c(0.75, 0.25); set seed
    • TRAIN_DATA <- SFRAME[[1]]
    • VALIDATE_DATA <- SFRAME[[2]]
    • TEST_DATA <- SFRAME[[3]] 3.2) See ratio
    • summary(TRAIN_DATA$VAR, exact_quantiles = TRUE)
    1. Train model
    • GBM_MODEL <- h2o.gbm(x = FEATURE, y = TARGET, training_frame = TRAIN_DATA, validation_frame = VALIDATE_DATA) Note: available models in h2o: h2o.gbm() & h2o.xgboost() = gradient boosted models; h2o.glm() = generalized linear models; h2o.randomForest() = random forest models; h2o.deeplearning() = neural networks
    1. Predict new data
    • h2o.predict(GBM_MODEL, TEST_DATA)
    1. Evaluate the model performance
    • PERFORMANCE <- h2o.performance(GBM_MODEL, TEST_DATA)
    • h2o.confusionMatrix(PERFORMANCE)
    • h2o.logloss(PERFORMANCE)

    2. Hyperparameter tuning with h2o (grid and random search)

    2.1 Hyperparameters in H2O models

    • ?h2o.gbm # Search hyperparameters for gradient boosting

    2.2 Steps and commands [search in step 6)]

    1. Initiate the package
    • library(h2o)
    • h2o.init() # Local initiate
    1. Prepare the data 2.1) Define H2o frame
    • DF_HF <- as.h2o(DF) 2.2) Define features and target variable
    • TARGET <- "VAR"
    • FEATURE <- setdiff(colnames(DF_HF), TARGET) Note: for classification target variable
    • DF_HF[, TARGET] <- as.factor(DF_HF[, TARGET])
    1. Split training, validation and test sets 3.1) Split data
    • SFRAME <- h2o.splitFrame(data = DF_HF, ratios = c(0.7, 0.15), seed = INT) # Default ratios = c(0.75, 0.25); set seed
    • TRAIN_DATA <- SFRAME[[1]]
    • VALIDATE_DATA <- SFRAME[[2]]
    • TEST_DATA <- SFRAME[[3]] 3.2) See ratio
    • summary(TRAIN_DATA$VAR, exact_quantiles = TRUE)
    1. Train model
    • GBM_MODEL <- h2o.gbm(x = FEATURE, y = TARGET, training_frame = TRAIN_DATA, validation_frame = VALIDATE_DATA)
    1. Manage the hyperparameter grid 5.1) Define hyperparameters
    • GBM_PARAMS <- list(ntrees = c(100, 150, 200), max_depth = c(3, 5, 7), learn_rate = c(0.001, 0.01, 0.1)) 5.2) Define grid
    • GBM_GRID <- h2o.grid("gbm", grid_id = "gbm_grid", x = FEATURE, y = TARGET, training_ frame = train, validation _ frame = valid, seed = 42, hyper _params = gbm _params)
    1. Predict new data
    • h2o.predict(GBM_MODEL, TEST_DATA)
    1. Evaluate the model performance
    • PERFORMANCE <- h2o.performance(GBM_MODEL, TEST_DATA)
    • h2o.confusionMatrix(PERFORMANCE)
    • h2o.logloss(PERFORMANCE)
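    The steps above evaluate GBM_MODEL rather than the grid itself; as a hedged sketch (assuming h2o's grid accessors h2o.getGrid() and h2o.getModel() behave as in current h2o releases), the tuned models can be inspected and the best one evaluated like this:

    • GBM_GRID_SORTED <- h2o.getGrid(grid_id = "gbm_grid", sort_by = "auc", decreasing = TRUE) # Models sorted best-first by validation AUC
    • BEST_MODEL <- h2o.getModel(GBM_GRID_SORTED@model_ids[[1]]) # Retrieve the top model
    • h2o.performance(BEST_MODEL, TEST_DATA) # Evaluate it on the held-out test frame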


    Exercise (Course 8: Hyperparameter Tuning in R)

    1. Model parameters vs. hyperparameters

    #Fit a linear model on the breast_cancer_data.
    • linear_model <- lm(concavity_mean ~ symmetry_mean, data = breast_cancer_data) #Look at the summary of the linear_model.
    • summary(linear_model) #Extract the coefficients.
    • linear_model$coefficients

    2. What are the coefficients?

    • library(ggplot2) #Plot linear relationship.
    • ggplot(data = breast_cancer_data, aes(x = symmetry_mean, y = concavity_mean)) + geom_point(color = "grey") + geom_abline(slope = linear_model$coefficients[2], intercept = linear_model$coefficients[1])

    3. Machine learning with caret

    #Create partition index
    • index <- createDataPartition(breast_cancer_data$diagnosis, p = 0.7, list = FALSE) #Subset breast_cancer_data with index
    • bc_train_data <- breast_cancer_data[index, ]
    • bc_test_data <- breast_cancer_data[-index, ] #Define 3x5 folds repeated cross-validation
    • fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3) #Run the train() function
    • gbm_model <- train(diagnosis ~ ., data = bc_train_data, method = "gbm", trControl = fitControl, verbose = FALSE)

    4. Changing the number of hyperparameters to tune

    #Set seed.
    • set.seed(42) #Start timer.
    • tic() #Train model.
    • gbm_model <- train(diagnosis ~ ., data = bc_train_data, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3), verbose = FALSE, tuneLength = 4) #Stop timer.
    • toc()

    5. Tune hyperparameters manually

    #Define hyperparameter grid.
    • hyperparams <- expand.grid(n.trees = 200, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10)

      #Apply hyperparameter grid to train().

    • set.seed(42)

    • gbm_model <- train(diagnosis ~ ., data = bc_train_data, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3), verbose = FALSE, tuneGrid = hyperparams)

    6. Cartesian grid search in caret

    #Define Cartesian grid
    • man_grid <- expand.grid(degree = c(1, 2, 3), scale = c(0.1, 0.01, 0.001), C = 0.5) #Start timer, set seed & train model
    • tic()
    • set.seed(42)
    • svm_model_voters_grid <- train(turnout16_2016 ~ ., data = voters_train_data, method = "svmPoly", trControl = fitControl, verbose= FALSE, tuneGrid = man_grid)
    • toc()

    7. Plot hyperparameter model output

    #Plot default
    • plot(svm_model_voters_grid) #Plot Kappa level-plot
    • plot(svm_model_voters_grid, metric = "Kappa", plotType = "level")

    8. Grid search with range of hyperparameters

    #Define the grid with hyperparameter ranges
    • big_grid <- expand.grid(size = seq(from = 1, to = 5, by = 1), decay = c(0, 1)) #Train control with grid search
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "grid") #Train neural net
    • tic()
    • set.seed(42)
    • nn_model_voters_big_grid <- train(turnout16_2016 ~ ., data = voters_train_data, method = "nnet", trControl = fitControl, verbose = FALSE, tuneGrid = big_grid)
    • toc()
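
    For contrast (not part of the exercise), caret can also sample candidates at random instead of walking a fixed grid; a sketch under the same data assumptions, with nn_model_voters_random as a hypothetical name:
    • fitControl_random <- trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "random")
    • set.seed(42)
    • nn_model_voters_random <- train(turnout16_2016 ~ ., data = voters_train_data, method = "nnet", trControl = fitControl_random, verbose = FALSE, tuneLength = 6) # tuneLength = number of random candidates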

    9. Adaptive Resampling with caret

    #Define trainControl function
    • fitControl <- trainControl(method = "adaptive_cv", number = 3, repeats = 3, adaptive = list(min = 3, alpha = 0.05, method = "BT", complete = FALSE), search = "random") #Start timer & train model
    • tic()
    • svm_model_voters_ar <- train(turnout16_2016 ~ ., data = voters_train_data, method = "nnet", trControl = fitControl, verbose = FALSE, tuneLength = 6)
    • toc()

    10. Modeling with mlr

    #Create classification task
    • task <- makeClassifTask(data = knowledge_train_data, target = "UNS") #Call the list of learners
    • listLearners() %>% as.data.frame() %>% select(class, short.name, package) %>% filter(grepl("classif.", class)) #Create learner
    • lrn <- makeLearner("classif.randomForest", predict.type = "prob", fix.factors.prediction = TRUE)

    11. Random search with mlr

    #Get the parameter set for neural networks of the nnet package
    • getParamSet("classif.nnet") #Define set of parameters
    • param_set <- makeParamSet( makeDiscreteParam("size", values = c(2,3,5)), makeNumericParam("decay", lower = 0.0001, upper = 0.1) ) #Print parameter set
    • print(param_set) #Define a random search tuning method.
    • ctrl_random <- makeTuneControlRandom()
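
    For comparison (not part of the exercise), mlr's grid-search counterpart to the random control is makeTuneControlGrid():
    • ctrl_grid <- makeTuneControlGrid() # full grid over discrete values; numeric params are discretised via the resolution argument (default 10)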

    12. Perform hyperparameter tuning with mlr

    #Define a random search tuning method.
    • ctrl_random <- makeTuneControlRandom(maxit = 6) #Define a 3 x 3 repeated cross-validation scheme
    • cross_val <- makeResampleDesc("RepCV", folds = 3, reps = 3) #Tune hyperparameters
    • tic()
    • lrn_tune <- tuneParams(lrn, task, resampling = cross_val, control = ctrl_random, par.set = param_set)
    • toc()
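
    A short follow-up (not part of the exercise): the TuneResult returned by tuneParams() stores the winning combination and its resampled performance:
    • lrn_tune$x # named list with the best hyperparameter values
    • lrn_tune$y # resampled performance of that combination (mmce by default)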

    13. Evaluating hyperparameter tuning results

    #Create holdout sampling
    • holdout <- makeResampleDesc("Holdout") #Perform tuning
    • lrn_tune <- tuneParams(learner = lrn, task = task, resampling = holdout, control = ctrl_random, par.set = param_set) #Generate hyperparameter effect data
    • hyperpar_effects <- generateHyperParsEffectData(lrn_tune, partial.dep = TRUE) #Plot hyperparameter effects (note: this plot call assumes a learner tuned over minsplit/maxdepth, e.g. rpart, not the nnet parameters above)
    • plotHyperParsEffect(hyperpar_effects, partial.dep.learn = "regr.glm", x = "minsplit", y = "mmce.test.mean", z = "maxdepth", plot.type = "line")

    14. Define aggregated measures

    #Create holdout sampling
    • holdout <- makeResampleDesc("Holdout", predict = "both") #Perform tuning
    • lrn_tune <- tuneParams(learner = lrn, task = task, resampling = holdout, control = ctrl_random, par.set = param_set, measures = list(acc, setAggregation(acc, train.mean), mmce, setAggregation(mmce, train.mean)))
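
    A quick interpretation note (not part of the exercise): with the aggregated measures above, comparing the train-set and test-set means is a simple overfitting check; the exact element names (assumed here to end in .train.mean / .test.mean) follow from the aggregations used:
    • lrn_tune$y # e.g. a large gap between acc.train.mean and acc.test.mean suggests overfitting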

    15. Setting hyperparameters

    #Set hyperparameters
    • lrn_best <- setHyperPars(lrn, par.vals = list(size = 1, maxit = 150, decay = 0)) #Train model
    • model_best <- train(lrn_best, task)
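
    A minimal sketch (not part of the exercise) of using the final model on new data; knowledge_test_data is a hypothetical hold-out set with the same columns as knowledge_train_data:
    • pred_best <- predict(model_best, newdata = knowledge_test_data) # hypothetical test data
    • performance(pred_best, measures = list(acc, mmce)) # accuracy and mean misclassification error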

    16. Prepare data for modelling with h2o

    #Initialise h2o cluster
    • h2o.init() #Convert data to h2o frame
    • seeds_train_data_hf <- as.h2o(seeds_train_data) #Identify target and features
    • y <- "seed_type"
    • x <- setdiff(colnames(seeds_train_data_hf), y) #Split data into train & validation sets
    • sframe <- h2o.splitFrame(seeds_train_data_hf, seed = 42)
    • train <- sframe[[1]]
    • valid <- sframe[[2]] #Calculate ratio of the target variable in the training set
    • summary(train$seed_type, exact_quantiles = TRUE)

    17. Modeling with h2o

    #Train random forest model
    • rf_model <- h2o.randomForest(x = x, y = y, training_frame = train, validation_frame = valid) #Calculate model performance
    • perf <- h2o.performance(rf_model, valid = TRUE) #Extract confusion matrix
    • h2o.confusionMatrix(perf) #Extract logloss
    • h2o.logloss(perf)
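
    An optional follow-up (not part of the exercise): h2o exposes variable importance for tree ensembles, which helps interpret the random forest:
    • h2o.varimp(rf_model) # importance table, most important feature first
    • h2o.varimp_plot(rf_model) # bar chart of the same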



    Resource center

    1. Machine learning in R https://topepo.github.io/caret/model-training-and-tuning.html#basic