Machine Learning Scientist with R Course 1 - 7

    Course 1 Machine Learning in the Tidyverse

    Course 2 Cluster Analysis in R

    Course 3 Machine Learning with caret in R

    Course 4 Modeling with tidymodels in R

    Course 5 Machine Learning with Tree-Based Models in R

    Chapter 1. Classification Trees

    1.1 Tree-based models

    1.1.1 Pros

    1. Easy to explain and understand
    2. Can capture non-linear relationships
    3. Don't require preprocessing such as normalization or standardization of numeric features, or dummy indicator variables
    4. Robust to outliers
    5. Fast for large datasets

    1.1.2 Cons

    1. Hard to interpret and slow to train when large or ensembled
    2. Prone to overfitting: high-variance, complex trees

    1.2 Create a decision tree and commands

    1.2.1 Pick a model class with specifications of engine and mode

    • library(tidymodels)
    • TREE_SPEC <- decision_tree() %>%
      set_engine("rpart")%>% # engine that powers your model
      set_mode("classification") # Classification / regression 1.2.2 Fit the model using training data
    • TREE_MODEL<- TREE_SPEC %>% fit(formula = Y ~ X, data = TRAINING_DATA) 1.2.3 Split data into training and test datasets
    • set.seed(INT)
    • SPLIT_OBJ <- initial_split(FULL_DATA, prop = 0.9, strata = Y) # Specify proportion (default = 0.75); strata = outcome variable to be equally distributed across the splits
    • TRAINING_DATA <- training(SPLIT_OBJ)
    • TEST_DATA <- testing(SPLIT_OBJ)
    • nrow(TRAINING_DATA)/nrow(FULL_DATA) # Verify overall proportion
    • nrow(TEST_DATA)/nrow(FULL_DATA)
    • counts_train <- table(TRAINING_DATA$outcome) # Check imbalance
    • prop_yes_train <- counts_train["yes"]/sum(counts_train)
    • counts_test <- table(TEST_DATA$outcome)
    • prop_yes_test <- counts_test["yes"]/sum(counts_test) # Note: avoid class imbalance, esp. for diseases with rare outcomes

    1.2.4 Predict new data

    • PREDICTIONS <- predict(TREE_MODEL, new_data = TEST_DATA, type = "class")
    • predict(TREE_MODEL, new_data = TEST_DATA, type = "prob")

    1.2.5 Evaluate
    1. Confusion matrix (shows where a model gets confused: the diagonal holds the correct predictions)
    1.1) Values
    • True Positive (TP): prediction is yes, truth is yes
    • True Negative (TN): prediction is no, truth is no
    • False Positive (FP): prediction is yes, truth is no
    • False Negative (FN): prediction is no, truth is yes
    1.2) Metrics [ALL = TP + TN + FP + FN]
    1.2.1) Accuracy = (TP + TN) / ALL

      For further details, see "More model measures" in Chapter 3.

    1.2.2) Sensitivity
    1.2.3) Specificity
    1.2.4) ROC
    1.2.5) AUC

    2) Commands

    • PRED_COMBINED <- PREDICTIONS %>%
      mutate(true_class = TEST_DATA$outcome)
    • conf_mat(data = PRED_COMBINED, estimate = .pred_class, truth = true_class) # from yardstick package
    • accuracy(data = PRED_COMBINED, estimate = .pred_class, truth = true_class)

    Chapter 2. Regression Trees and Cross-Validation

    2.1 Construct a regression tree with hyperparameters

    • TREE_SPEC <- decision_tree(
        min_n = NUM_1,            # min no of data points in a node for a further split (default = 20)
        tree_depth = NUM_2,       # max depth of the tree (default = 30)
        cost_complexity = NUM_3   # penalty for complexity (default = 0.01)
      ) %>%
      set_engine("rpart") %>%
      set_mode("regression")
    • MODEL <- TREE_SPEC %>%
      fit(formula = Y ~ X, data = TRAINING_DATA)

    2.2 Predict using the regression tree

    • PREDICTIONS <- predict(MODEL, new_data = TEST_DATA)

    2.3 Evaluate the regression tree using metrics [how far from the truth]

    Out-of-sample performance (train on the training set, evaluate on the test set) vs in-sample performance (evaluate within the training set)

    3.1) Mean Absolute Error (MAE) = mean(abs(actual - predicted))

    • mae(PREDICTIONS, estimate = .pred, truth = Y)

    3.2) Root Mean Squared Error (RMSE) = sqrt(mean((actual - predicted)^2)); large errors get a higher weight

    • rmse(PREDICTIONS, estimate = .pred, truth = Y)
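    As a sanity check on what mae() and rmse() report, here is a minimal sketch of both calculations done by hand, assuming a tibble PREDICTIONS that holds the truth column Y and the prediction column .pred (placeholder names as above):

        errors <- PREDICTIONS$Y - PREDICTIONS$.pred
        mean(abs(errors))     # MAE: mean absolute error
        sqrt(mean(errors^2))  # RMSE: root mean squared error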

    2.4 Cross validation: give more robust out-of-sample performance

    4.1) Set seed and folds

    • set.seed(INT) # Set seed for reproducibility
    • DATA_FOLDS <- vfold_cv(TRAINING_DATA, v = INT) # Specify no of folds

    4.2) Fit the folds

    • FIT_CV <- fit_resamples(TREE_SPEC,
      Y ~ X,
      resamples = DATA_FOLDS,
      metrics = metric_set(mae, rmse))

    4.3) Collect all metrics (errors)

    • ALL_ERRORS <- collect_metrics(FIT_CV, summarize = FALSE) # If left at the default (TRUE), the mean of each metric is calculated

    4.4) Visualize the errors

    • ggplot(ALL_ERRORS, aes(.estimate, fill = .metric)) + geom_histogram()

    2.5 Note: Bias-variance tradeoff

    1. Simple vs complex model, e.g. more tree_depth -> more complex
    • simple_spec <- decision_tree(tree_depth = 2) %>%
      set_mode("regression")
    • complex_spec <- decision_tree(tree_depth = 15) %>%
      set_mode("regression")
    2. Overfitting = high variance (low training error but high test error) vs underfitting = high bias (high error on both training and test sets)
    • bind_rows(training = mae(train_results, estimate = .pred, truth = final_grade),
      test = mae(test_results, estimate = .pred, truth = final_grade),
      .id = "dataset")
      Note: The best model sits around the sweet spot between bias and variance
    3. Detecting overfitting and underfitting
    3.1) Overfitting -> reduce complexity (hyperparameter tuning)
    • collect_metrics(FIT_CV) # Out-of-sample/CV error (a high CV error indicates high variance)
    • mae(TRAINING_PRED, estimate = .pred, truth = final_grade) # In-sample error (small MAE)
    3.2) Underfitting -> increase complexity (hyperparameter tuning)
    • mae(TRAINING_PRED, estimate = .pred, truth = Y) # In-sample error (high MAE)

    Chapter 3. Hyperparameters and Ensemble Models

    3.1 Tuning hyperparameters

    3.1.1 Hyperparameters

    1. Characteristics = influence shape and complexity of trees
    2. Hyperparameters in the parsnip decision tree
    • min_n # minimum no of samples required to further split a node
    • tree_depth # max allowed depth of the tree
    • cost_complexity # penalizes the size of the tree

    3.1.2 Goal of tuning = finding the optimal set of hyperparameter values

    3.1.3 Steps of tuning
    1. Create placeholders: tune()
    • SPEC_UNTUNED <- decision_tree(
      min_n = tune(),
      tree_depth = tune() )%>%
      set_engine("rpart")%>%
      set_mode("classification")
    2. Create a tuning grid: grid_regular()
    • TREE_GRID <- grid_regular(
      parameters(SPEC_UNTUNED),
      levels = 3) # No of grid points for each parameter
    3. Tune the grid: tune_grid()
    3.1) Build a model for every grid point and evaluate each model out-of-sample (cross-validation, CV)
    • TUNE_RESULTS <- tune_grid(SPEC_UNTUNED,
      Y ~ X,
      resamples = DATA_FOLDS,
      grid = TREE_GRID,
      metrics = metric_set(accuracy))
    3.2) Visualize the tuning results
    • autoplot(TUNE_RESULTS)
    4. Use the best parameters: finalize_model()
    4.1) Select the best performing parameters
    • FINAL_PARAMS <- select_best(TUNE_RESULTS)
    4.2) Plug them into the specification
    • BEST_SPEC <- finalize_model(SPEC_UNTUNED, FINAL_PARAMS)
    5. Build the final model
    • FINAL_MODEL <- fit(BEST_SPEC, Y ~ X, data = FULL_DATA)

    3.2 More model measures

    3.2.1 Sensitivity = [TP / (TP+FN)]

    • sens(PREDICTIONS, estimate = .pred_class, truth = true_class) # PREDICTIONS = a tibble with .pred_class and true_class # Same arguments as accuracy() and conf_mat()

    3.2.2 Specificity = [TN / (TN+FP)]

    • spec(PREDICTIONS, estimate = .pred_class, truth = true_class)

    3.2.3 Receiver operating characteristic (ROC) curve # Sensitivity (true positive rate) vs 1 - specificity (false positive rate)

    1. Add probabilities on the test data # Classification uses a constant threshold, so all thresholds need to be examined
    • PREDICTIONS <- predict(MODEL, TEST_DATA, type = "prob") %>% bind_cols(TEST_DATA)
    2. Calculate the ROC curve for all thresholds
    • ROC <- roc_curve(PREDICTIONS, estimate = .pred_yes, truth = Y)
    3. Plot the ROC curve
    • autoplot(ROC)

    3.2.4 Area under (the ROC) curve (AUC)

    1. Calculate the AUC # only the function changes
    • roc_auc(PREDICTIONS, estimate = .pred_yes, truth = Y)
    2. Interpretation: 0 = every example incorrectly classified; 0.5 = no better than random chance; 1 = all examples correct for every threshold (perfect model), i.e. 100% accuracy with 0% false positive rate

    3.3 Bagged trees

    3.3.1 Background

    1. High variance among trees -> many heads are better than one (bagging)
    2. Bagging = short for Bootstrap Aggregation (a minimal manual sketch of the idea follows at the end of this section)

    3.3.2 Steps

    1. Bootstrap and train (random sampling with replacement; build K models on K samples in parallel; larger K = more time needed)
    1.1) Specify the bagged trees
    • library(baguette)
    • SPEC_BAGGED <- bag_tree() %>% set_mode("classification") %>% set_engine("rpart", times = INT) # Set K via times

    1.2) Train all trees

    • MODEL_BAGGED <- fit(SPEC_BAGGED, formula = Y ~ X, data = TRAINING_DATA) # Note: the higher the std.error for a variable, the more important it is
    2. Aggregate (generate predictions from each tree and combine them into an ensemble prediction; reduces variance among trees, i.e. higher performance; combined by averaging (regression) or majority vote (classification))
    • PREDICTIONS <- predict(MODEL_BAGGED, TEST_DATA, type = "prob") %>% bind_cols(TEST_DATA)
    • ROC <- roc_curve(PREDICTIONS, estimate = .pred_yes, truth = Y)
    • roc_auc(PREDICTIONS, estimate = .pred_yes, truth = Y)
    3. Check for overfitting using out-of-sample evaluation
    3.1) Estimate AUC using cross-validation (esp. for a model with high in-sample AUC)
    • set.seed(INT)
    • CV_RESULTS <- fit_resamples(SPEC_BAGGED, Y ~ X, resamples = vfold_cv(TRAINING_DATA, v = INT), metrics = metric_set(roc_auc))
    3.2) Collect metrics
    • collect_metrics(CV_RESULTS)
    3.3) Interpret: if the out-of-sample (CV) mean AUC is lower than the in-sample AUC, the model may be overfitting -> add more penalty (tune cost_complexity)
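    A minimal manual sketch of the bagging idea referenced above (bootstrap resamples, one rpart tree per resample, majority vote), assuming a training set TRAIN_DATA with a factor outcome Y and a test set TEST_DATA; baguette does all of this internally:

        library(rpart)

        set.seed(1)
        K <- 25  # number of bootstrap samples / trees

        # Bootstrap and train: one tree per resample drawn with replacement
        models <- lapply(seq_len(K), function(i) {
          idx <- sample(nrow(TRAIN_DATA), replace = TRUE)
          rpart(Y ~ ., data = TRAIN_DATA[idx, ], method = "class")
        })

        # Aggregate: majority vote across the K trees for each test row
        votes <- sapply(models, function(m) as.character(predict(m, newdata = TEST_DATA, type = "class")))
        majority_vote <- apply(votes, 1, function(v) names(which.max(table(v))))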

    3.4 Random forest (improved version of bagged trees)

    3.4.1 Characteristics

    1. Suitable for high-dimensional data
    2. Easy to use
    3. Out-of-the-box performance
    4. Available in many packages: ranger, randomForest; the tidymodels interface to these packages is rand_forest() in the parsnip package

    3.4.2 Concept

    1. Identical to bagging: train trees on bootstrap samples
    2. Difference = random subsets of predictors across trees -> each tree sees the dataset through a random selection of features -> majority vote (classification) -> final class

    3.4.3 Steps

    1. Specify a random forest model through hyperparameters
    • RAND_SPEC <- rand_forest(
        mtry = INT,     # Predictors seen at each split (default = sqrt(no. of predictors))
        trees = INT,    # No of trees in the forest
        min_n = 10) %>% # smallest node size allowed
      set_mode("classification") %>% # Set the mode
      set_engine("ranger") # Use engine ranger or randomForest
    2. Train the forest
    • RAND_SPEC %>% fit(Y ~ X, data = TRAINING_DATA)
    3. To see variable importance, use:
    • rand_forest(mode = "classification") %>%
      set_engine("ranger", importance = "impurity") %>% # Add impurity-based importance
      fit(still_customer ~ ., data = customers_train) %>%
      vip::vip() # Plot variable importance

    Chapter 4. Boosted Trees

    4.1 Introduction to boosting

    4.1.1 Definition = Improving model based on the previous model(s)

    4.1.2 Adaptive boosting (Adaboost)

    1. Concept: increase the weight of wrongly classified training examples in subsequent training rounds
    2. How it works:
    2.1) Start by training a decision tree with equal weight for each observation
    2.2) Increase the weight of observations that are hard to classify and lower the weight of easily classified observations
    3. Further improved by adding gradient descent
    4. Overall steps and commands
    4.1) Specify the spec
    • BOOST_SPEC <- boost_tree() %>% # Specify the model class
      set_mode("classification") %>% # Set the mode
      set_engine("xgboost") # Set the engine
    4.2) Fit the model
    • BOOST_MODEL <- fit(BOOST_SPEC, formula = Y ~ X, data = TRAINING_DATA)
    4.3) Evaluate the ensemble
    • set.seed(INT)
    • DATA_FOLDS <- vfold_cv(TRAINING_DATA, v = INT)
    • CV_RESULTS <- fit_resamples(BOOST_SPEC, Y ~ X, resamples = DATA_FOLDS, metrics = metric_set(roc_auc))

    4.2 Gradient boosting

    4.2.1 Adaboost

    1. Uses "decision stumps" as weak learners
    2. Attaches weights to observations: high weight for difficult observations, low weight for correct predictions (a minimal sketch of the weight update follows below)
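    A minimal sketch of the AdaBoost weight update step only (not a full implementation), assuming a logical vector missed marking which training observations the current weak learner misclassified; the values are illustrative:

        n <- 10
        w <- rep(1 / n, n)                           # start with equal weights
        missed <- c(TRUE, FALSE, TRUE, rep(FALSE, 7))

        err   <- sum(w[missed])                      # weighted error of the weak learner
        alpha <- 0.5 * log((1 - err) / err)          # the learner's say in the final vote
        w <- w * exp(ifelse(missed, alpha, -alpha))  # up-weight mistakes, down-weight correct ones
        w <- w / sum(w)                              # renormalize for the next round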

    4.2.2 Gradient boosting

    1. Uses "small decision trees" as weak learners (start w/ less complex trees)
    2. Lose function instead of weights (arrow function)
    3. Loss function optimization by gradient descent
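    A minimal sketch of gradient boosting for regression with squared-error loss: each shallow tree is fit to the current residuals (the negative gradient) and added with a small learning rate. rpart is used directly here, and chocolate_train / final_grade are illustrative names borrowed from the exercises:

        library(rpart)

        learn_rate <- 0.1
        n_trees    <- 50
        pred  <- rep(mean(chocolate_train$final_grade), nrow(chocolate_train)) # start from the mean
        trees <- vector("list", n_trees)
        boost_df <- chocolate_train

        for (b in seq_len(n_trees)) {
          boost_df$resid <- chocolate_train$final_grade - pred   # residuals = negative gradient of squared-error loss
          fit  <- rpart(resid ~ . - final_grade, data = boost_df,
                        control = rpart.control(maxdepth = 2))   # a small, weak tree
          pred <- pred + learn_rate * predict(fit, boost_df)     # take a small step towards the targets
          trees[[b]] <- fit
        }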

    4.2.3 Pros of boosting

    1. Among the best-performing machine learning models
    2. Good option for unbalanced data

    4.2.4 Cons of boosting

    1. Prone to overfitting
    2. Training can be slow (depending on learning rate hyperparameters)
    3. Many tuning hyperparameters

    4.2.5 Hyperparameters for gradient boosting

    1. min_n: (from the simple decision tree) min no of observations required for a further node split
    2. tree_depth: (from the simple decision tree) max depth of the tree / no of splits
    3. sample_size: (from random forests) amount of data exposed to the fitting routine
    4. trees: (from random forests) no of trees in the ensemble
    5. mtry: (from random forests) no of predictors randomly sampled at each split
    6. learn_rate: (from boosted trees) rate at which the boosting algorithm adapts from iteration to iteration
    7. loss_reduction: (from boosted trees) reduction in the loss function required to split further
    8. stop_iter: (from boosted trees) no of iterations without improvement before stopping (a spec setting all of these follows below)
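    A sketch of a boost_tree() specification that sets each hyperparameter above explicitly; the values are illustrative, not recommendations:

        library(tidymodels)

        boost_spec_full <- boost_tree(
          min_n          = 10,    # min observations in a node for a further split
          tree_depth     = 4,     # max depth of each tree
          sample_size    = 0.8,   # share of the training data exposed to each iteration
          trees          = 500,   # number of trees in the ensemble
          mtry           = 5,     # predictors randomly sampled at each split
          learn_rate     = 0.05,  # how strongly each iteration adapts to the previous one
          loss_reduction = 0.01,  # loss reduction required to split further
          stop_iter      = 10     # iterations without improvement before stopping early
        ) %>%
          set_mode("classification") %>%
          set_engine("xgboost")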

    4.3 Optimization of the boosted ensemble

    4.3.1 Tuning workflow and their commands

    1. Create the tuning spec: tune()
    • BOOST_SPEC <- boost_tree(
        trees = INT,          # No of trees
        learn_rate = tune(),
        tree_depth = tune(),
        sample_size = tune()) %>%
      set_mode("classification") %>%
      set_engine("xgboost")
    2. Create the tuning grid: grid_regular() OR grid_random()
    2.1) With grid_regular()
    • TUNEGRID_BOOST <- grid_regular(parameters(BOOST_SPEC), levels = INT) # Specify levels; no of combinations = levels ^ (no of tuned hyperparameters)
    2.2) With grid_random() # not an evenly spread grid, but random combinations
    • TUNEGRID_BOOST <- grid_random(parameters(BOOST_SPEC), size = INT) # Specify size (no of combinations)
    3. Split data: vfold_cv()
    • DATA_FOLDS <- vfold_cv(TRAINING_DATA, v = INT)
    4. Tune along the grid: tune_grid()
    • TUNE_RESULTS <- tune_grid(BOOST_SPEC, Y ~ X, resamples = DATA_FOLDS, grid = TUNEGRID_BOOST, metrics = metric_set(roc_auc))

    • autoplot(TUNE_RESULTS)

    5. Finalize the model: select_best() and finalize_model()
    5.1) Select the final hyperparameters
    • BEST_PARAMS <- select_best(TUNE_RESULTS)
    5.2) Finalize the specification
    • FINAL_SPEC <- finalize_model(BOOST_SPEC, BEST_PARAMS)
    6. Train the final model: fit()
    • FINAL_MODEL <- FINAL_SPEC %>% fit(formula = Y ~ X, data = TRAINING_DATA)

    4.4 Model comparison

    4.4.1 Steps of comparing models and their commands

    1. Combine predictions
    • PRED_COMBINED <- bind_cols(decision_tree, bagged_trees, random_forest, boosted_trees, TEST_DATA %>% select(Y))
    2. Calculate AUC
    2.1) For the decision tree
    • AUC_TREE <- roc_auc(PRED_COMBINED, truth = Y, estimate = PREDS_TREE)
    2.2) For the bagged trees
    • AUC_BAGGED <- roc_auc(PRED_COMBINED, truth = Y, estimate = PREDS_BAGGING)
    2.3) For the random forest
    • AUC_FOREST <- roc_auc(PRED_COMBINED, truth = Y, estimate = PREDS_FOREST)
    2.4) For the boosted trees
    • AUC_BOOST <- roc_auc(PRED_COMBINED, truth = Y, estimate = PREDS_BOOSTING)
    3. Combine all AUCs # Specify .id = "model"
    • bind_rows(AUC_TREE, AUC_BAGGED, AUC_FOREST, AUC_BOOST, .id = "model")
    4. Reshape the predictions into long format
    • PREDICTIONS_LONG <- tidyr::pivot_longer(PRED_COMBINED, cols = starts_with("preds_"), names_to = "model", values_to = "predictions")
    5. Calculate cutoff values
    • CUTOFFS <- PREDICTIONS_LONG %>% group_by(model) %>% roc_curve(truth = Y, estimate = predictions)
    6. Plot ROC curves
    • autoplot(CUTOFFS)

    Exercise

    1. Specify that tree

    #Load the package
    • library(tidymodels) #Pick a model class
    • tree_spec <- decision_tree() %>% set_engine("rpart") %>% #Set the engine set_mode("classification") # Set the mode

    2. Train that model

    #Create the specification
    • tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification") #Train the model
    • tree_model_bmi <- tree_spec %>% fit(formula = outcome ~ bmi, data = diabetes) # Outcome = yes/no

    3. Train/test split

    #Create the split
    • diabetes_split <- initial_split(diabetes, prop = 0.8) #Extract the training and test set
    • diabetes_train <- training(diabetes_split)
    • diabetes_test <- testing(diabetes_split) #Verify the proportions of both sets
    • round(nrow(diabetes_train) / nrow(diabetes), 2) == 0.80
    • round(nrow(diabetes_test) / nrow(diabetes), 2) == 0.20

    4. Avoiding class imbalances

    #Preparation
    • set.seed(9888)
    • diabetes_split <- initial_split(diabetes, prop = 0.75) #Proportion of 'yes' outcomes in the training data
    • counts_train <- table(training(diabetes_split)$outcome)
    • prop_yes_train <- counts_train["yes"] / sum(counts_train) #Proportion of 'yes' outcomes in the test data
    • counts_test <- table(testing(diabetes_split)$outcome)
    • prop_yes_test <- counts_test["yes"] / sum(counts_test)
    • paste("Proportion of positive outcomes in training set:", round(prop_yes_train, 2))
    • paste("Proportion of positive outcomes in test set:", round(prop_yes_test, 2)) #Create a split with a constant outcome distribution [specify strata]
    • diabetes_split <- initial_split(diabetes, prop = 0.75, strata = outcome)
    • counts_train <- table(training(diabetes_split)$outcome)
    • prop_yes_train <- counts_train["yes"] / sum(counts_train)
    • counts_test <- table(testing(diabetes_split)$outcome)
    • prop_yes_test <- counts_test["yes"] / sum(counts_test)
    • paste("Proportion of positive outcomes in training set:", round(prop_yes_train, 2))
    • paste("Proportion of positive outcomes in test set:", round(prop_yes_test, 2))

    5. From zero to hero

    • set.seed(9) #Create the balanced data split
    • diabetes_split <- initial_split(diabetes, prop = 0.75, strata = outcome) #Build the specification of the model
    • tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("classification") #Train the model
    • model_trained <- tree_spec %>%
    • fit(formula = outcome ~ bmi + skin_thickness, data = training(diabetes_split))

    6. Make predictions

    #Train your model

    • model <- tree_spec %>% fit(formula = outcome ~ ., data = diabetes_train) #Generate predictions

    • predictions <- predict(model, new_data = diabetes_test, type ="class") #Add the true outcomes
    • predictions_combined <- predictions %>% mutate(true_class = diabetes_test$outcome)

    7. Crack the matrix

    #The confusion matrix
    • diabetes_matrix <- conf_mat(data = predictions_combined, estimate = .pred_class, truth = true_class)

    8. Are you predicting correctly?

    #Calculate the number of correctly predicted classes
    • correct_predictions <- 75 + 34 #Calculate the number of all predicted classes

    • all_predictions <- 75 + 34 + 18 + 27 #Calculate and print the accuracy

    • acc_manual <- correct_predictions / all_predictions

      #The accuracy calculated by a function

    • acc_auto <- accuracy(predictions_combined, estimate = .pred_class, truth = true_class)

    • acc_auto$.estimate

    9. Train a regression tree

    • library(tidymodels) #Build the specification
    • model_spec <- decision_tree() %>% set_mode("regression") %>% set_engine("rpart") #Fit to the data
    • model_fit <- model_spec %>% fit(formula = final_grade ~ cocoa_percent + review_date , data = chocolate_train)

    10. Predict new values

    #Train the model
    • chocolate_model <- model_spec %>% fit(formula = final_grade ~., data = chocolate_train) #Predict new data
    • predictions <- predict(chocolate_model, new_data = chocolate_test) %>% #Add the test set bind_cols(chocolate_test)

    11. In-sample performance

    #Predict using the training set
    • in_sample_predictions <- predict(model, new_data = chocolate_train) #Calculate the vector of absolute differences
    • abs_diffs <- abs(chocolate_train$final_grade - in_sample_predictions$.pred) #Calculate the mean absolute error
    • 1 / length(abs_diffs) * sum(abs_diffs)

    12. Out-of-sample performance

    #Predict ratings on test set and add true grades
    • test_enriched <- predict(model, new_data = chocolate_test) %>% bind_cols(chocolate_test) #Compute the mean absolute error using one single function
    • mae(test_enriched, estimate = .pred, truth = final_grade)

    13. Bigger mistakes, bigger penalty

    #Calculate the squared differences
    • squared_diffs <- (test_enriched$final_grade - test_enriched$.pred)^2 #Compute the RMSE using the formula
    • rmse_manual <- sqrt(1 / length(squared_diffs) * sum(squared_diffs)) #Compute the RMSE using a function
    • rmse_auto <- rmse(test_enriched, estimate = .pred, truth = final_grade)

    14. Create the folds

    #Set seed for reproducibility
    • set.seed(20) #Build 10 folds
    • chocolate_folds <- vfold_cv(chocolate_train, v = 10)

    15. Fit the folds

    #Create a specification
    • tree_spec <- decision_tree() %>% set_engine("rpart") %>% set_mode("regression") #Fit all folds to the specification
    • fits_cv <- fit_resamples(tree_spec, final_grade ~ ., resamples = chocolate_folds, metrics = metric_set(mae, rmse))

    16. Evaluate the folds

    • library(ggplot2) #Collect the errors
    • all_errors <- collect_metrics(fits_cv, summarize = FALSE) #Plot an error histogram
    • ggplot(all_errors, aes(.estimate, fill = .metric)) + geom_histogram() #Collect and print error statistics
    • collect_metrics(fits_cv, summarize = TRUE)

    17. Adjust model complexity

    #Create a model that can grow arbitrarily complex
    • chocolate_model <- decision_tree(min_n = 2, cost_complexity = 0) %>% set_mode("regression") %>% set_engine("rpart") %>% fit(final_grade ~ ., data = chocolate_train)

    18. In-sample and out-of-sample performance

    #Predict on and combine with training data and calculate the error [In-of-sample]
    • predict(complex_model, new_data = chocolate_train) %>% bind_cols(chocolate_train) %>% mae(estimate = .pred, truth = final_grade) #Predict on and combine with test data and calculate the error [Out-of-sample]
    • predict(complex_model, new_data = chocolate_test) %>% bind_cols(chocolate_test) %>% mae(estimate = .pred, truth = final_grade)

    19. Generate a tuning grid

    #Create a specification with tuning placeholders
    • tune_spec <- decision_tree(tree_depth = tune(), cost_complexity = tune()) %>% set_mode("classification") %>% set_engine("rpart") #Create a regular grid
    • tree_grid <- grid_regular(parameters(tune_spec), levels = 2)

    20. Tune along the grid

    • set.seed(275) #Create CV folds of the customers tibble
    • folds <- vfold_cv(customers, v = 3) #Tune along the grid
    • tune_results <- tune_grid(tune_spec, still_customer ~ ., resamples = folds, grid = tree_grid, metrics = metric_set(accuracy)) #Plot the tuning results
    • autoplot(tune_results)

    21. Pick the winner

    #Select the parameters that perform best
    • final_params <- select_best(tune_results) #Finalize the specification
    • best_spec <- finalize_model(tune_spec, final_params) #Build the final model
    • final_model <- fit(best_spec, still_customer ~ ., data = customers)

    22. Calculate specificity

    #Create CV folds of the training data
    • folds <- vfold_cv(customers_train, v = 3) #Calculate CV specificity
    • specificities <- fit_resamples(tree_spec, still_customer ~ ., resamples = folds, metrics = metric_set(specificity)) #Collect the metrics
    • collect_metrics(specificities)

    23. Draw the ROC curve

    #Predict probabilities on test set
    • predictions <- predict(model, customers_test, type = "prob") %>% bind_cols(customers_test) #Calculate the ROC curve for all thresholds
    • roc <- roc_curve(predictions, estimate = .pred_yes, truth = still_customer) #Plot the ROC curve
    • autoplot(roc)

    24. Area under the ROC curve

    #Calculate area under the curve
    • auc_result <- roc_auc(predictions, estimate = .pred_yes, truth = still_customer)
    • print(paste("The area under the ROC curve is", round(auc_result$.estimate, 3)))

    25. Create bagged trees

    #Create the specification
    • library(baguette)
    • spec_bagged <- bag_tree() %>% set_mode("classification") %>% set_engine("rpart", times = 20) #Fit to the training data
    • model_bagged <- fit(spec_bagged, still_customer ~ total_trans_amt + customer_age + education_level, data = customers_train)

    26. In-sample ROC and AUC

    #Predict on training set and add to training set
    • predictions <- predict(model_bagged, new_data = customers_train, type = "prob") %>% bind_cols(customers_train) #Create and plot the ROC curve
    • roc_curve(predictions, estimate = .pred_yes, truth = still_customer) %>% autoplot() #Calculate the AUC
    • roc_auc(predictions, estimate = .pred_yes, truth = still_customer)

    27. Check for overfitting

    • set.seed(55) #Estimate AUC using cross-validation
    • cv_results <- fit_resamples(spec_bagged, still_customer ~ total_trans_amt + customer_age + education_level, resamples = vfold_cv(customers_train, v = 3), metrics = metric_set(roc_auc)) #Collect metrics
    • collect_metrics(cv_results)

    28. Variable importance

    #Specify a random forest
    • spec <- rand_forest() %>% set_mode("classification") %>% set_engine("ranger", importance = "impurity") #Train the forest
    • model <- spec %>% fit(still_customer ~ ., data = customers_train) #Plot the variable importance
    • vip::vip(model)

    29. Specify a boosted ensemble

    #Specify the model class
    • boost_spec <- boost_tree() %>% #Set the mode set_mode("classification") %>% #Set the engine set_engine("xgboost")

    30. Train a boosted ensemble

    • boost_model <- fit(boost_spec, still_customer ~ ., data = customers_train)

    31. Evaluate the ensemble

    • set.seed(99) #Create CV folds
    • folds <- vfold_cv(customers_train, v = 5) #Fit and evaluate models for all folds
    • cv_results <- fit_resamples(boost_spec, still_customer ~ ., resamples = folds, metrics = metric_set(roc_auc)) #Collect cross-validated metrics
    • collect_metrics(cv_results)

    32. Compare to a single classifier

    1. Boosted tree
    • set.seed(100)
    • predictions <- boost_tree() %>% set_mode("classification") %>% set_engine("xgboost") %>% fit(still_customer ~ ., data = customers_train) %>% predict(new_data = customers_train, type = "prob") %>% bind_cols(customers_train)
    • roc_auc(predictions , truth = still_customer, estimate = .pred_yes)
    1. Simple tree
    • set.seed(100)
    • predictions <- decision_tree() %>% # Change function set_mode("classification") %>% set_engine("rpart") %>% # Change engine fit(still_customer ~ ., data = customers_train) %>% predict(new_data = customers_train, type = "prob") %>% bind_cols(customers_train)
    • roc_auc(predictions, truth = still_customer, estimate = .pred_yes)

    33. Tuning preparation

    #Create the specification with placeholders

    • boost_spec <- boost_tree( trees = 500, learn_rate = tune(), tree_depth = tune(), sample_size = tune()) %>% set_mode("classification") %>% set_engine("xgboost") #Create the tuning grid
    • tunegrid_boost <- grid_regular(parameters(boost_spec), levels = 3)

    34. The actual tuning

    #Create CV folds of training data
    • folds <- vfold_cv(customers_train, v = 6) #Tune along the grid
    • tune_results <- tune_grid(boost_spec, still_customer ~ ., resamples = folds, grid = tunegrid_boost, metrics = metric_set(roc_auc)) #Plot the results
    • autoplot(tune_results)

    35. Finalize the model

    #Select the final hyperparameters

    • best_params <- select_best(tune_results) #Finalize the specification
    • final_spec <- finalize_model(boost_spec, best_params) #Train the final model on the full training data

    • final_model <- final_spec %>% fit( still_customer ~ ., data = customers_train)

    36. Compare AUC

    #Calculate the AUC for each model
    • auc_tree <- roc_auc(preds_combined, truth = still_customer, estimate = preds_tree)
    • auc_bagged <- roc_auc(preds_combined, truth = still_customer, estimate = preds_bagging)
    • auc_forest <- roc_auc(preds_combined, truth = still_customer, estimate = preds_forest)
    • auc_boost <- roc_auc(preds_combined, truth = still_customer, estimate = preds_boosting) #Combine AUCs into one tibble
    • combined <- bind_rows(decision_tree = auc_tree, bagged_tree = auc_bagged, random_forest = auc_forest, boosted_tree = auc_boost, .id = "model")

    37. Plot ROC curves

    #Reshape the predictions into long format
    • predictions_long <- tidyr::pivot_longer(preds_combined, cols = starts_with("preds_"), names_to = "model", values_to = "predictions")
    • predictions_long %>% group_by(model) %>% #Group by model, then calculate values for every cutoff
      roc_curve(truth = still_customer, estimate = predictions) %>%
      autoplot() #Create a plot from the calculated data

    Course 6 Support Vector Machines (SVMs) in R

    Chapter 1. Introduction

    1.1 Maximum margin separator

    Definition

    1. Separable = a dataset in which the classes do not overlap
    2. Decision boundary = a separating boundary between the classes
    3. Maximum margin separator = the best decision boundary, the one that maximizes the margin

    Concept

    Maximal margin separator lies halfway between the two clusters that you want to divide

    Command: Visualize the maximal margin separator

    1. Create data frame with maximal margin separator
    • MM_SEP <- data.frame(sep = c((SEP_CLUS_1 + SEP_CLUS_2) / 2)) # SEP_CLUS_1 = extreme point for lower cluster, SEP_CLUS_2 = extreme point for higher cluster

    2. Add the maximal margin separator to the previous plot

    • PLOT <- PLOT + geom_point(data = MM_SEP, aes(sep, 0),color = "blue",size = 4)

    1.2 Generating a linearly separable dataset

    Concept

    1. Create a dataset (two variables) that can be used to illustrate SVMs
    2. Create a linear decision boundary that seperates the two variable in the dataset

    Steps and commands

    1. Generate a two-dimensional dataset
    1.1) Set the required number of data points
    • N <- 200
    1.2) Set a seed to ensure reproducibility
    • set.seed(INT)
    1.3) Generate a dataframe with two predictors X1 and X2 in (0, 1)
    • DF <- data.frame(X1 = runif(N), X2 = runif(N))
    2. Create two classes (-1 or +1), separated by a straight-line decision boundary
    • DF$Y <- factor(ifelse(DF$X1 - DF$X2 > 0, -1, 1), levels = c(-1, 1))
    3. Visualize using ggplot
    • PLOT <- ggplot(data = DF, aes(x = X1, y = X2, color = Y)) + geom_point() + scale_color_manual(values = c("-1" = "red", "1" = "blue")) + # Distinguish by color
      geom_abline(slope = 1, intercept = 0) # Decision boundary is the line X1 = X2, passing through (0,0) with slope = 1
    4. Add a margin
    4.1) Remove points that lie close to the boundary
    • MARGIN <- 0.05 # Create a margin of 0.05 in the dataset
    • DF1 <- DF[abs(DF$X1 - DF$X2) > MARGIN, ] # Retain only those points that lie outside the margin
    • nrow(DF1) # Check the number of data points remaining
    4.2) Replot
    • PLOT <- ggplot(data = DF1, aes(x = X1, y = X2, color = Y)) + geom_point() + scale_color_manual(values = c("-1" = "red", "1" = "blue")) + # Distinguish by color
      geom_abline(slope = 1, intercept = 0)
    5. Plot the margin's upper and lower boundaries
    • PLOT <- PLOT + geom_abline(slope = 1, intercept = MARGIN, linetype = "dashed") + geom_abline(slope = 1, intercept = -MARGIN, linetype = "dashed")

    Chapter 2. Support Vector Classifiers - Linear Kernels

    2.1 Linear Support Vector Machines

    2.1.1 Concept

    1. Decision boundaries can have different shapes e.g. line (previous Chapter), polynomial, or more complex function
    2. Type of decision boundary = "kernel", it must be specified upfront

    2.1.2 SVM with linear kernel

    • library(e1071)
    • svm(RESP_VAR ~ EXP_VAR,        # formula (same as other models)
        data = DF,                   # dataframe containing the data
        type = "C-classification",   # set to C-classification
        kernel = "linear",           # form of the decision boundary
        cost = XXX,                  # tuning parameter
        gamma = YYY,                 # tuning parameter
        scale = TRUE/FALSE)          # in real work this should usually be TRUE

    2.1.3 Steps of building SVM with linear kernel and Commands

    1. Split data into training and test datasets
    1.1) Set a seed for reproducibility
    • set.seed(INT)
    1.2) Set the upper bound for the number of rows in the training set
    • SAMPLE_SIZE <- floor(0.8 * nrow(DF))
    1.3) Randomly assign rows to training/test sets in an 80/20 proportion
    • DATA <- sample(seq_len(nrow(DF)), size = SAMPLE_SIZE)
    1.4) Separate training and test sets
    • TRAIN_DATA <- DF[DATA, ]
    • TEST_DATA <- DF[-DATA, ]
    2. Build a linear SVM
    • library(e1071)
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, # Use the training data
        type = "C-classification", kernel = "linear", scale = FALSE)
    3. Look into the model
    3.1) See an overview of the model
    • SVM_MODEL
    3.2) List the components of the model
    • names(SVM_MODEL)
    3.3) Indices of the support vectors in the training dataset
    • SVM_MODEL$index
    3.4) Support vectors
    • SVM_MODEL$SV
    3.5) Negative intercept (unweighted)
    • SVM_MODEL$rho
    3.6) Weighting coefficients for the support vectors
    • SVM_MODEL$coefs
    4. Obtain class predictions for the training and test datasets
    • PRED_TRAIN <- predict(SVM_MODEL, TRAIN_DATA)
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    5. Evaluate the accuracy of the model
    • mean(PRED_TRAIN == TRAIN_DATA$Y)
    • mean(PRED_TEST == TEST_DATA$Y)

    2.2 Visualizing Linear SVMs

    Steps of visualizing linear SVMs (Manual method)

    1. Plot the training data
    • PLOT <- ggplot(data = TRAIN_DATA, aes(x = X1, y = X2, color = Y)) + geom_point() + scale_color_manual(values = c("red", "blue"))
    2. Mark out the support vectors using the index from the svm model
    2.1) Identify the support vectors
    • DF_SV <- TRAIN_DATA[SVM_MODEL$index, ]
    2.2) Mark out the support vectors in the plot
    • PLOT <- PLOT + geom_point(data = DF_SV, aes(x = X1, y = X2), color = "purple", size = 4, alpha = 0.5)
      Note: ggplot works in layers, so geom_point() adds another layer for the markers.
    3. Find the slope and intercept of the boundary
    3.1) Build the weight vector from the coefs and SV elements of the svm model
    • WEIGHT <- t(SVM_MODEL$coefs) %*% SVM_MODEL$SV # %*% = matrix multiplication
    3.2) Calculate the slope
    • SLOPE_1 <- -WEIGHT[1] / WEIGHT[2]
    3.3) Calculate the intercept
    • INTERCEPT_1 <- SVM_MODEL$rho / WEIGHT[2]
    4. Add the decision boundary using the calculated slope and intercept
    4.1) Plot the decision boundary
    • PLOT <- PLOT + geom_abline(slope = SLOPE_1, intercept = INTERCEPT_1)
    4.2) Add the soft margin parallel to the decision boundary, offset by 1 / WEIGHT[2] on either side
    • PLOT <- PLOT + geom_abline(slope = SLOPE_1, intercept = INTERCEPT_1 - 1 / WEIGHT[2], linetype = "dashed") + geom_abline(slope = SLOPE_1, intercept = INTERCEPT_1 + 1 / WEIGHT[2], linetype = "dashed")
      Note: Soft margin classifiers allow for uncertainty in the location/shape of the boundary, since the decision boundary is neither perfectly linear nor known exactly; if the boundary really is linear, the margin can be reduced.

    Steps of visualizing linear SVMs (Convenient method)

    • library(e1071)
    • plot(x = SVM_MODEL,data = TRAIN_DATA)

    2.3 Tuning linear SVMs

    2.3.1 Cost parameter for linear SVM

    • svm_model <- svm(y ~ ., data = TRAIN_DATA,
      type = "C-classification",
      kernel = "linear",
      cost = INT, # Specify cost; the higher it is, the narrower the margin (visible in the number of support vectors)
      scale = FALSE)
      Note: Tuning the cost to increase accuracy mainly works when the data are (close to) linearly separable; on a non-linear dataset it does not improve accuracy much.

    2.4 Multiclass problems

    2.4.1 Concept: How SVM algorithm deal with multiclass problems

    1. SVMs are inherently binary classifiers
    2. SVM applies the "one-against-one" classification strategy for multiclass problems as follows (a small illustration follows below):
    2.1) Partition the data into subsets containing two classes each
    2.2) Solve the binary classification problem for each subset
    2.3) Use majority vote to assign a class to each data point
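    A small illustration of the scale of the one-against-one strategy, using the built-in iris data as an example: with k classes, choose(k, 2) pairwise binary problems are solved (svm() does this internally):

        k <- length(levels(iris$Species))
        choose(k, 2)                     # 3 binary classification problems for 3 classes
        combn(levels(iris$Species), 2)   # the class pairs, one binary SVM per column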

    2.4.2 Commands [same as previous]

    • library(e1071)
    • svm_model <- svm(Species ~ ., data = TRAIN_DATA,
      type ="C-classification",
      kernel ="linear")

    Chapter 3. Support Vector Classifiers - Polynomial Kernels

    1. Generating a radially separable dataset

    1.1 Generate a 2D uniformly distributed set of points

    1. Set required number of datapoints
    • n <- INT
    1. Set seed to ensure reproducibility
    • set.seed(INT)
    1. Generate dataframe with 2 predictors x1 and x2 in (-1, 1)
    • DF <- data.frame(X1 = runif(n, min = -1, max = 1), X2 = runif(n, min = -1, max = 1))

    1.2 Create data within a circular boundary of decision

    1. Create a circular decision boundary of radius INT
    • RADIUS <- INT
    • RADIUS_SQUARED <- RADIUS ^ 2
    2. Categorize data points depending on their location with respect to the boundary
    • DF$Y <- factor(ifelse(DF$X1 ^ 2 + DF$X2 ^ 2 < RADIUS_SQUARED, -1, 1), levels = c(-1, 1))

    1.3 Plot the dataset

    • library(ggplot2)
    • PLOT <- ggplot(data = DF, aes(x = X1, y = X2, color = Y)) + geom_point() + scale_color_manual(values = c("-1" = "red","1" = "blue"))
    • PLOT

    1.4 Add a circular boundary

    1. Define a function that generates points lying on a circle of radius r
    • CIRCLE <- function(X1_CENTER, X2_CENTER, r, npoint = 100) {
        # Angular spacing of 2*pi/npoint between points
        THETA <- seq(0, 2 * pi, length.out = npoint)
        X1_CIRC <- X1_CENTER + r * cos(THETA)
        X2_CIRC <- X2_CENTER + r * sin(THETA)
        data.frame(X1C = X1_CIRC, X2C = X2_CIRC)
      }
    2. Generate the boundary
    • BOUNDARY <- CIRCLE(X1_CENTER = 0, X2_CENTER = 0, r = RADIUS)
    3. Add the boundary to the previous plot
    • PLOT <- PLOT + geom_path(data = BOUNDARY, aes(x = X1C, y = X2C), inherit.aes = FALSE)

    2. Creating linear SVMs model on radially separable data

    2.1 Build a linear SVM model, preliminarily evaluate and visualize it

    1. Build a linear SVM model
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "linear", cost = 1)
    2. Calculate the accuracy on the test data
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    3. Plot the model with the training data
    • plot(SVM_MODEL, TRAIN_DATA)

    2.2 Tune cost parameter

    1. Tune the cost parameter
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "linear", cost = 100) # Tune to 100, for example
    2. Evaluate the tuned model using accuracy
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    3. Plot the tuned model with the training data
    • plot(SVM_MODEL, TRAIN_DATA)

    2.3 Evaluate model

    1. Concept: a better estimate of accuracy
    1.1) Calculate the mean accuracy over a number of independent train/test splits
    1.2) Check the SD of the results to gauge variability
    2. Commands
    • ACCURACY <- rep(NA, 100)
    • set.seed(10)
    • for (i in 1:100) {
        SAMPLE_SIZE <- floor(0.8 * nrow(DF))
        TRAIN <- sample(seq_len(nrow(DF)), size = SAMPLE_SIZE)
        TRAIN_DATA <- DF[TRAIN, ]
        TEST_DATA <- DF[-TRAIN, ]
        SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "linear", cost = 100)
        PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
        ACCURACY[i] <- mean(PRED_TEST == TEST_DATA$Y)
      }
    • mean(ACCURACY)
    • sd(ACCURACY)

    3. Applying kernel trick on radially separable data

    3.1 Transform the problem

    1. Concept = devise a transformation that makes the problem linearly separable
    2. Equation: x1^2 + x2^2 = NUM. If X1 = x1^2 and X2 = x2^2, then X1 + X2 = NUM -> a linear equation
    • DF_1 <- data.frame(X1SQ = DF$X1^2, X2SQ = DF$X2^2, Y = DF$Y) # DF from the previous model
    3. Plot the transformed data
    • TRANSFORMED_PLOT <- ggplot(data = DF_1, aes(x = X1SQ, y = X2SQ, color = Y)) + geom_point() + scale_color_manual(values = c("red", "blue")) + geom_abline(slope = -1, intercept = RADIUS_SQUARED) # In the transformed space the boundary is a line with slope -1 and intercept RADIUS_SQUARED

    3.2 Concept: Polynomial kernel

    1. The polynomial kernel: (gamma * (u . v) + coef0) ^ degree (a minimal sketch follows below)
    1.1) degree = the degree of the polynomial
    1.2) gamma and coef0 = tuning parameters
    1.3) u, v = vectors (data points) belonging to the dataset
    2. Kernel functions = functions that satisfy specific mathematical properties, which let SVMs work in the transformed space
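    A minimal sketch of the polynomial kernel value for two data points u and v (plain numeric vectors), matching the formula above; gamma, coef0 and degree are the tuning parameters:

        poly_kernel <- function(u, v, gamma = 1, coef0 = 0, degree = 2) {
          (gamma * sum(u * v) + coef0) ^ degree
        }

        poly_kernel(c(0.3, 0.4), c(0.1, 0.9), gamma = 1, coef0 = 1, degree = 2)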

    3.3 Apply a quadratic (polynomial) kernel on radially separable data

    1. Split train/test data and build a model
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "polynomial", # Specify the kernel as polynomial
        degree = INT) # Set the degree of the polynomial
    2. Evaluate the model
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    3. Plot the model
    • plot(SVM_MODEL, TRAIN_DATA)

    4. Tuning polynomial SVMs

    4.1 Objectives of tuning

    Hard to find optimal values of parameters manually for complex kernels

    4.2 Concept: how it works

    1. Set a range of search values for each parameter
    2. For each possible combination, build an SVM model
    3. Evaluate the model accuracy
    4. Return the parameter with best accuracy

    4.3 Steps and commands

    1. Set a range of search values for each parameter (tune.svm() builds and evaluates a model for every combination and returns the best one)
    • TUNE_OUT <- tune.svm(x = TRAIN_DATA[, -3], y = TRAIN_DATA[, 3],
        type = "C-classification", kernel = "polynomial", degree = 2,
        cost = 10^(-1:2),        # Set range
        gamma = c(0.1, 1, 10),   # Set range
        coef0 = c(0.1, 1, 10))   # Set range
    • TUNE_OUT$best.parameters$cost
    • TUNE_OUT$best.parameters$gamma
    • TUNE_OUT$best.parameters$coef0
    2. Build an SVM model using the best parameter combination
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "polynomial", degree = 2,
        cost = TUNE_OUT$best.parameters$cost, gamma = TUNE_OUT$best.parameters$gamma, coef0 = TUNE_OUT$best.parameters$coef0)
    3. Evaluate the model accuracy on the train and test datasets
    • PRED_TRAIN <- predict(SVM_MODEL, TRAIN_DATA)
    • mean(PRED_TRAIN == TRAIN_DATA$Y)
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)

    Chapter 4. Support Vector Classifiers - Radial Basis Function (RBF) Kernels

    1. Generating a complex dataset

    1.1 Concept: RBF kernel

    1. Highly flexible kernel - can fit complex decision boundaries
    2. Commonly used in practice

    1.2 Steps of generating a complex dataset

    1. Generate a complex dataset whose two predictors follow different distributions
    • N <- INT # Set the number of data points
    • set.seed(INT) # Set a seed for reproducibility
    • DF <- data.frame(X1 = rnorm(N, mean = -0.5, sd = 1), X2 = runif(N, min = -1, max = 1))
    2. Classify the data points using a boundary made of two equi-radial circles with a single point in common
    2.1) Set the radius and centers
    • RADIUS <- 0.7
    • RADIUS_SQUARED <- RADIUS ^ 2
    • CENTER_1 <- c(-0.7, 0)
    • CENTER_2 <- c(0.7, 0)
    2.2) Classify points (inside either circle = -1, outside = 1)
    • DF$Y <- factor(ifelse((DF$X1 - CENTER_1[1])^2 + (DF$X2 - CENTER_1[2])^2 < RADIUS_SQUARED |
        (DF$X1 - CENTER_2[1])^2 + (DF$X2 - CENTER_2[2])^2 < RADIUS_SQUARED, -1, 1), levels = c(-1, 1))
    3. Visualize the dataset
    • PLOT <- ggplot(data = DF, aes(x = X1, y = X2, color = Y)) + geom_point() + guides(color = "none") + scale_color_manual(values = c("red", "blue"))
    4. Draw the boundary around the data points
    4.1) Generate points on a circle using a function
    • CIRCLE <- function(X1_CENTER, X2_CENTER, r, npoint = 100) {
        THETA <- seq(0, 2 * pi, length.out = npoint)
        X1_CIRC <- X1_CENTER + r * cos(THETA)
        X2_CIRC <- X2_CENTER + r * sin(THETA)
        data.frame(X1C = X1_CIRC, X2C = X2_CIRC)
      }
    4.2) Generate the boundaries using the points from 4.1)
    • BOUNDARY_1 <- CIRCLE(X1_CENTER = CENTER_1[1], X2_CENTER = CENTER_1[2], r = RADIUS)
    • PLOT <- PLOT + geom_path(data = BOUNDARY_1, aes(x = X1C, y = X2C), inherit.aes = FALSE)
    • BOUNDARY_2 <- CIRCLE(X1_CENTER = CENTER_2[1], X2_CENTER = CENTER_2[2], r = RADIUS)
    • PLOT <- PLOT + geom_path(data = BOUNDARY_2, aes(x = X1C, y = X2C), inherit.aes = FALSE)

    2. Applying linear and polynomial kernels to the complex dataset

    Quadratic kernel

    2.1 Apply the quadratic kernel with default parameters

    1. Split train/test data and build a model
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "polynomial", # Specify the kernel as polynomial
        degree = 2) # Quadratic: degree 2
    2. Evaluate the model
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    3. Plot the model
    • plot(SVM_MODEL, TRAIN_DATA)

    2.2 Apply a higher degree -> the polynomial kernel still doesn't fit much better

    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "polynomial", degree = 4) # Set a higher degree of the polynomial, e.g. 4

    2.3 Apply another approach

    1. Concept - a heuristic: points close to one another tend to have the same class (as in the k-Nearest Neighbors algorithm)
    2. How the heuristic works - for a given point in the dataset, e.g. X1 = (a, b):
    2.1) The kernel should have its maximum at (a, b) and should decay as one moves away from (a, b)
    2.2) In the absence of any other information, the rate of decay (gamma) should be the same in all directions
    2.3) The rate of decay should be tunable
    3. Function for the heuristic: exp(-gamma * r), where r is the distance between X1 and any other point X [the RBF concept]
    • RBF <- function(r, gamma) exp(-gamma * r)
    • ggplot(data.frame(r = c(-0, 10)), aes(r)) + stat_function(fun = RBF, args = list(gamma = 0.2), aes(color = "0.2")) + stat_function(fun = RBF, args = list(gamma = 0.4), aes(color = "0.4")) + stat_function(fun = RBF, args = list(gamma = 0.6), aes(color = "0.6")) + stat_function(fun = RBF, args = list(gamma = 0.8), aes(color = "0.8")) + stat_function(fun = RBF, args = list(gamma = 1), aes(color = "1")) + stat_function(fun = RBF, args = list(gamma = 2), aes(color = "2")) + scale_color_manual("gamma", values = c("red","orange","yellow","green","blue","violet")) + ggtitle("Radial basis function (gamma = 0.2 to 2)")

    3. Applying Radial Basis Function (RBF) kernels to the complex dataset

    3.1 Concept

    1. A decreasing function of the distance between two points in the dataset (see the sketch below)
    2. Simulates the k-NN algorithm (the closer two points are, the more similar they are treated)
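    A minimal sketch of the RBF kernel value for two points u and v: it decays with the squared distance between them at a rate set by gamma, which is the form used by e1071's radial kernel:

        rbf_kernel <- function(u, v, gamma = 1) {
          exp(-gamma * sum((u - v) ^ 2))
        }

        rbf_kernel(c(0.3, 0.4), c(0.1, 0.9), gamma = 0.5)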

    3.2 Steps of building an SVM using the RBF kernel and commands

    1. Build an RBF-kernel SVM for the complex dataset and evaluate it
    1.1) Build an SVM model with the RBF kernel
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "radial") # Specify "radial"
    1.2) Evaluate the model on both the train and test datasets
    • PRED_TRAIN <- predict(SVM_MODEL, TRAIN_DATA)
    • mean(PRED_TRAIN == TRAIN_DATA$Y)
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    1.3) Plot the model
    • plot(SVM_MODEL, TRAIN_DATA)
    2. Refine the decision boundary
    2.1) Tune gamma and cost using tune.svm()
    • TUNE_OUT <- tune.svm(x = TRAIN_DATA[, -3], y = TRAIN_DATA[, 3], gamma = 5 * 10 ^ (-2:2), cost = c(0.01, 0.1, 1, 10, 100), type = "C-classification", kernel = "radial")
    2.2) Print the best parameters
    • TUNE_OUT$best.parameters$cost
    • TUNE_OUT$best.parameters$gamma
    3. Build the tuned model, evaluate it on the test dataset, and plot it with the training dataset
    • SVM_MODEL <- svm(Y ~ ., data = TRAIN_DATA, type = "C-classification", kernel = "radial", cost = TUNE_OUT$best.parameters$cost, gamma = TUNE_OUT$best.parameters$gamma)
    • PRED_TEST <- predict(SVM_MODEL, TEST_DATA)
    • mean(PRED_TEST == TEST_DATA$Y)
    • plot(SVM_MODEL, TRAIN_DATA)

    Exercise

    1. Visualizing a sugar content dataset

    #Load ggplot2
    • library(ggplot2) #Print variable names
    • colnames(df) #Plot sugar content along the x-axis
    • plot_df <- ggplot(data = df, aes(x = sugar_content, y = 0)) + geom_point() + geom_text(aes(label = sugar_content), size = 2.5, vjust = 2, hjust = 0.5)
    • plot_df

    2. Find the maximal margin separator

    #The maximal margin separator is at the midpoint of the two extreme points in each cluster.
    • mm_separator <- (8.9 + 10)/2

    3. Visualize the maximal margin separator

    #create data frame containing the maximum margin separator
    • separator <- data.frame(sep = mm_separator) #add ggplot layer
    • plot_sep <- plot_df + geom_point(data = separator, aes(x = sep, y = 0), color = "blue", size = 4)
    • plot_sep

    4. Generate a 2d uniformly distributed dataset

    #set seed
    • set.seed(42) #set number of data points.
    • n <- 600 #Generate data frame with two uniformly distributed predictors lying between 0 and 1.
    • df <- data.frame(x1 = runif(n), x2 = runif(n))

    5. Create a decision boundary

    #classify data points depending on location
    • df$y <- factor(ifelse(df$x2 - 1.4*df$x1 < 0, -1, 1), levels = c(-1, 1))

    6. Introduce a margin in the dataset

    #set margin
    • delta <- 0.07 #retain only those points that lie outside the margin
    • df1 <- df[abs(1.4*df$x1 - df$x2) > delta, ] #build plot
    • plot_margins <- ggplot(data = df1, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("red", "blue")) + geom_abline(slope = 1.4, intercept = 0)+ geom_abline(slope = 1.4, intercept = delta, linetype = "dashed") + geom_abline(slope = 1.4, intercept = -delta, linetype = "dashed")
    • plot_margins

    7. Creating training and test datasets

    #Set the upper bound for the length of the training set
    • sample_size <- floor(0.8 * nrow(df)) #Assign rows to training set randomly
    • train <- sample(seq_len(nrow(df)), size = sample_size) #Yield training and test sets
    • trainset <- df[train, ]
    • testset <- df[-train, ]

    8. Building a linear SVM classifier

    • library(e1071) #build svm model, setting required parameters
    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear", scale = FALSE)

    9. Exploring the model and calculating accuracy

    #list components of model
    • names(svm_model) #list values of the SV, index and rho
    • svm_model$SV
    • svm_model$index
    • svm_model$rho #compute training accuracy
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y) #compute test accuracy
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y)

    10. Visualizing support vectors using ggplot

    #load ggplot
    • library(ggplot2) #build scatter plot of training dataset
    • scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("red", "blue")) #add plot layer marking out the support vectors
    • layered_plot <- scatter_plot + geom_point(data = trainset[svm_model$index, ], aes(x = x1, y = x2), color = "purple", size = 4, alpha = 0.5)
    • layered_plot

    11. Visualizing decision & margin bounds using ggplot2

    #calculate slope and intercept of decision boundary from weight vector and svm model
    • slope_1 <- -w[1]/w[2]
    • intercept_1 <- svm_model$rho/w[2] #build scatter plot of training dataset
    • scatter_plot <- ggplot(data = trainset, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("red", "blue")) #add decision boundary
    • plot_decision <- scatter_plot + geom_abline(slope = slope_1, intercept = intercept_1) #add margin boundaries
    • plot_margins <- plot_decision + geom_abline(slope = slope_1, intercept = intercept_1 - 1/w[2], linetype = "dashed")+ geom_abline(slope = slope_1, intercept = intercept_1 + 1/w[2], linetype = "dashed")
    • plot_margins

    12. Visualizing decision & margin bounds using plot()

    #load required library
    • library(e1071) #build svm model
    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear", scale = FALSE) #plot decision boundaries and support vectors for the training data
    • plot(x = svm_model, data = trainset)

    13. Tuning a linear SVM

    #build svm model, cost = 1
    • svm_model_1 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 1, kernel = "linear", scale = FALSE)
    • svm_model_1 #build svm model, cost = 100
    • svm_model_100 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 100, kernel = "linear", scale = FALSE)
    • svm_model_100

    14. Visualizing decision boundaries and margins

    [Cost = 1] #add decision boundary and margins for cost = 1 to training data scatter plot

    • train_plot_with_margins <- train_plot + geom_abline(slope = slope_1, intercept = intercept_1) + geom_abline(slope = slope_1, intercept = intercept_1-1/w_1[2], linetype = "dashed")+ geom_abline(slope = slope_1, intercept = intercept_1+1/w_1[2], linetype = "dashed")
    • train_plot_with_margins

    [Cost = 100] #add decision boundary and margins for cost = 100 to training data scatter plot
    • train_plot_with_margins <- train_plot_100 + geom_abline(slope = slope_100, intercept = intercept_100, color = "goldenrod") + geom_abline(slope = slope_100, intercept = intercept_100-1/w_100[2], linetype = "dashed", color = "goldenrod")+ geom_abline(slope = slope_100, intercept = intercept_100+1/w_100[2], linetype = "dashed", color = "goldenrod")
    • train_plot_with_margins

    15. A multiclass classification problem

    #load library and build svm model
    • library(e1071)
    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear", scale = FALSE) #compute training accuracy
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y) #compute test accuracy
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #plot
    • plot(x = svm_model, trainset)

    16. Iris redux - a more robust accuracy

    • for (i in 1:100){ #assign 80% of the data to the training set sample_size <- floor(0.8 * nrow(iris)) train <- sample(seq_len(nrow(iris)), size = sample_size) trainset <- iris[train, ] testset <- iris[-train, ] #build model using training data svm_model <- svm(Species~ ., data = trainset, type = "C-classification", kernel = "linear") #calculate accuracy on test data pred_test <- predict(svm_model, testset) accuracy[i] <- mean(pred_test == testset$Species) }
    • mean(accuracy)
    • sd(accuracy)

    17. Generating a 2d radially separable dataset

    #set number of variables and seed
    • n <- 400
    • set.seed(1) #Generate data frame with two uniformly distributed predictors, x1 and x2
    • df <- data.frame(x1 = runif(n, min = -1, max = 1), x2 = runif(n, min = -1, max = 1)) #We want a circular boundary. Set boundary radius
    • radius <- 0.8
    • radius_squared <- radius^2 #create dependent categorical variable, y, with value -1 or 1 depending on whether point lies #within or outside the circle
    • df$y <- factor(ifelse(df$x1^2 + df$x2^2 < radius_squared, -1, 1), levels = c(-1, 1))

    18.Visualizing the dataset

    #load ggplot
    • library(ggplot2) #build scatter plot, distinguish class by color
    • scatter_plot <- ggplot(data = df, aes(x = x1, y = x2, color = y)) + geom_point() + scale_color_manual(values = c("red", "blue"))
    • scatter_plot

    19. Linear SVM for a radially separable dataset

    [Cost = 1] #default cost model

    • svm_model_1 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 1, kernel = "linear") #training accuracy
    • pred_train <- predict(svm_model_1, trainset)
    • mean(pred_train == trainset$y) #test accuracy
    • pred_test <- predict(svm_model_1, testset)
    • mean(pred_test == testset$y)

    [Cost = 100] #cost = 100 model
    • svm_model_2 <- svm(y ~ ., data = trainset, type = "C-classification", cost = 100, kernel = "linear") #accuracy
    • pred_train <- predict(svm_model_2, trainset)
    • mean(pred_train == trainset$y) #test accuracy
    • pred_test <- predict(svm_model_2, testset)
    • mean(pred_test == testset$y)

    20. Average accuracy for linear SVM

    #Print average accuracy and standard deviation
    • accuracy <- rep(NA, 100)
    • set.seed(2) #Calculate accuracies for 100 training/test partitions
    • for (i in 1:100){ sample_size <- floor(0.8 * nrow(df)) train <- sample(seq_len(nrow(df)), size = sample_size) trainset <- df[train, ] testset <- df[-train, ] svm_model <- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear") pred_test <- predict(svm_model, testset) accuracy[i] <- mean(pred_test == testset$y) } #Print average accuracy and standard deviation
    • mean(accuracy)
    • sd(accuracy)

    21. Visualizing transformed radially separable data

    #transform data
    • df1 <- data.frame(x1sq = df$x1^2, x2sq = df$x2^2, y = df$y) #plot data points in the transformed space
    • plot_transformed <- ggplot(data = df1, aes(x = x1sq, y = x2sq, color = y)) + geom_point()+ guides(color = "none") + scale_color_manual(values = c("red", "blue")) #add decision boundary and visualize
    • plot_decision <- plot_transformed + geom_abline(slope = -1, intercept = 0.64)
    • plot_decision
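    Note on the decision boundary above: in the transformed (x1sq, x2sq) space, the circular boundary x1^2 + x2^2 = radius^2 = 0.8^2 = 0.64 becomes the straight line x2sq = 0.64 - x1sq, i.e. slope = -1 and intercept = 0.64, which is exactly what geom_abline() draws.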

    22.SVM with polynomial kernel

    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2) #measure training and test accuracy
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y)
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #plot
    • plot(svm_model, trainset)

    23. Using tune.svm()

    #tune model
    • tune_out <- tune.svm(x = trainset[, -3], y = trainset[, 3], type = "C-classification", kernel = "polynomial", degree = 2, cost = 10^(-1:2), gamma = c(0.1, 1, 10),
      coef0 = c(0.1, 1, 10)) #list optimal values
    • tune_out$best.parameters$cost
    • tune_out$best.parameters$gamma
    • tune_out$best.parameters$coef0

    24. Building and visualizing the tuned model

    #Build tuned model
    • svm_model <- svm(y~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2, cost = tune_outcost, gamma = tune_outgamma, coef0 = tune_outcoef0) #Calculate training and test accuracies
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y)
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #plot model
    • plot(svm_model, trainset)

    25. Generating a complex dataset - part 1

    #number of data points
    • n <- 1000 #set seed
    • set.seed(1) #create dataframe
    • df <- data.frame(x1 = rnorm(n, mean = -0.5, sd = 1), x2 = runif(n, min = -1, max = 1))

    26. Generating a complex dataset - part 2

    #set radius and centers
    • radius <- 0.8
    • center_1 <- c(-0.8, 0)
    • center_2 <- c(0.8, 0)
    • radius_squared <- radius^2 #create binary classification variable
    • df$y <- factor(ifelse((df$x1 - center_1[1])^2 + (df$x2 - center_1[2])^2 < radius_squared | (df$x1 - center_2[1])^2 + (df$x2 - center_2[2])^2 < radius_squared, -1, 1), levels = c(-1, 1))

    27. Visualizing the dataset

    #Load ggplot2
    • library(ggplot2) #Plot x2 vs. x1, colored by y
    • scatter_plot<- ggplot(data = df, aes(x = x1, y = x2, color = y)) + #Add a point layer geom_point() + scale_color_manual(values = c("red", "blue")) + #Specify equal coordinates coord_fixed()
    • scatter_plot

    28. Linear SVM for complex dataset

    #build model
    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "linear") #accuracy
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y)
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #plot model against testset
    • plot(svm_model, testset)

    29. Quadratic SVM for complex dataset

    #build model
    • svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2) #accuracy
    • pred_train <- predict(svm_model, trainset)
    • mean(pred_train == trainset$y)
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #plot model
    • plot(svm_model, trainset)

    30. Polynomial SVM on a complex dataset

    #create vector to store accuracies and set random number seed
    • accuracy <- rep(NA, 100)
    • set.seed(2) #calculate accuracies for 100 training/test partitions
    • for (i in 1:100){ sample_size <- floor(0.8 * nrow(df)) train <- sample(seq_len(nrow(df)), size = sample_size) trainset <- df[train, ] testset <- df[-train, ] svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "polynomial", degree = 2) pred_test <- predict(svm_model, testset) accuracy[i] <- mean(pred_test == testset$y) } #print average accuracy and standard deviation
    • mean(accuracy)
    • sd(accuracy)

    31. RBF SVM on a complex dataset

    #create vector to store accuracies and set random number seed
    • accuracy <- rep(NA, 100)
    • set.seed(2) #calculate accuracies for 100 training/test partitions
    • for (i in 1:100){ sample_size <- floor(0.8 * nrow(df)) train <- sample(seq_len(nrow(df)), size = sample_size) trainset <- df[train, ] testset <- df[-train, ] svm_model<- svm(y ~ ., data = trainset, type = "C-classification", kernel = "radial") pred_test <- predict(svm_model, testset) accuracy[i] <- mean(pred_test == testset$y) } #print average accuracy and standard deviation
    • mean(accuracy)
    • sd(accuracy)

    32. Tuning an RBF kernel SVM

    #tune model
    • tune_out <- tune.svm(x = trainset[, -3], y = trainset[, 3], gamma = 5*10^(-2:2), cost = c(0.01, 0.1, 1, 10, 100), type = "C-classification", kernel = "radial") #build tuned model
    • svm_model <- svm(y~ ., data = trainset, type = "C-classification", kernel = "radial", cost = tune_outcost, gamma = tune_outgamma) #calculate test accuracy
    • pred_test <- predict(svm_model, testset)
    • mean(pred_test == testset$y) #Plot decision boundary against test data
    • plot(svm_model, testset)

    Course 7: Fundamentals of Bayesian Data Analysis in R

    Chapter 1. What is Bayesian Data Analysis?

    1. Introduction

    1.1 Historical background of Bayesian analysis

    During WW2, Alan Turing used Bayesian reasoning to break the codes produced by the Nazi Enigma encryption machine.

    1.2 Bayesian inference

    1. Definition = A method for figuring out unobservable quantities given known facts, using probability to describe the uncertainty over what the values of [the unknown quantities] could be.
    2. For the Enigma machine, [the unknown quantities] = which three wheels (selected from a pool of eight) were used and what positions they were set to - the problem is that neither the wheels nor their positions are known.
    3. Bayesian inference was applied backwards, from the intercepted codes to the wheel settings (which wheels and which positions).

    2. Bayesian Data Analysis

    2.1 Basics

    1. Definition = The use of Bayesian inference to learn from data
    2. Background = Thomas Bayes wrote the first article; at first, Bayesian inference was simply called probabilistic inference
    3. Characteristics 3.1) Can be used for hypothesis testing, linear regression, etc 3.2) Is flexible and allows you to construct problem-specific models
    4. Probability 4.1) A statement about certainty (1 = complete certainty) or uncertainty (0 = complete uncertainty) 4.2) Not only about yes/no events 4.3) Role of probability distributions vs Bayesian inference 4.3.1) Role of a probability distribution = represent uncertainty 4.3.2) Role of Bayesian inference = update probability distributions to reflect what has been learned from the data

    2.2 Simple Bayesian model [for proportion of success]

    1. Assumptions of the Bayesian model 1.1) Data is a vector of successes and failures represented by 1s and 0s 1.2) There is an unknown underlying proportion of success 1.3) Whether a data point is a success is affected only by this proportion 1.4) Prior to seeing any data, any underlying proportion of success is equally likely 1.5) The result is a probability distribution that represents what the model knows about the underlying proportion of success.
    2. Steps of building a simple Bayesian model and their commands 2.1) Specify a vector for success and failure represented by 0 or 1
    • DATA <- c()
      2.2) Build a simple Bayesian model
    • prop_model(DATA) # Returns a graph of the proportion of success (more data will update the probability distribution)

    3. Samples and posterior summaries

    3.1 Prior and Posteriors

    1. Prior = a probability distribution that represents what the model knows before seeing the data (n=0)
    2. Posterior = a probability distribution that represents what the model knows after having seen the data (n= X)
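    A minimal base-R sketch of the same prior-to-posterior update, using standard Beta-binomial conjugacy rather than the course's prop_model() helper (assumption: with a flat Beta(1, 1) prior, observing s successes out of n trials gives a Beta(1 + s, 1 + n - s) posterior):

    • prior <- rbeta(n = 10000, shape1 = 1, shape2 = 1) # Prior: flat over 0-1, what the model knows before any data
    • posterior <- rbeta(n = 10000, shape1 = 1 + 2, shape2 = 1 + 11) # Posterior after seeing 2 successes out of 13 trials
    • hist(prior); hist(posterior) # The posterior concentrates at low proportions of success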

    Chapter 2. How does Bayesian inference work?

    1. Components of Bayesian inference

    1.1 Basic components of inference

    1. Generative model
    2. Priors = what the model knows b/f seeing the data
    3. Data Note: 1) + 2) = Bayesian model

    1.2 Generative model

    1. Definition = computer program or mathematic expression or set of rules that you can use to generate data from the parameters
    2. How to build a basic generative model 2.1) Set parameters
    • PROP_SUCCESS <- 0.15 # 15% curing success
    • N_ZOMBIES <- 13 2.2) Simulate data: sample a probability in (0, 1); if it is < 15%, count the draw as a success
    • DATA <- c()
    • for(zombie in 1:N_ZOMBIES) { DATA[zombie] <- runif(1, min = 0, max = 1) < PROP_SUCCESS }
    • DATA <- as.numeric(DATA) # Turn TRUE/FALSE into 1/0

    2. Component 1: Generative model

    2.1 Concept of Generative model

    1. Works the same way as the binomial distribution function (success/fail)
    2. Other distributions (normal, Poisson) can also be seen as small generative models

    2.2 Steps of Binomial generative model

    1. Parameters: prob, size
    2. Generative model: rbinom to generate data
    • rbinom(n = INT, size = INT, prob = X) # Returns a vector of success counts, one per run; n = number of times to run the generative model, size = number of trials (e.g. number of zombies), prob = underlying proportion of success between 0 and 1
    3. Data (a vector of success counts)

    2.3 Application of generative model

    Concept = We know data, but want to know parameter(s) by using Bayesian inference

    3. Component 2: Priors

    3.1 Definition and concept

    1. Prior = represents uncertainty b/f seeing the data
    2. Concept 2.1) We know approximate values of the parameters and how they differ 2.2) So we want to include that uncertainty in the binomial model 2.3) Assume the uncertainty takes the form of a probability in (0, x) -> a uniform distribution

    3.2 Steps of Binomial generative model (2)

    1. Assign a large number of samples using uniform distribution: runif (=random uniform) returns probs
    • PROP_UPDATE <- runif(n = N_SAMPLES, min = 0.0, max = NUM) # Specify uncertainty to create uniform distribution
    2. Insert the probs into rbinom (combining the variability of rbinom and runif)
    • rbinom(n = INT, size = INT, prob = PROP_UPDATE) # Returns a vector of success counts, one per run

    3. Bayesian models and conditioning (Component 3: Data and Bayesian inference)

    3.1 Build a Bayesian model (Prior + generative model)

    • N_SAMPLES <- 100000
    • N_TIMES <- 100
    • PROP_UPDATE <- runif(N_SAMPLES, min = 0.0, max = NUM)
    • DATA <- rbinom(n = N_SAMPLES, size = N_TIMES, prob = PROP_UPDATE)
    • PRIOR <- data.frame(PROP_UPDATE, DATA) # Combine the sampled parameter values and the simulated data

    3.2 Condition the Bayesian model (Bayesian inference)

    1. Concept: if we knew an approximate value of the parameter, we could remove all samples not fulfilling that approximate X% (e.g. % of clicks on a website) to reduce uncertainty, and so on
    2. Problem 2.1) We don't know the underlying X%, so we can't condition on it 2.2) What we do know is the data (component 3)
    3. Essence of Bayesian inference Definition = Conditioning on data in order to learn about parameter values
    4. Example of Bayesian model updates 4.1) Define prior
    • prior <- data.frame(proportion_clicks, n_visitors) 4.2) Condition on the data (we observed that 13 people clicked and visited our site)
    • posterior <- prior[prior$n_visitors == 13, ]
    • prior <- posterior #Replace prior$n_visitors with a new sample and visualize the result
    • n_samples <- nrow(prior)
    • n_ads_shown <- 100
    • prior$n_visitors <- rbinom(n_samples, size = n_ads_shown, prob = prior$proportion_clicks)
    • hist(prior$n_visitors) 4.3) See the updated posterior (prob of 5 or more people visiting the site)
    • sum(prior$n_visitors >= 5) / length(prior$n_visitors)

    Chapter 3. Why use Bayesian Data Analysis?

    1. Benefits of Bayesian data analysis

    Flexible

    1. Can include information sources in addition to the data (Note: the data can utilize pre-defined distribution e.g. beta distribution by defining alpha and beta) -> See ### Beta distribution 1.1) Background information 1.2) Expert opinion 1.3) Common knowledge
    2. Can compare between groups or between data sets -> See ## Comparisons between groups and datasets
    3. Can use results to inform business decisions -> See ## Decision analysis
    4. Can change the underlying statistical model -> ## 4. Change underlying statistical model
    5. Is optimal (probably; not guaranteed) -> ## 5. Bayes is optimal, kind of...

    Beta distribution

    1. Basic of beta distribution
    • beta_sample <- rbeta(n = INT, shape1 = POS_INT, shape2 = POS_INT) # n = number of draws; the larger shape1, the closer the draws sit to 1.0; the larger shape2, the closer to 0.0 Note: If shape1 or shape2 < 0, the result is NaN Note2: Beta(1,1) and the uniform distribution on (0,1) are both non-informative priors, as they treat every value from 0 to 1 as equally likely
    1. Translation of data into beta distribution Example of data: Most ads get clicked on 5% of the time, but for some ads it is as low as 2% and for others as high as 8%.
    • rbeta(n = n_draws, shape1 = 5, shape2 = 95) # Returns a distribution matching the example (centered around 5%)
    1. Model update using beta distribution
    • n_draws <- 100000
    • n_ads_shown <- 100
    • proportion_clicks <- rbeta(n_draws, shape1 = 5, shape2 = 95) # Replace uniform distribution with beta distribution
    • n_visitors <- rbinom(n_draws, size = n_ads_shown, prob = proportion_clicks)
    • prior <- data.frame(proportion_clicks, n_visitors)
    • posterior <- prior[prior$n_visitors == 13, ]

    2. Comparisons between groups and datasets

    2.1 Contrasts

    1. Old prior -> data -> old posterior
    2. Informed prior -> data -> informed posterior Note: which one to choose depends on how much you believe in data.

    2.2 Comparisons (between groups or datasets)

    Example: compare between video or text ads

    1. Set informed posterior
    • posterior <- data.frame(video_prop = posterior_video$proportion_clicks[1:4000], text_prop = posterior_text$proportion_clicks[1:4000])
    1. Calculate the posterior difference: video_prop - text_prop
    • posterior$prop_diff <- posterior$video_prop - posterior$text_prop
    • hist(posterior$prop_diff) # Visualize prop_diff
    • median(posterior$prop_diff) #Calculate the median of prop_diff
    • sum(posterior$prop_diff > 0) / length(posterior$prop_diff) # Calculate the proportion (prob video better than text)

    3. Decision analysis

    3.1 Concept

    Summary statistics of the posterior are not the same as summary statistics of the data: a posterior summary describes a parameter, while a data summary describes the dataset.

    3.2 Steps of small decision analysis

    1. Set parameter of interest -> e.g. profit and cost (comparing between video and text ads)
    • visitor_spend <- 2.53
    • video_cost <- 0.25
    • text_cost <- 0.05
    1. Calculate the parameters
    • posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost
    • posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost
    1. Calculate difference between groups
    • posterior$profit_diff <- posterior$video_profit - posterior$text_profit
    • median(posterior$profit_diff)
    1. Make a decision based on the probabilistic difference (profit_diff < 0 means text is better than video; > 0 means video is better than text)
    • sum(posterior$profit_diff < 0) / length(posterior$profit_diff)

    4. Change underlying statistical model

    4.1 Alternative scenario: completely switch out the binomial model for a Poisson model

    1. Have new data, e.g. a trial showed 19 clicks per day
    2. Change in practice e.g. don't pay per view but pay per day

    4.2 Poisson model

    1. The Poisson distribution: two parameters 1.1) lambda = mean number of events per time unit 1.2) n = number of times to simulate
    2. Steps of applying poisson model 2.1) Replace proportion with means and change the parameters
    • n_draws <- 100000
    • n_ads_shown <- 100
    • mean_clicks <- runif(n_draws, min = 0.0, max = 80) # Specify mean 2.2) Change the model from binomial to poisson
    • n_visitors <- rpois(n_draws, lambda = mean_clicks) # Specify lambda
    • prior <- data.frame(mean_clicks, n_visitors)
    • posterior <- prior[prior$n_visitors == 19, ] # Update data if needed

    5. Bayes is optimal, kind of...

    Bayes is optimal in the small world of the model

    1. If the uniform distribution assumption is wrong, the model stays wrong no matter how much data you add (e.g. we assumed at most a 20% click proportion - what if that is not true?)
    2. Bayesian data analysis: there is a separation between the model and the computation.

    Chapter 4. Bayesian inference with Bayes' theorem

    1. Probability rule

    1.1 Benefits and drawbacks of the basic inference method (filtering out impossible probabilities)

    1. Benefits 1.1) Bayesian computation is a hot research topic. 1.2) There are many methods to fit Bayesian models more efficiently. 1.3) The result will be the same, you'll just get it faster
    2. Drawback: the computation method we've used so far scales horribly.

    1.2 Probability theory

    1. Probability
    2. Mathematical notation 2.1) Probability, e.g. P(n_visitors = 13) 2.2) Probability distribution covering all possible numbers, e.g. P(n_visitors) 2.3) Conditional probability, where "|" means "given", e.g. P(n_visitors = 13 | prop_clicks = 10%) 2.4) Conditional probability distribution, e.g. P(n_visitors | prop_clicks = 10%), the same probability distribution as simulated before

    1.3 Manipulating probability

    1. The sum rule: if two possible outcomes are mutually exclusive, can sum them. E.g. p(1 or 2 or 3) = 1/6 + 1/6 + 1/6
    2. The product rule: if two possible outcomes are independent, can multiply them. E.g p(6 and 6) = 1/6 * 1/6
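    A quick numeric check of both rules with a fair die (plain arithmetic, nothing course-specific):

    • prob_1_or_2_or_3 <- 1/6 + 1/6 + 1/6 # Sum rule: mutually exclusive outcomes; equals 0.5
    • prob_double_six <- 1/6 * 1/6 # Product rule: independent rolls; equals about 0.028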

    2. Calculating likelihoods

    2.1 Simulation (simulate samples -> calculate prob)

    Example: P(n_visitors = 13 | prob_success = 10%)

    1. Simulate
    • n_visitors <- rbinom(n = 100000, size = 100, prob = 0.1)
    1. Calculate
    • sum(n_visitors == 13) / length(n_visitors)

    2.2 Calculation (in R, you can directly calculate prob, not requiring simulation)

    Example: P(n_visitors = 13 | prob_success = 10%)

    • dbinom(13, size = 100, prob = 0.1)

    2.3 Applications of the calculation

    1. The sum rule Example: P(n_visitors = 13 | prob_success = 10%)
    • dbinom(13, size = 100, prob = 0.1) + dbinom(14, size = 100, prob = 0.1)
    1. Getting whole probability distribution
    • n_visitors = seq(0, 100, by = 1)
    • probability <- dbinom(n_visitors, size = 100, prob = 0.1)
    1. Getting uniform continuous distribution
    • x = seq(0, 0.2, by=0.01)
    • dunif(x, min = 0.0, max = 0.2) # Returns probability density (not actual prob as prob must be c(0,1); d stands for density)
    1. Maximum likelihood estimation 4.1) Vary n_visitors and fix prob (want to know the most probable number of visitors)
    • n_ads_shown <- 100
    • proportion_clicks <- 0.1
    • n_visitors <- seq(0, 100, by = 1)
    • prob <- dbinom(n_visitors, size = n_ads_shown, prob = proportion_clicks) #Plot the distribution
    • plot(x = n_visitors, y = prob, type ="h") 4.2) Vary prob and fix n_visitors (want to know proportion_click)
    • n_ads_shown <- 100
    • proportion_clicks <- seq(0, 1, by = 0.01)
    • n_visitors <- 13
    • prob <- dbinom(n_visitors, size = n_ads_shown, prob = proportion_clicks); prob
    • plot(proportion_clicks, prob, type = "h")

    3. Bayesian calculation

    Bayesian inference by calculation

    1. Set parameters and combinations of the unknown
    • n_ads_shown <- 100 # Set number of ads to shown (we know)
    • n_visitors <- seq(0, 100, by = 1) # Set the Unknown
    • proportion_clicks <- seq(0, 1, by = 0.01) # Set the Unknown
    • PARAMETERS <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors) # List of all combinations
    1. Sample prior
    • PARAMETERS$PRIOR <- dunif(PARAMETERS$proportion_clicks, min = 0, max = 0.2)
    1. Calculate the likelihood of the data
    • PARAMETERS$LIKELIHOOD <- dbinom(PARAMETERS$n_visitors, size = n_ads_shown, prob = PARAMETERS$proportion_clicks)
    1. Calculate probability of clicks
    • PARAMETERS$PROBABILITY <- PARAMETERS$LIKELIHOOD * PARAMETERS$PRIOR
    • PARAMETERS$PROBABILITY <- PARAMETERS$PROBABILITY / sum(PARAMETERS$PROBABILITY) # Normalize
    1. Condition parameter we want to know in the model (e.g. visitor = 13) and retrieve calculated posterior 5.1) Filter method
    • PARAMETERS <- PARAMETERS[PARAMETERS$n_visitors == 13, ]
    • PARAMETERS$PROBABILITY <- PARAMETERS$PROBABILITY / sum(PARAMETERS$PROBABILITY)
    • plot(PARAMETERS$PROBABILITY, type = "h") 5.2) Shortcut method
    • n_visitors <- 13 # Replace with a number of interest

    4. Bayes' theorem

    4.1 Bayes' theorem

    P(PARAMETER|DATA) = [P(DATA|PARAMETER) x P(PARAMETER)] / sum([P(DATA|PARAMETER) x P(PARAMETER)]) Note:

    1. P(PARAMETER|DATA) = [POSTERIOR] the prob of different parameter values given some data (what we want)
    2. P(DATA|PARAMETER) = [LIKELIHOOD] OR the relative prob of the data given different parameter values
    3. P(PARAMETER) = [PRIOR] OR the prob of different parameters b/f seeing the data
    4. PARAMETER = a parameter combination
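    A minimal numeric illustration of the theorem on a toy grid (three candidate values of proportion_clicks instead of the full grid used above; the prior weights are made up for illustration):

    • proportion_clicks <- c(0.05, 0.10, 0.15) # A tiny parameter grid
    • prior <- rep(1/3, 3) # P(PARAMETER): equal prior weight on each candidate value
    • likelihood <- dbinom(13, size = 100, prob = proportion_clicks) # P(DATA | PARAMETER) for 13 visitors out of 100 ads
    • posterior <- likelihood * prior / sum(likelihood * prior) # P(PARAMETER | DATA), normalized so it sums to 1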

    4.2 Grid approximation

    1. A method that defines a grid over all the parameter combinations you need to evaluate
    2. It is not feasible to evaluate every possible combination of continuous parameters, so a finite grid is used as an approximation

    4.3 Mathematical notion of the generative model used so far

    1. n_ads_shown = 100
    2. proportion_clicks = unif(0.0,0.2)
    3. n_visitors = binom(n_ads_shown, proportion_clicks)

    Chapter 5. More parameters, more data, and more Bayes

    1. Normal distribution

    1.1 rnorm() and dnorm()

    • rnorm(n = INT, mean = NUM, sd = NUM) # Returns a sample of data
    • LIKELIHOOD <- dnorm(DATA, mean = NUM, sd = NUM) # Returns a vector of likelihoods (densities), one per data point
    • prod(LIKELIHOOD) # Multiply the likelihoods of all members of the vector
    • log(LIKELIHOOD) # Work on the log scale, as the product of many small likelihoods quickly becomes too small to represent
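    A short sketch of why the log scale matters; the data here are arbitrary simulated values, meant only to show the underflow:

    • DATA <- rnorm(1000, mean = 18, sd = 5) # 1000 hypothetical observations
    • LIKELIHOOD <- dnorm(DATA, mean = 18, sd = 5) # One density value per observation
    • prod(LIKELIHOOD) # The product of 1000 small numbers underflows to 0
    • sum(log(LIKELIHOOD)) # The summed log-likelihood stays finite and can still be compared between models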

    2. Build a Bayesian model

    2.1 Define the model by starting with PRIOR

    1. MU = normal(mean = 18, sd = 5)
    2. SIGMA = uniform(min = 0, max = 10)
    3. VALUE_INTEREST = normal(MU, SIGMA)
    4. DATA = c(X1,X2,X3,...) # Data should be "what we know"

    2.2 Fit the model with normal distribution (replace codes from the last chapter)

    1. Set parameters and combinations of the unknown
    • DATA <- c(X1,X2,X3,...) # Set the number we know
    • MU <- seq(8, 30, by = 0.5) # Set the unknown mean
    • SIGMA <- seq(0.1, 10, by = 0.3) # Set the unknown SD (upper bound matches the uniform(0, 10) prior above)
    • PARAMETERS <- expand.grid(MU = MU, SIGMA = SIGMA) # List of all combinations (parameter space)
    1. Sample prior
    • PARAMETERS$MU_PRIOR <- dnorm(PARAMETERS$MU, mean = 18, sd = 5) # Change function to dnorm and arguments
    • PARAMETERS$SIGMA_PRIOR <- dunif(PARAMETERS$SIGMA, min = 0, max = 10)
    • PARAMETERS$PRIOR <- PARAMETERS$MU_PRIOR * PARAMETERS$SIGMA_PRIOR
    1. Calculate the likelihoods of the data with for loop
    • for (i in 1:nrow(PARAMETERS)) { LIKELIHOODS <- dnorm(DATA, mean = PARAMETERS$MU[i], sd = PARAMETERS$SIGMA[i]); PARAMETERS$LIKELIHOOD[i] <- prod(LIKELIHOODS) }
    1. Calculate probability (posterior)
    • PARAMETERS$PROBABILITY <- PARAMETERS$LIKELIHOOD * PARAMETERS$PRIOR
    • PARAMETERS$PROBABILITY <- PARAMETERS$PROBABILITY / sum(PARAMETERS$PROBABILITY) # Normalize
    1. Condition on the parameter(s) we want to know and retrieve the calculated posterior 5.1 Translate what we want to know into questions (e.g. water temperature, to decide whether to have a beach party) 5.1.1) Q1: What is the likely average water temperature on the 20th of July? 5.1.2) Q2: What is the probability that the water temperature will be 18 or more on the next 20th of July (threshold of interest = 18 or more)? 5.2 Sample from the posterior with a large number of draws (makes the summaries easier)
    • SAMPLE_INDICES <- sample(1:nrow(PARAMETERS), size = INT, replace = TRUE, prob = PARAMETERS$PROBABILITY) # Set sample size large enough 5.3 Select MU and SIGMA for the sampled rows
    • PARAMETERS_SAMPLE <- PARAMETERS[SAMPLE_INDICES, c("MU", "SIGMA")] 5.4 Overview the selected sample
    • hist(PARAMETERS_SAMPLE$MU, 30)
    • quantile(PARAMETERS_SAMPLE$MU, c(0.05, 0.95)) # Calc. 90% credible interval 5.5 Predict the posteriors and condition according to 5.1
    • PRED_VALUE_INTEREST <- rnorm(INT, mean = PARAMETERS_SAMPLE$MU, sd = PARAMETERS_SAMPLE$SIGMA) # Set number of values to simulate
    • sum(PRED_VALUE_INTEREST >= 18) / length(PRED_VALUE_INTEREST) # Returns the probability

    3. Practical tool for building Bayesian models (BEST)

    3.1 Background and concept

    1. BEST (Bayesian Estimation Supersedes the t-test) = a Bayesian model developed by John Kruschke
    2. Assumption = use of the t-distribution (like a normal distribution with an extra parameter, the degrees of freedom). If df > 30, the t-distribution is shaped like a normal distribution; with fewer df it has heavier tails, so outliers are expected (under a normal model an added outlier strongly shifts the estimated mean; the t-distribution is more robust). See the quick comparison after this list.
    3. Application: 3.1) Estimates the MEAN, SD, and DEGREES OF FREEDOM parameters 3.2) Uses Markov chain Monte Carlo (MCMC) to fit the model (returns a table of posterior samples) 3.3) Compares estimated parameters between two groups
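    A quick base-R comparison of the tails (the df value is chosen arbitrarily for illustration):

    • dnorm(4) # Standard normal density 4 SDs from the mean: about 0.0001
    • dt(4, df = 3) # t-distribution with 3 df at the same point: about 0.009, a much heavier tail

    Because the heavy tails give extreme values non-negligible probability, a single outlier pulls the estimated mean far less than it would under a normal model.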

    3.2 Steps and codes

    • library(BEST)
    • DATA_1 <- c(X1,X2,...) # Specify group 1 data
    • DATA_2 <- c(X1,X2,...) # Specify group 2 data
    • FIT <- BESTmcmc(DATA_1,DATA_2) # Returns mean (mu), df (nu) and sd (sigma)
    • plot(FIT)

    4. Summary

    Bayesian Inference

    1. Data
    2. Bayesian Model 2.1) Priors: the prior probability over unknown parameters 2.2) Generative model 2.2.1) rbinom: binomial distribution, Poisson distribution 2.2.2) rnorm: normal distribution, t-distribution
    3. Computational method 3.1) Rejection sampling: sampling from the generative model -> filter samples that don't match 3.2) Grid approximation: directly use Bayesian theorem by calculating probabilities over grid of parameter combinations 3.3) Markov chain Monte Carlo (MCMC)
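    As a compact recap of 3.1, a rejection-sampling sketch using the same ad-clicks example as the earlier chapters (numbers as in the course exercises):

    • n_draws <- 100000
    • proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2) # Prior
    • n_visitors <- rbinom(n_draws, size = 100, prob = proportion_clicks) # Generative model
    • posterior <- proportion_clicks[n_visitors == 13] # Keep only the samples that reproduce the observed data
    • quantile(posterior, c(0.05, 0.5, 0.95)) # Posterior summary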

    What was not covered?

    1. Other types of advanced machine learning: deep learning, time series
    2. How to decide what priors and models to use
    3. How Bayesian statistics relate to classical statistics
    4. More advanced computational methods (besides MCMC)

    Exercise (Course 7: Fundamentals of Bayesian Data Analysis in R)

    1. Coin flips with prop_model

    • data <- c(1, 0, 0, 1)
    • prop_model(data)

    2. Zombie drugs with prop_model

    #Update the data and rerun prop_model (2 cured zombies out of 13)
    • data = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    • prop_model(data)

    3. Looking at samples from prop_model

    • data = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0) #Extract and explore the posterior
    • posterior <- prop_model(data)
    • print(head(posterior)) #Extract and explore the posterior
    • posterior <- prop_model(data)
    • head(posterior) #Plot the histogram of the posterior
    • hist(posterior)
    • data = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0) #Extract and explore the posterior
    • posterior <- prop_model(data)
    • head(posterior) #Edit the histogram
    • hist(posterior, breaks = 30, xlim = c(0,1), col = "palegreen4")

    4. Summarizing the zombie drug experiment

    • data = c(1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
    • posterior <- prop_model(data)
    • hist(posterior, breaks = 30, xlim = c(0, 1), col = "palegreen4") #Calculate the median
    • median(posterior) #Calculate the credible interval
    • quantile(posterior, c(0.05,0.95)) #Calculate the sum (posterior > 7%)
    • sum(posterior > 0.07) #Calculate the probability
    • sum(posterior >0.07)/length(posterior)

    5. Take a generative model for a spin

    #The generative zombie drug model #Set parameters
    • prop_success <- 0.42
    • n_zombies <- 100 #Simulating data
    • data <- c()
    • for(zombie in 1:n_zombies) { data[zombie] <- runif(1, min = 0, max = 1) < prop_success }
    • data <- as.numeric(data) #Count cured
    • data <- sum(data)

    6. Take the binomial distribution for a spin

    • rbinom(n = 1, size = 100, prob = 0.42)
    • rbinom(n = 200, size = 100, prob = 0.42)

    7. How many visitors could your site get (1)?

    #Fill in the parameters
    • n_samples <- 100000
    • n_ads_shown <- 100
    • proportion_clicks <- 0.1
    • n_visitors <- rbinom(n_samples, size = n_ads_shown, prob = proportion_clicks) #Visualize n_visitors
    • hist(n_visitors)

    8. Adding a prior to the model

    • n_samples <- 100000
    • n_ads_shown <- 100 #Update proportion_clicks
    • proportion_clicks <- runif(n = n_samples, min = 0.0, max = 0.2)
    • n_visitors <- rbinom(n = n_samples, size = n_ads_shown, prob = proportion_clicks) #Visualize proportion clicks
    • hist(proportion_clicks) # See uniform distribution #Visualize n_visitors
    • hist(n_visitors)

    9. Update a Bayesian model with data

    #Create the prior data frame
    • prior <- data.frame(proportion_clicks, n_visitors) #Examine the prior data frame
    • head(prior) #Create the posterior data frame
    • posterior <- prior[prior$n_visitors == 13, ] #Visualize posterior proportion clicks
    • hist(posterior$proportion_clicks)

    10. How many visitors could your site get (3)?

    #Assign posterior to a new variable called prior
    • prior <- posterior #Take a look at the first rows in prior
    • head(prior) #Replace prior$n_visitors with a new sample and visualize the result
    • n_samples <- nrow(prior)
    • n_ads_shown <- 100
    • prior$n_visitors <- rbinom(n = n_samples, size = n_ads_shown, prob = prior$proportion_clicks)
    • hist(prior$n_visitors) #Calculate the probability that you will get 5 or more visitors
    • sum(prior$n_visitors >= 5) / length(prior$n_visitors)

    11. Explore using the Beta distribution as a prior

    #Explore using the rbeta function
    • beta_sample <- rbeta(n = 1000000, shape1 = 1, shape2 = 1) #Visualize the results
    • hist(beta_sample) #Modify the parameters
    • beta_sample <- rbeta(n = 1000000, shape1 = -1, shape2 = 1) #Explore the results
    • head(beta_sample) #Modify the parameters
    • beta_sample <- rbeta(n = 1000000, shape1 = 100, shape2 = 100) #Visualize the results
    • hist(beta_sample) #Modify the parameters
    • beta_sample <- rbeta(n = 1000000, shape1 = 100, shape2 = 20) #Visualize the results
    • hist(beta_sample)

    12. Change the model to use an informative prior

    • n_draws <- 100000
    • n_ads_shown <- 100 #Change the prior on proportion_clicks
    • proportion_clicks <- rbeta(n_draws, shape1 = 5, shape2 = 95)
    • n_visitors <- rbinom(n_draws, size = n_ads_shown, prob = proportion_clicks)
    • prior <- data.frame(proportion_clicks, n_visitors)
    • posterior <- prior[prior$n_visitors == 13, ] #This plots the prior and the posterior in the same plot
    • par(mfcol = c(2, 1))
    • hist(prior$proportion_clicks, xlim = c(0, 0.25))
    • hist(posterior$proportion_clicks, xlim = c(0, 0.25))

    13. Fit the model using another dataset

    #Define parameters
    • n_draws <- 100000
    • n_ads_shown <- 100
    • proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2)
    • n_visitors <- rbinom(n = n_draws, size = n_ads_shown, prob = proportion_clicks)
    • prior <- data.frame(proportion_clicks, n_visitors) #Create the posteriors for video and text ads
    • posterior_video <- prior[prior$n_visitors == 13, ]
    • posterior_text <- prior[prior$n_visitors == 6, ] #Visualize the posteriors
    • hist(posterior_video$proportion_clicks, xlim = c(0, 0.25))
    • hist(posterior_text$proportion_clicks, xlim = c(0, 0.25))

    14. Calculating the posterior difference

    • posterior <- data.frame(video_prop = posterior_video$proportion_clicks[1:4000], text_prop = posterior_text$proportion_clicks[1:4000]) #Calculate the posterior difference: video_prop - text_prop
    • posterior$prop_diff <- posterior$video_prop - posterior$text_prop #Visualize prop_diff
    • hist(posterior$prop_diff) #Calculate the median of prop_diff
    • median(posterior$prop_diff) #Calculate the proportion
    • sum(posterior$prop_diff > 0) / length(posterior$prop_diff)

    15. A small decision analysis 1

    • visitor_spend <- 2.53
    • video_cost <- 0.25
    • text_cost <- 0.05 #Add the column posterior$video_profit
    • posterior$video_profit <- posterior$video_prop * visitor_spend - video_cost #Add the column posterior$text_profit
    • posterior$text_profit <- posterior$text_prop * visitor_spend - text_cost #Visualize the video_profit and text_profit columns
    • hist(posterior$video_profit)
    • hist(posterior$text_profit)

    16. A small decision analysis 2

    #Add the column posterior$profit_diff
    • posterior$profit_diff <- posterior$video_profit - posterior$text_profit #Visualize posterior$profit_diff
    • hist(posterior$profit_diff) #Calculate a "best guess" for the difference in profits
    • median(posterior$profit_diff) #Calculate the probability that text ads are better than video ads
    • sum(posterior$profit_diff < 0) / length(posterior$profit_diff)

    17. The Poisson distribution

    #Simulate from a Poisson distribution and visualize the result
    • x <- rpois(n = 10000, lambda = 3)
    • hist(x) #Simulate from a Poisson distribution and visualize the result
    • x <- rpois(n = 10000, lambda = 11.5)
    • hist(x) #Calculate the probability of break-even assuming you won't break even unless you sell 15 or more ice creams
    • sum(x >=15)/length(x)

    18. Clicks per day instead of clicks per ad

    #The original clicks-per-ad model (binomial), before the change
    • n_draws <- 100000
    • n_ads_shown <- 100
    • proportion_clicks <- runif(n_draws, min = 0.0, max = 0.2)
    • n_visitors <- rbinom(n_draws, size = n_ads_shown, prob = proportion_clicks)
    • prior <- data.frame(proportion_clicks, n_visitors)
    • posterior <- prior[prior$n_visitors == 13, ] #Replace proportion_clicks with mean_clicks, change the parameters, and change this model so that it uses a Poisson distribution
    • n_draws <- 100000
    • n_ads_shown <- 100
    • mean_clicks <- runif(n_draws, min = 0, max = 80)
    • n_visitors <- rpois(n_draws, lambda = mean_clicks)
    • prior <- data.frame(mean_clicks, n_visitors)
    • posterior <- prior[prior$n_visitors == 19, ] #Visualize mean_clicks
    • hist(prior$mean_clicks)
    • hist(posterior$mean_clicks)

    19. Cards and the sum rule

    #Calculate the probability of drawing any of the four aces
    • prob_to_draw_ace <- 1/52 + 1/52 + 1/52 + 1/52

    20. Cards and the product rule

    #Calculate the probability of picking four aces in a row
    • prob_to_draw_four_aces <- 4/52 * 3/51 * 2/50 * 1/49

    21. From rbinom to dbinom

    #rbinom version
    • n_ads_shown <- 100
    • proportion_clicks <- 0.1
    • n_visitors <- rbinom(n = 99999, size = n_ads_shown, prob = proportion_clicks)
    • prob_13_visitors <- sum(n_visitors == 13) / length(n_visitors) #dbinom version
    • n_ads_shown <- 100
    • proportion_clicks <- 0.1
    • prob_13_visitors <- dbinom(x = 13, size = n_ads_shown, prob = proportion_clicks)

    22. Calculating probabilities with dbinom

    #Change the code to calculate probability distribution of n_visitors
    • n_ads_shown <- 100
    • proportion_clicks <- 0.1
    • n_visitors <- seq(0, 100, by = 1)
    • prob <- dbinom(n_visitors, size = n_ads_shown, prob = proportion_clicks) #Plot the distribution
    • plot(x = n_visitors, y = prob, type ="h") #Flip the chart (fix n_visitor and vary proportion_clicks -> want to know proportion_click)
    • n_ads_shown <- 100
    • proportion_clicks <- seq(0, 1, by = 0.01)
    • n_visitors <- 13
    • prob <- dbinom(n_visitors, size = n_ads_shown, prob = proportion_clicks); prob
    • plot(proportion_clicks, prob, type = "h")

    23. Calculating a joint distribution

    • n_ads_shown <- 100
    • proportion_clicks <- seq(0, 1, by = 0.01)
    • n_visitors <- seq(0, 100, by = 1)
    • pars <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors)
    • pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
    • pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown, prob = pars$proportion_clicks) #Add the column pars$probability and normalize it
    • pars$probability <- pars$likelihood * pars$prior / sum(pars$likelihood * pars$prior)

    24. Conditioning on the data (again)

    • n_ads_shown <- 100
    • proportion_clicks <- seq(0, 1, by = 0.01)
    • n_visitors <- seq(0, 100, by = 1)
    • pars <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors)
    • pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
    • pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown, prob = pars$proportion_clicks)
    • pars$probability <- pars$likelihood * pars$prior
    • pars$probability <- pars$probability / sum(pars$probability) #Condition on the data
    • pars <- pars[pars$n_visitors == 6, ] #Normalize again
    • pars$probability <- pars$probability / sum(pars$probability) #Plot the posterior pars$probability
    • plot(pars$probability, type = "h")

    25. A conditional shortcut

    #Simplify the code below by directly conditioning on the data
    • n_ads_shown <- 100
    • proportion_clicks <- seq(0, 1, by = 0.01)
    • n_visitors <- 6
    • pars <- expand.grid(proportion_clicks = proportion_clicks, n_visitors = n_visitors)
    • pars$prior <- dunif(pars$proportion_clicks, min = 0, max = 0.2)
    • pars$likelihood <- dbinom(pars$n_visitors, size = n_ads_shown, prob = pars$proportion_clicks)
    • pars$probability <- pars$likelihood * pars$prior
    • pars$probability <- pars$probability / sum(pars$probability)
    • pars <- pars[pars$n_visitors == 6, ] # Redundant now that n_visitors is fixed at 6
    • pars$probability <- pars$probability / sum(pars$probability)
    • plot(pars$probability, type = "h")

    26. rnorm, dnorm, and the weight of newborns

    #Assign mu and sigma
    • mu <- mean(c(3164, 3362, 4435, 3542, 3578, 4529))
    • sigma <- sd(c(3164, 3362, 4435, 3542, 3578, 4529))
    • weight_distr <- rnorm(n = 100000, mean = mu, sd = sigma)
    • hist(weight_distr, 60, xlim = c(0, 6000), col = "lightgreen") #Assign mu and sigma
    • mu <- 3500
    • sigma <- 600
    • weight_distr <- rnorm(n = 100000, mean = mu, sd = sigma)
    • hist(weight_distr, 60, xlim = c(0, 6000), col = "lightgreen") #Create weight
    • weight <- seq(0, 6000, by = 100) #Calculate likelihood
    • likelihood <- dnorm(weight, mu, sigma) #Plot the distribution of weight
    • plot(weight, likelihood, type = "h")

    27. A Bayesian model of Zombie IQ

    #The IQ of a bunch of zombies
    • iq <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46) #Define the parameter grid
    • pars <- expand.grid(mu = seq(0, 150, length.out = 100), sigma = seq(0.1, 50, length.out = 100)) #Define and calculate the prior density for each parameter combination
    • pars$mu_prior <- dnorm(pars$mu, mean = 100, sd = 100)
    • pars$sigma_prior <- dunif(pars$sigma, min = 0.1, max = 50)
    • pars$prior <- pars$mu_prior * pars$sigma_prior #Calculate the likelihood for each parameter combination
    • for(i in 1:nrow(pars)) { likelihoods <- dnorm(iq, mean = pars$mu[i], sd = pars$sigma[i]); pars$likelihood[i] <- prod(likelihoods) } #Calculate the probability of each parameter combination
    • pars$probability <- pars$likelihood * pars$prior / sum(pars$likelihood * pars$prior)

    28. Sampling from the zombie posterior

    • head(pars)
    • sample_indices <- sample( nrow(pars), size = 10000, replace = TRUE, prob = pars$probability)
    • head(sample_indices) #Sample from pars to calculate some new measures
    • pars_sample <- pars[sample_indices, c("mu", "sigma")] #Visualize the mean IQ
    • hist(pars_sample$mu, 100, col = 'blue') #Calculate quantiles
    • quantile(pars_sample$mu, c(0.025, 0.5, 0.975))

    29. But how smart will the next zombie be?

    • head(pars_sample)
    • pred_iq <- rnorm(10000, mean = pars_sample$mu, sd = pars_sample$sigma) #Visualize pred_iq
    • hist(pred_iq) #Calculate the probability of a zombie being "smart" (+60 IQ)
    • sum(pred_iq >= 60)/length(pred_iq)

    30. The BEST models and zombies on a diet

    #The IQ of zombies on a regular diet and a brain based diet.
    • iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
    • iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 46) #Calculate the mean difference in IQ between the two groups
    • mean(iq_brains - iq_regular) #Fit the BEST model to the data from both groups
    • library(BEST)
    • best_posterior <- BESTmcmc(iq_brains, iq_regular) #Plot the model result
    • plot(best_posterior)

    31. BEST is robust

    #The IQ of zombies given a regular diet and a brain based diet.
    • iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
    • iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 150) #Modify the data above and calculate the difference in means
    • mean(iq_brains - iq_regular) #The IQ of zombies given a regular diet and a brain based diet.
    • iq_brains <- c(44, 52, 42, 66, 53, 42, 55, 57, 56, 51)
    • iq_regular <- c(55, 44, 34, 18, 51, 40, 40, 49, 48, 150) # <- Mutant zombie #Modify the data above and calculate the difference in means
    • mean(iq_brains) - mean(iq_regular) #Fit the BEST model to the modified data and plot the result
    • library(BEST)
    • best_posterior <- BESTmcmc(iq_brains, iq_regular)
    • plot(best_posterior)

    Course 8: Hyperparameter Tuning in R

    Chapter 1. Introduction to hyperparameters

    1. Parameters vs hyperparameters

    1.1 Parameters (model parameters)

    1. Definition = model parameters are being fit DURING training, and they are the result of model fitting or training
    2. Examples in linear model: coefficients (found during fitting)
    3. Examples in machine learning model: weights, biases of neural nets that are optimized during training (model parameters)

    1.2 Hyperparameters

    1. Definition = parameters being set BEFORE training, and they specify HOW the training is supposed to happen. (most are optional)
    2. Examples in linear model: method (an option to set before fitting)
    3. Examples in machine learning model: learning rates, weight decay, number of trees in random forest
    4. Why need tuning? = Want to find the best combination of hyperparameters

    2. Recap of machine learning basics

    2.1 Splitting data (ML with caret)

    1. Load
    • library(caret)
    • set.seed(42)
    1. Set partition index (proportion for training dataset)
    • INDEX <- createDataPartition(FULL_DF$RESP_VAR, p = 0.70, list = FALSE) # Set partition proportion, e.g. 70:30
    1. Split data into training and testing dataset
    • TRAINING_DATA <- FULL_DF[INDEX,]
    • TESTING_DATA <- FULL_DF[-INDEX,] Note: training dataset should have enough power and test set should be representative

    2.2 Train ML with caret

    1. Set up cross-validation
    • library(caret)
    • library(tictoc) # Used to estimate how long the run takes
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5) # Set cross-validation (number of folds and number of repeats) Note: Valid resampling methods in caret
      • LGOCV (leave-group-out cross-validation)
    • boot (note: AdaBoost is a boosting algorithm, not a resampling scheme)
      • cv
    1. Train a random forest model with caret (automatic tuning)
    • tic()
    • set.seed(INT)
    • rf_model <- train(Y ~ X, data = TRAINING_DATA, method ="rf", trControl = fitControl, # Specify validation
      verbose =FALSE)
    • toc()

    2.3 List of mode methods in caret

    • "gbm" = Stochastic Gradient Boosting model
    • "nnet" = Regulat neural network
    • "rf" = Random forest
    • "svmPoly" = Support Vector Machines

    3. Hyperparameter tuning

    3.1 Specific hyperparameters for model algorithm

    • modelLookup("gbm") # Specify model
    • https://topepo.github.io/caret/available-models.html

    3.2 Hyperparameter tuning in caret (e.g. gbm)

    1. Define hyperparameter grid (Cartesian grid)
    • HYPERPARAMETERS <- expand.grid(n.trees = 200, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10) # n.trees = number of trees, interaction.depth = tree complexity, shrinkage = learning rate, n.minobsinnode = min observations in a node when splitting
    1. Apply hyperparameter grid to train().
    • set.seed(INT)
    • gbm_model <- train(Y ~ X, data = TRAINING_DATA, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3), verbose = FALSE, tuneGrid = HYPERPARAMETERS)

    3.3 Hyperparameter tuning in Support Vector Machines (SVM)

    1. tuneLength
    • tic()
    • set.seed(INT)
    • SVM <- train(Y ~ X,
      data = TRAINING_DATA,
      method ="svmPoly",
      trControl = fitControl,
      verbose =FALSE,
      tuneLength =5) # Specify maximum number of tuning parameter combinations
    • toc()
    1. tuneGrid and expand.grid (manual method)
    • HYPERPARAMETERS <- expand.grid(degree =4, scale =1, C =1) # Specify grid or hyperparameter(s) to tune
    • set.seed(INT)
    • SVM <- train(Y ~ X,
      data = TRAINING_DATA,
      method ="svmPoly",
      trControl = fitControl,
    verbose = FALSE, tuneGrid = HYPERPARAMETERS) # Specify tuneGrid
    • toc()

    Chapter 2. Hyperparameter tuning with caret

    1. Hyperparameter tuning in caret

    1.1 Method 1: Cartesian grid search with caret

    1. Concept: define a grid of possible tuning parameters and run them all; con = takes a long time to run
    2. Commands
    • MANUAL_GRID <- expand.grid(n.trees = c(100, 200, 250), interaction.depth = c(1, 4, 6), shrinkage = 0.1, n.minobsinnode = 10)

    1.2 Method 2: Plot hyperparameter models

    • plot(GBM_MODEL) # Plots a graph with one line per hyperparameter value
    • plot(GBM_MODEL, metric = "Kappa", plotType = "level") # Use a level plot Note: Kappa is used to evaluate the performance of a classification model (comparing expected vs. observed accuracy; higher Kappa = better)

    2. Grid Search (Method 1) vs. Random Search (Method 3)

    2.1 Grid search (Cartesian grid)

    1. Concept: you specify the candidate values to try; cons = gets slow and computationally expensive (especially with a big grid)
    2. Commands 2.1) Define a grid
    • BIG_GRID <- expand.grid(n.trees = seq(from = 10, to = 300, by = 50), interaction.depth = seq(from = 1, to = 10, length.out = 6), shrinkage = 0.1, n.minobsinnode = 10)
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "grid") # Specify search = "grid" 2.2) Set the "tuneGrid" argument
    • gbm_model <- train(Y ~ X, data = TRAINING_DATA, method = "gbm", trControl = fitControl, verbose = FALSE, tuneGrid = BIG_GRID)

    2.2 Method 3: Random search

    1. Concept: random values; pros = faster than grid search
    2. Note: in caret, random search cannot be combined with grid search
    3. Commands 3.1) Define random search in trainControl function
    • library(caret)
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "random") # Specify argument search "random" 3.2) Set "tuneLength" argument
    • tic()
    • set.seed(INT)
    • GBM_MODEL <- train(Y ~X, data = TRAINING_DATA, method = "gbm", trControl = fitControl, verbose = FALSE, tuneLength = 5) # Set tuneLength or maximum number of tuning parameter combinations
    • toc()

    3. Method 4: Adaptive resampling

    3.1 Definition and concept

    1. Grid search (method 1) = all hyperparameters are computed.
    2. Random search (method 3) = random subsets of hyperparameters are computed.
    3. Adaptive resampling = hyperparameter combinations are resampled with values near combinations that performed well (suboptimal is not tested). Pros = faster and more efficient, but not better than methods 1 or 3

    3.2 Steps and commands

    1. Set adaptive resampling in trainControl()
    • fitControl <- trainControl(method = "adaptive_cv", adaptive = list(min = 2, # min no of resamples per hyperparam alpha = 0.05, # confidence level for removing hyperparam method = "gls", # "gls" = linear; "BT = Bradley Terry complete = TRUE), # TRUE = generate full resampling set search = "random")
    1. Apply trainControl() and set tuneLength argument
    • tic()
    • set.seed(INT)
    • GBM_MODEL <- train(Y ~ X, data = TRAINING_DATA, method = "gbm", trControl = fitControl, verbose = FALSE, tuneLength = 7) # Set tuneLength or maximum number of tuning parameter combinations (in reality, maybe 100)(default = 3)
    • toc()

    Chapter 3. Hyperparameter tuning with mlr

    1. Machine learning with mlr

    1.1 Basics

    1. mlr = another framework for ML in R
    2. Model training follows 3 steps 2.1) Define the task 2.2) Define the learner 2.3) Fit the model

    1.2 Tasks in mlr for supervised learning

    1. RegrTask(): for regression
    2. ClassifTask(): for binary and multi-class classification
    3. MultilabelTask(): for multi-label classification problems
    4. CostSensTask(): for general cost-sensitive classification

    1.3 Learners in mlr

    • listLearners() # Return available learners: classif. = classification learner, regr. = regression learner, multilabel. = multi-label classification

    1.4 Steps and commands

    1. Define the task
    • TASK <- makeClassifTask(data = DF, target = "RESP_VAR")
    1. Define the learner
    • LRN <- makeLearner("classif.h2o.deeplearning", # Enter a learner fix.factors.prediction = TRUE, # Handle missing data predict.type = "prob")
    1. Fit the model
    • MODEL <- train(LRN, TASK)

    2. Hyperparameter tuning with mlr (grid and random search)

    2.1 Requirements for tuning in mlr

    1. Search space for every hyperparameter
    2. Tuning method (e.g. grid or random)
    3. Resampling strategy

    2.2 Steps and commands

    1. Define the search space
    • getParamSet("classif.h2o.deeplearning") # Returns hyperparameter in the model
    • PARAM_SET <- makeParamSet( makeDiscreteParam("hidden", values =list(one =10, two =c(10,5,10))), makeDiscreteParam("activation", values =c("Rectifier","Tanh")), makeNumericParam("l1", lower =0.0001, upper =1),
      makeNumericParam("l2", lower =0.0001, upper =1))
    1. Define the tuning method 2.1) Grid search: ONLY deal with discrete hyperparameter sets
    • CTRL_GRID <- makeTuneControlGrid() 2.2) Random search: CAN deal with all kinds of hyperparameter
    • CTRL_RANDOM <- makeTuneControlRandom(maxit = INT) # Specify max number
    1. Define resampling strategy
    • CROSS_VAL <- makeResampleDesc("RepCV", predict = "both", folds = 5 * 3) # Could also be "Holdout"; predict = "both" to predict on training AND validation data
    1. Apply 3 steps of model training
    • TASK <- makeClassifTask(data = DF, target = "RESP_VAR")
    • LRN <- makeLearner("classif.h2o.deeplearning", # Enter a learner fix.factors.prediction = TRUE, # Handle missing data predict.type = "prob")
    • MODEL <- train(LRN, TASK)
    1. Tune the model
    • LRN_TUNE <- tuneParams(LRN, TASK, resampling = CROSS_VAL, control = CTRL_GRID, par.set = PARAM_SET)

    3. Evaluating hyperparameters with mlr

    3.1 Benefits of hyperparameter evaluation

    1. How different hyperparameters affect the performance of our model
    2. Which hyperparameters have a strong or weak impact on our model performance
    3. Whether our hyperparameter search converged / whether we can be confident that we found the most optimal hyperparameter combination

    3.2 Steps and commands

    1. See tuning results
    • HYPERPAR_EFFECTS <- generateHyperParsEffectData(LRN_TUNE, partial.dep =TRUE) # Set to TRUE if tuning > 2 params
    1. Plot hyperparameter tuning results
    • plotHyperParsEffect(HYPERPAR_EFFECTS, partial.dep.learn ="regr.randomForest", x ="l1", y ="mmce.test.mean", z ="hidden",
      plot.type ="line")

    4. Advanced tuning with mlr

    4.1 List of advanced tuning controls

    1) makeTuneControlCMAES: CMA Evolution Strategy 2) makeTuneControlDesign: Predefined data frame of hyperparameters 3) makeTuneControlGenSA: Generalized simulated annealing (non-linear) 4) makeTuneControlIrace: Tuning with iterated F-Racing (config) 5) makeTuneControlMBO: Model-based / Bayesian optimization (uses Bayesian statistics to optimize)

    4.2 Steps and commands (examples with makeTuneControlGenSA)

    1. Generalized simulated annealing
    • CTRL_GENSA <- makeTuneControlGenSA()
    1. Create bootstrap sampling
    • BOOTSTRAP <- makeResampleDesc("Bootstrap", predict ="both")
    1. Perform tuning
    • LRN_TUNE <- tuneParams(learner = LRN, task = TASK, resampling = BOOTSTRAP,
      control = CTRL_GENSA,
      par.set = PARAM_SET,
    measures = list(acc, setAggregation(acc, train.mean), mmce, setAggregation(mmce, train.mean))) # Customize measures
    1. Nested cross-validation & nested resampling 4.1) Train directly
    • MODEL_NESTED <- train(LRN, TASK)
    • getTuneResult(MODEL_NESTED) 4.2) Add 2x nested validation
    • CV2 <- makeResampleDesc("CV", iters =2)
    • RES <- resample(LRN, TASK, resampling = CV2,
      extract = getTuneResult)
    • generateHyperParsEffectData(RES)
    1. Choose hyperparameters from a tuning set
    • LRN_BEST <- setHyperPars(LRN, par.vals =list(minsplit =4, minbucket =3, maxdepth =6))
    • MODEL_BEST <- train(LRN_BEST, TASK)
    • predict(MODEL_BEST, newdata = DF)

    Chapter 4. Hyperparameter tuning with h2o

    1. Machine learning with h2o

    1.1 Basics

    1. h2o = an open-source framework for machine learning in R
    2. Difference from caret and mlr = scalability

    1.2 Steps and commands

    1. Initiate the package
    • library(h2o)
    • h2o.init() # Local initiate
    1. Prepare the data 2.1) Define H2o frame
    • DF_HF <- as.h2o(DF) 2.2) Define features and target variable
    • TARGET <- "VAR"
    • FEATURE <- setdiff(colnames(DF_HF), TARGET) Note: for classification target variable
    • DF_HF[, TARGET] <- as.factor(DF_HF[, TARGET])
    1. Split training, validation and test sets 3.1) Split data
    • SFRAME <- h2o.splitFrame(data = DF_HF, ratios = c(0.7, 0.15), seed = INT) # Default ratios = c(0.75, 0.25); set seed
    • TRAIN_DATA <- SFRAME[[1]]
    • VALIDATE_DATA <- SFRAME[[2]]
    • TEST_DATA <- SFRAME[[3]] 3.2) See ratio
    • summary(TRAIN_DATA$VAR, exact_quantiles = TRUE)
    1. Train model
    • GBM_MODEL <- h2o.gbm(x = FEATURE, y = TARGET, training_frame = TRAIN_DATA, validation_frame = VALIDATE_DATA) Note: available models in h2o: h2o.gbm() & h2o.xgboost() = gradient boosted models; h2o.glm() = generalized linear models; h2o.randomForest() = random forest models; h2o.deeplearning() = neural networks
    1. Predict new data
    • h2o.predict(GBM_MODEL, TEST_DATA)
    1. Evaluate the model performance
    • PERFORMANCE <- h2o.performance(GBM_MODEL, TEST_DATA)
    • h2o.confusionMatrix(PERFORMANCE)
    • h2o.logloss(PERFORMANCE)

    2. Hyperparameter tuning with h2o (grid and random search)

    2.1 Hyperparameters in H2O models

    • ?h2o.gbm # Search hyperparameters for gradient boosting

    2.2 Steps and commands [search in step 6)]

    1. Initiate the package
    • library(h2o)
    • h2o.init() # Local initiate
    1. Prepare the data 2.1) Define H2o frame
    • DF_HF <- as.h2o(DF) 2.2) Define features and target variable
    • TARGET <- "VAR"
    • FEATURE <- setdiff(colnames(DF_HF), TARGET) Note: for classification target variable
    • DF_HF[, TARGET] <- as.factor(DF_HF[, TARGET])
    1. Split training, validation and test sets 3.1) Split data
    • SFRAME <- h2o.splitFrame(data = DF_HF, ratios = c(0.7, 0.15), seed = INT) # Default ratios = c(0.75, 0.25); set seed
    • TRAIN_DATA <- SFRAME[[1]]
    • VALIDATE_DATA <- SFRAME[[2]]
    • TEST_DATA <- SFRAME[[3]] 3.2) See ratio
    • summary(TRAIN_DATA$VAR, exact_quantiles = TRUE)
    1. Train model
    • GBM_MODEL <- h2o.gbm(x = FEATURE, y = TARGET, training_frame = TRAIN_DATA, validation_frame = VALIDATE_DATA)
    1. Manage the hyperparameter grid 5.1) Define hyperparameters
    • GBM_PARAMS <- list(ntrees = c(100, 150, 200), max_depth = c(3, 5, 7), learn_rate = c(0.001, 0.01, 0.1)) 5.2) Define grid
    • GBM_GRID <- h2o.grid("gbm", grid_id = "gbm_grid", x = FEATURE, y = TARGET, training_ frame = train, validation _ frame = valid, seed = 42, hyper _params = gbm _params)
    1. Predict new data
    • h2o.predict(GBM_MODEL, TEST_DATA)
    1. Evaluate the model performance
    • PERFORMANCE <- h2o.performance(GBM_MODEL, TEST_DATA)
    • h2o.confusionMatrix(PERFORMANCE)
    • h2o.logloss(PERFORMANCE)
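    The steps above evaluate GBM_MODEL rather than the grid itself; as a hedged sketch (assuming h2o's grid accessors h2o.getGrid() and h2o.getModel() behave as in current h2o releases), the tuned models can be inspected and the best one evaluated like this:

    • GBM_GRID_SORTED <- h2o.getGrid(grid_id = "gbm_grid", sort_by = "auc", decreasing = TRUE) # Models sorted best-first by validation AUC
    • BEST_MODEL <- h2o.getModel(GBM_GRID_SORTED@model_ids[[1]]) # Retrieve the top model
    • h2o.performance(BEST_MODEL, TEST_DATA) # Evaluate it on the held-out test frame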


    Exercise (Course 8: Hyperparameter Tuning in R)

    1. Model parameters vs. hyperparameters

    #Fit a linear model on the breast_cancer_data.
    • linear_model <- lm(concavity_mean ~ symmetry_mean, data = breast_cancer_data) #Look at the summary of the linear_model.
    • summary(linear_model) #Extract the coefficients.
    • linear_model$coefficients

    2. What are the coefficients?

    • library(ggplot2) #Plot linear relationship.
    • ggplot(data = breast_cancer_data, aes(x = symmetry_mean, y = concavity_mean)) + geom_point(color = "grey") + geom_abline(slope = linear_model$coefficients[2], intercept = linear_model$coefficients[1])

    3. Machine learning with caret

    #Create partition index
    • index <- createDataPartition(breast_cancer_data$diagnosis, p = 0.7, list = FALSE) #Subset breast_cancer_data with index
    • bc_train_data <- breast_cancer_data[index, ]
    • bc_test_data <- breast_cancer_data[-index, ] #Define 3x5 folds repeated cross-validation
    • fitControl <- trainControl(method = "repeatedcv", number = 5, repeats = 3) #Run the train() function
    • gbm_model <- train(diagnosis ~ ., data = bc_train_data, method = "gbm", trControl = fitControl, verbose = FALSE)

    4. Changing the number of hyperparameters to tune

    #Set seed.
    • set.seed(42) #Start timer.
    • tic() #Train model.
    • gbm_model <- train(diagnosis ~ ., data = bc_train_data, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3), verbose = FALSE, tuneLength = 4) #Stop timer.
    • toc()

    5. Tune hyperparameters manually

    #Define hyperparameter grid.
    • hyperparams <- expand.grid(n.trees = 200, interaction.depth = 1, shrinkage = 0.1, n.minobsinnode = 10)

      #Apply hyperparameter grid to train().

    • set.seed(42)

    • gbm_model <- train(diagnosis ~ ., data = bc_train_data, method = "gbm", trControl = trainControl(method = "repeatedcv", number = 5, repeats = 3), verbose = FALSE, tuneGrid = hyperparams)

    6. Cartesian grid search in caret

    #Define Cartesian grid
    • man_grid <- expand.grid(degree = c(1, 2, 3), scale = c(0.1, 0.01, 0.001), C = 0.5) #Start timer, set seed & train model
    • tic()
    • set.seed(42)
    • svm_model_voters_grid <- train(turnout16_2016 ~ ., data = voters_train_data, method = "svmPoly", trControl = fitControl, verbose= FALSE, tuneGrid = man_grid)
    • toc()

    7. Plot hyperparameter model output

    #Plot default
    • plot(svm_model_voters_grid) #Plot Kappa level-plot
    • plot(svm_model_voters_grid, metric = "Kappa", plotType = "level")

    8. Grid search with range of hyperparameters

    #Define the grid with hyperparameter ranges
    • big_grid <- expand.grid(size = seq(from = 1, to = 5, by = 1), decay = c(0, 1)) #Train control with grid search
    • fitControl <- trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "grid") #Train neural net
    • tic()
    • set.seed(42)
    • nn_model_voters_big_grid <- train(turnout16_2016 ~ ., data = voters_train_data, method = "nnet", trControl = fitControl, verbose = FALSE, tuneGrid = big_grid)
    • toc()
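
    For contrast (not part of the exercise), caret can also sample candidates at random instead of walking a fixed grid; a sketch under the same data assumptions, with nn_model_voters_random as a hypothetical name:
    • fitControl_random <- trainControl(method = "repeatedcv", number = 3, repeats = 5, search = "random")
    • set.seed(42)
    • nn_model_voters_random <- train(turnout16_2016 ~ ., data = voters_train_data, method = "nnet", trControl = fitControl_random, verbose = FALSE, tuneLength = 6) # tuneLength = number of random candidates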

    9. Adaptive Resampling with caret

    #Define trainControl function
    • fitControl <- trainControl(method = "adaptive_cv", number = 3, repeats = 3, adaptive = list(min = 3, alpha = 0.05, method = "BT", complete = FALSE), search = "random") #Start timer & train model
    • tic()
    • svm_model_voters_ar <- train(turnout16_2016 ~ ., data = voters_train_data, method = "nnet", trControl = fitControl, verbose = FALSE, tuneLength = 6)
    • toc()

    10. Modeling with mlr

    #Create classification task
    • task <- makeClassifTask(data = knowledge_train_data, target = "UNS") #Call the list of learners
    • listLearners() %>% as.data.frame() %>% select(class, short.name, package) %>% filter(grepl("classif.", class)) #Create learner
    • lrn <- makeLearner("classif.randomForest", predict.type = "prob", fix.factors.prediction = TRUE)

    11. Random search with mlr

    #Get the parameter set for neural networks of the nnet package
    • getParamSet("classif.nnet") #Define set of parameters
    • param_set <- makeParamSet( makeDiscreteParam("size", values = c(2,3,5)), makeNumericParam("decay", lower = 0.0001, upper = 0.1) ) #Print parameter set
    • print(param_set) #Define a random search tuning method.
    • ctrl_random <- makeTuneControlRandom()
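
    For comparison (not part of the exercise), mlr's grid-search counterpart to the random control is makeTuneControlGrid():
    • ctrl_grid <- makeTuneControlGrid() # full grid over discrete values; numeric params are discretised via the resolution argument (default 10)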

    12. Perform hyperparameter tuning with mlr

    #Define a random search tuning method.
    • ctrl_random <- makeTuneControlRandom(maxit = 6) #Define a 3 x 3 repeated cross-validation scheme
    • cross_val <- makeResampleDesc("RepCV", folds = 3, reps = 3) #Tune hyperparameters
    • tic()
    • lrn_tune <- tuneParams(lrn, task, resampling = cross_val, control = ctrl_random, par.set = param_set)
    • toc()
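
    A short follow-up (not part of the exercise): the TuneResult returned by tuneParams() stores the winning combination and its resampled performance:
    • lrn_tune$x # named list with the best hyperparameter values
    • lrn_tune$y # resampled performance of that combination (mmce by default)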

    13. Evaluating hyperparameter tuning results

    #Create holdout sampling
    • holdout <- makeResampleDesc("Holdout") #Perform tuning
    • lrn_tune <- tuneParams(learner = lrn, task = task, resampling = holdout, control = ctrl_random, par.set = param_set) #Generate hyperparameter effect data
    • hyperpar_effects <- generateHyperParsEffectData(lrn_tune, partial.dep = TRUE) #Plot hyperparameter effects (note: this plot call assumes a learner tuned over minsplit/maxdepth, e.g. rpart, not the nnet parameters above)
    • plotHyperParsEffect(hyperpar_effects, partial.dep.learn = "regr.glm", x = "minsplit", y = "mmce.test.mean", z = "maxdepth", plot.type = "line")

    14. Define aggregated measures

    #Create holdout sampling
    • holdout <- makeResampleDesc("Holdout", predict = "both") #Perform tuning
    • lrn_tune <- tuneParams(learner = lrn, task = task, resampling = holdout, control = ctrl_random, par.set = param_set, measures = list(acc, setAggregation(acc, train.mean), mmce, setAggregation(mmce, train.mean)))
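
    A quick interpretation note (not part of the exercise): with the aggregated measures above, comparing the train-set and test-set means is a simple overfitting check; the exact element names (assumed here to end in .train.mean / .test.mean) follow from the aggregations used:
    • lrn_tune$y # e.g. a large gap between acc.train.mean and acc.test.mean suggests overfitting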

    15. Setting hyperparameters

    #Set hyperparameters
    • lrn_best <- setHyperPars(lrn, par.vals = list(size = 1, maxit = 150, decay = 0)) #Train model
    • model_best <- train(lrn_best, task)
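
    A minimal sketch (not part of the exercise) of using the final model on new data; knowledge_test_data is a hypothetical hold-out set with the same columns as knowledge_train_data:
    • pred_best <- predict(model_best, newdata = knowledge_test_data) # hypothetical test data
    • performance(pred_best, measures = list(acc, mmce)) # accuracy and mean misclassification error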

    16. Prepare data for modelling with h2o

    #Initialise h2o cluster
    • h2o.init() #Convert data to h2o frame
    • seeds_train_data_hf <- as.h2o(seeds_train_data) #Identify target and features
    • y <- "seed_type"
    • x <- setdiff(colnames(seeds_train_data_hf), y) #Split data into train & validation sets
    • sframe <- h2o.splitFrame(seeds_train_data_hf, seed = 42)
    • train <- sframe[[1]]
    • valid <- sframe[[2]] #Calculate ratio of the target variable in the training set
    • summary(train$seed_type, exact_quantiles = TRUE)

    17. Modeling with h2o

    #Train random forest model
    • rf_model <- h2o.randomForest(x = x, y = y, training_frame = train, validation_frame = valid) #Calculate model performance
    • perf <- h2o.performance(rf_model, valid = TRUE) #Extract confusion matrix
    • h2o.confusionMatrix(perf) #Extract logloss
    • h2o.logloss(perf)
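
    An optional follow-up (not part of the exercise): h2o exposes variable importance for tree ensembles, which helps interpret the random forest:
    • h2o.varimp(rf_model) # importance table, most important feature first
    • h2o.varimp_plot(rf_model) # bar chart of the same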



    Resource center

    1. Machine learning in R https://topepo.github.io/caret/model-training-and-tuning.html#basic