Data Scientist Associate Case Study
Company Background
EMO is a manufacturer of motorcycles. The company successfully launched its first electric moped in India in 2019. The product team knows how valuable owner reviews are for making improvements to their mopeds.
Unfortunately, they often receive reviews from people who have never owned the moped. They do not want this feedback to influence product decisions, so they would like a way to identify reviews written by non-owners. They have obtained data on other mopeds for which it is known whether the reviewer owned the moped, and they consider this data equivalent to their own reviews.
Customer Question
Your manager has asked you to answer the following:
- Can you predict which reviews come from people who have never owned the moped before?
Dataset
The dataset contains reviews about other mopeds from a local website. The data you will use for this analysis can be accessed here: "data/moped.csv"
Column Name | Criteria |
---|---|
Used it for | Character, the purpose of the electric moped for the user, one of “Commuting”, “Leisure”. |
Owned for | Character, duration of ownership of the vehicle, one of “<= 6 months”, “> 6 months”, “Never Owned”. Rows that indicate ownership should be combined into the category “Owned”. |
Model name | Character, the name of the electric moped. |
Visual Appeal | Numeric, visual appeal rating (on a 5 point scale, replace missing values with 0). |
Reliability | Numeric, reliability rating (on a 5 point scale, replace missing values with 0). |
Extra Feature | Numeric, extra feature rating (on a 5 point scale, replace missing values with 0). |
Comfort | Numeric, comfort rating (on a 5 point scale, replace missing values with 0). |
Maintenance cost | Numeric, maintenance cost rating (on a 5 point scale, replace missing values with 0). |
Value for money | Numeric, value for money rating (on a 5 point scale, replace missing values with 0). |
Data Scientist Associate Case Study Submission
Use this template to complete your analysis and write up your summary for submission.
# Data Validation
# Check all variables in the data against the criteria in the dataset above
# Start coding here...
data <- read.csv("data/moped.csv")
head(data)
# Convert character variables to factors for a clearer view of the data
data$Used.it.for <- as.factor(data$Used.it.for)
data$Owned.for <- as.factor(data$Owned.for)
data$Model.Name <- as.factor(data$Model.Name)
# Create new variable indicating if the moped was owned
library(dplyr)
data <- data %>% mutate(Owned = as.factor(case_when(Owned.for %in% c('<= 6 months', '> 6 months') ~ "Yes",
                                                    TRUE ~ "No")))  # TRUE covers "Never owned" regardless of capitalisation
# Change NA values to 0
data <- data %>% mutate(across(4:9, ~ replace(., is.na(.), 0)))
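The two cleaning steps above can be checked on a toy example. A minimal base-R sketch, using hypothetical values that mirror the moped.csv columns:

```r
# Toy data frame mirroring the structure of moped.csv (hypothetical values)
toy <- data.frame(
  Owned.for     = c("<= 6 months", "> 6 months", "Never owned"),
  Visual.Appeal = c(5, NA, 3)
)

# Collapse the two ownership durations into a single "Yes" level
toy$Owned <- factor(ifelse(toy$Owned.for == "Never owned", "No", "Yes"))

# Replace missing ratings with 0, as the criteria require
toy$Visual.Appeal[is.na(toy$Visual.Appeal)] <- 0

toy
```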
Data Validation
Describe the validation tasks you completed and what you found. Have you made any changes to the data to enable further analysis? Remember to describe what you did for every column in the data.
Write your description here
The data contain the variables of interest. However, I made some changes to meet the criteria and to make the data easier to use in the analysis.
Changes by variable:
- Used.it.for: converted to a factor for better readability and easier use during the analysis.
- Owned.for: also converted to a factor, like 'Used.it.for'. In addition, I created a new variable, 'Owned', that records ownership status based on the duration of ownership ('Never owned' = 'No', all other answers = 'Yes').
- Model.Name: converted to a factor, like the first two variables.
- Visual.Appeal, Reliability, Extra.Features, Comfort, Maintenance.cost, Value.for.Money: replaced missing values with 0, as required by the criteria.
# Exploratory Analysis
# Explore the characteristics of the variables in the data
# Start coding here...
library(ggplot2)
summary(data)
data %>% ggplot(aes(Owned, fill = Used.it.for)) +
geom_bar() +
labs(title = 'The purpose of the electric moped for the user by owning status')
plot.visual <- data %>% ggplot(aes(Visual.Appeal, fill = Owned)) + geom_histogram(binwidth = 1)
plot.reliability <- data %>% ggplot(aes(Reliability, fill = Owned)) + geom_histogram(binwidth = 1)
# install.packages('gridExtra')  # install once if needed
library(gridExtra)
grid.arrange(plot.visual, plot.reliability)
plot.visual_maintenance <- data %>% ggplot(aes(Maintenance.cost, Visual.Appeal, color = Owned)) + geom_jitter()
plot.visual_extra <- data %>% ggplot(aes(Extra.Features, Visual.Appeal, color = Owned)) + geom_jitter()
plot.visual_comfort <- data %>% ggplot(aes(Comfort, Visual.Appeal, color = Owned)) + geom_jitter()
plot.visual_value <- data %>% ggplot(aes(Value.for.Money, Visual.Appeal, color = Owned)) + geom_jitter()
plot.model_reliability <- data %>% ggplot(aes(Used.it.for, Reliability, fill = Owned)) + geom_violin()
plot.model_reliability
grid.arrange(plot.visual_maintenance, plot.visual_extra, plot.visual_comfort, plot.visual_value, ncol = 2)
Exploratory Analysis
Describe what you found in the exploratory analysis. In your description you should:
- Reference at least two different data visualizations you created above to demonstrate the characteristics of variables
- Reference at least one data visualization you created above to demonstrate the relationship between two or more variables
- Describe what your exploratory analysis has told you about the data
- Describe any changes you have made to the data to enable modeling
Write your description here
Bar chart
The reported purpose of the moped differs between owners and non-owners: roughly three quarters of owners report using it for commuting, while non-owners are split about evenly between commuting and leisure.
Histograms
For both visual appeal and reliability, a rating of 5 is the most common and a rating of 2 the least common.
Violin chart
People who do not own the moped appear more likely to give a high reliability rating, while ratings from owners are spread more evenly across the scale.
Scatter plots
There appear to be positive relationships between visual appeal and the other rating variables: a high visual appeal rating tends to go together with high ratings elsewhere. However, the variables of interest contained many missing values (replaced with 0), so these visualizations may be somewhat distorted.
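The positive association seen in the scatter plots can also be quantified rather than judged by eye. A minimal sketch on hypothetical rating vectors (Spearman's rank correlation suits the ordinal 0-5 scale):

```r
# Toy rating vectors standing in for two of the six rating columns
visual <- c(5, 4, 5, 2, 1, 3)
value  <- c(5, 5, 4, 2, 2, 3)

# Spearman correlation works on ranks, so it is robust to the ordinal scale
cor(visual, value, method = "spearman")
```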
# Model Fitting
# Choose and fit a baseline model
# Choose and fit a comparison model
# Start coding here...
# Create train and test dataset
# install.packages('caret')  # install once if needed
library(caret)
set.seed(1)
#create ID column
data$id <- 1:nrow(data)
#use 70% of dataset as training set and 30% as test set
train_set <- data %>% sample_frac(0.70)
test_set <- anti_join(data, train_set, by = 'id')
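# Note: sample_frac draws rows uniformly, so the class balance of Owned can
# drift between the train and test sets. A stratified split avoids this; a
# minimal base-R sketch (the outcome vector y here is hypothetical):

```r
set.seed(1)
# Hypothetical outcome vector standing in for data$Owned (70% "Yes", 30% "No")
y <- factor(rep(c("Yes", "No"), times = c(70, 30)))

# Sample 70% within each class so both sets keep the original class balance
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(i) sample(i, floor(0.7 * length(i)))))
table(y[train_idx])
```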
######## Decision tree #####
# Define cross validation for all upcoming algorithms
control <- trainControl(method = "cv", number = 10)
########### Tuning decision tree with cp ####
#### Be careful: tuning may take some time to complete ####
train_tree <- train(Owned ~ . - Owned.for - id,
method = "rpart",
data = train_set,
tuneGrid = data.frame(cp = seq(0.01, 0.1, length.out = 25)),
trControl = control)
#### Fitting best value into the tree ####
fit_tree <- train(Owned ~ . - Owned.for - id,
method = "rpart",
data = train_set,
tuneGrid = train_tree$bestTune,
trControl = control)
### KNN
train_knn <- train(Owned ~ . - Owned.for - id, method = "knn",
data = train_set,
tuneGrid = data.frame(k = seq(1,30,1)),
trControl = control)
## Fitting in best K
fit_knn <- train(Owned ~ . -Owned.for - id, method = "knn",
data = train_set,
tuneGrid = train_knn$bestTune,
trControl = control)
### RF
# install.packages("randomForest")  # install once if needed
library(randomForest)
fit_rf <- randomForest(Owned ~ . - Owned.for - id, data = train_set, ntree = 100)
Model Fitting
Describe your approach to the model fitting. In your description you should:
- Describe what type of machine learning problem you are working on
- Describe which method you selected for the baseline model and explain why you chose this model
- Describe which method you selected for the comparison model and explain why you chose this model
Write your description here
I chose three models: a decision tree, k-nearest neighbours (kNN), and a random forest.
I started with a decision tree because it is the fastest to fit and the least demanding of computational power. A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label (the decision reached after evaluating the attributes along the path). The paths from the root to the leaves represent classification rules. I built the tree with complexity parameter (cp) tuning from 0.01 to 0.1; after tuning, the best value is used in the final model. I also defined a 10-fold cross-validation scheme that is reused later for kNN.
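The flowchart idea above can be shown in a minimal sketch on the built-in iris data, using the `rpart` package (which the `caret` call above wraps); cp plays the same pruning role as in the tuned model:

```r
library(rpart)

# Fit a small classification tree; each split is a test on one attribute
tree <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(cp = 0.01))

# Follow each flower down the tree to a leaf to get its predicted class
pred <- predict(tree, iris, type = "class")
mean(pred == iris$Species)  # training accuracy
```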
The comparison model was the k-nearest neighbours algorithm, chosen because it takes a completely different approach to classification. kNN is a non-parametric method used for classification and regression. It assumes that similar observations exist in close proximity: in other words, similar things are near each other. It uses feature similarity to predict new data points, meaning a new point is assigned a class based on how closely it matches the points in the training set. I tuned the model over a sequence of K values with cross-validation, and after finding the best K I used it in the final model.
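The "feature similarity" idea can be sketched with the `class` package on the built-in iris data; each test point is classified by a vote of its nearest training neighbours:

```r
library(class)

set.seed(1)
idx <- sample(nrow(iris), 100)  # 100 training rows, 50 test rows

# Each test flower gets the majority class of its 5 nearest training neighbours
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])  # test accuracy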
Finally, I fitted a random forest to compare against the single decision tree. A random forest is a classifier built from many decision trees: to classify a new instance, each tree casts a classification vote, and the most voted prediction is taken as the result.
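The "most voted prediction" step at the heart of a random forest can be sketched in base R (the vote vector here is hypothetical):

```r
# One vote per tree; the forest's prediction is the majority class
votes <- c("Yes", "No", "Yes", "Yes", "No")
majority <- names(which.max(table(votes)))
majority  # "Yes", since 3 of the 5 trees voted for it
```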
# Model Evaluation
# Choose a metric and evaluate the performance of the two models
# Start coding here...
acc_tree <- confusionMatrix(predict(fit_tree, test_set),
test_set$Owned)$overall["Accuracy"]
# Writing the results into table
acc_results <- tibble(method = "Decision tree",
Accuracy = acc_tree)
#### Making prediction inside the confusion matrix ######
acc_knn <- confusionMatrix(predict(fit_knn, test_set),
test_set$Owned)$overall["Accuracy"]
# Writing the results into table
acc_results <- bind_rows(acc_results,
tibble(method = "k-nearest neighbors algorithm ",
Accuracy = acc_knn))
### Making the random forest prediction inside the confusion matrix
acc_rf <- confusionMatrix(predict(fit_rf, test_set),
test_set$Owned)$overall["Accuracy"]
## Writing the results
acc_results <- bind_rows(acc_results,
tibble(method = "Random Forest",
Accuracy = acc_rf))
acc_results
Model Evaluation
Explain what the results of your evaluation tell you. You should:
- Describe which metric you have selected to compare the models and why
- Explain what the outcome of this evaluation tells you about the performance of your models
- Identify, based on the evaluation, which you would consider to be the better performing approach
Write your description here
I chose accuracy as the evaluation metric because it is the most easily understood: it gives the proportion of reviews in the test set whose ownership status was predicted correctly. Based on the results, all three algorithms predict ownership status with an accuracy of around 84%. kNN performs best, at 85.5% accuracy, so I would select this approach for identifying which reviews come from people who never owned a moped.
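As a reminder of what the metric measures, accuracy can be computed directly from a confusion matrix; a minimal sketch with hypothetical counts:

```r
# Hypothetical 2x2 confusion matrix (rows = predicted, columns = actual)
cm <- matrix(c(50, 8, 10, 52), nrow = 2,
             dimnames = list(pred = c("No", "Yes"), truth = c("No", "Yes")))

# Accuracy = correct predictions (the diagonal) / all predictions
sum(diag(cm)) / sum(cm)  # 102 / 120 = 0.85
```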
✅ When you have finished...
- Publish your Workspace using the option on the left
- Check the published version of your report:
- Can you see everything you want us to grade?
- Are all the graphics visible?
- Review grading rubric, have you included everything that will be graded?
- Head back to the Certification Dashboard to submit your case study