Data Scientist Associate Case Study
Company Background
EMO is a manufacturer of motorcycles. The company successfully launched its first electric moped in India in 2019. The product team knows how valuable owner reviews are for making improvements to their mopeds.
Unfortunately, they often receive reviews from people who have never owned the moped. They do not want this feedback to influence product decisions, so they would like a way to identify reviews written by non-owners. They have obtained data on other mopeds for which it is known whether the reviewer owned the moped, and they consider this data equivalent to their own reviews.
Customer Question
Your manager has asked you to answer the following:
- Can you predict which reviews come from people who have never owned the moped before?
Dataset
The dataset contains reviews about other mopeds from a local website. The data you will use for this analysis can be accessed here: "data/moped.csv"
Column Name | Criteria |
---|---|
Used it for | Character, the purpose of the electric moped for the user, one of “Commuting”, “Leisure”. |
Owned for | Character, duration of ownership of the vehicle, one of “<= 6 months”, “> 6 months”, “Never Owned”. Rows that indicate ownership should be combined into the category “Owned”. |
Model name | Character, the name of the electric moped. |
Visual Appeal | Numeric, visual appeal rating (on a 5 point scale, replace missing values with 0). |
Reliability | Numeric, reliability rating (on a 5 point scale, replace missing values with 0). |
Extra Feature | Numeric, extra feature rating (on a 5 point scale, replace missing values with 0). |
Comfort | Numeric, comfort rating (on a 5 point scale, replace missing values with 0). |
Maintenance cost | Numeric, maintenance cost rating (on a 5 point scale, replace missing values with 0). |
Value for money | Numeric, value for money rating (on a 5 point scale, replace missing values with 0). |
Data Scientist Associate Case Study Submission
Use this template to complete your analysis and write up your summary for submission.
# Data Validation
# Check all variables in the data against the criteria in the dataset above
# Start coding here...
data <- read.csv("data/moped.csv")
head(data)
# Convert character variables to factors for a clearer view of the data
data$Used.it.for <- as.factor(data$Used.it.for)
data$Owned.for <- as.factor(data$Owned.for)
data$Model.Name <- as.factor(data$Model.Name)
# Create new variable indicating if the moped was owned
library(dplyr)
data <- data %>% mutate(Owned = as.factor(case_when(Owned.for %in% c('<= 6 months', '> 6 months') ~ "Yes",
                                                    TRUE ~ "No")))  # TRUE covers "Never owned" regardless of capitalisation
# Change NA values to 0
data <- data %>% mutate(across(4:9, ~ replace(., is.na(.), 0)))
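The two cleaning steps above can be checked on a toy example. A minimal base-R sketch, using hypothetical values that mirror the moped.csv columns:

```r
# Toy data frame mirroring the structure of moped.csv (hypothetical values)
toy <- data.frame(
  Owned.for     = c("<= 6 months", "> 6 months", "Never owned"),
  Visual.Appeal = c(5, NA, 3)
)

# Collapse the two ownership durations into a single "Yes" level
toy$Owned <- factor(ifelse(toy$Owned.for == "Never owned", "No", "Yes"))

# Replace missing ratings with 0, as the criteria require
toy$Visual.Appeal[is.na(toy$Visual.Appeal)] <- 0

toy
```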
Data Validation
Describe the validation tasks you completed and what you found. Have you made any changes to the data to enable further analysis? Remember to describe what you did for every column in the data.
Write your description here
The data contain the variables of interest. However, I made some changes to meet the criteria and to make the data easier to use in the analysis.
Changes by variable:
- Used.it.for: converted to a factor for better readability and easier use during the analysis.
- Owned.for: also converted to a factor, like 'Used.it.for'. In addition, I created a new variable, 'Owned', that records ownership status based on the duration of ownership ('Never owned' = 'No', all other answers = 'Yes').
- Model.Name: converted to a factor, like the first two variables.
- Visual.Appeal, Reliability, Extra.Features, Comfort, Maintenance.cost, Value.for.Money: replaced missing values with 0, as required by the criteria.
# Exploratory Analysis
# Explore the characteristics of the variables in the data
# Start coding here...
library(ggplot2)
summary(data)
data %>% ggplot(aes(Owned, fill = Used.it.for)) +
geom_bar() +
labs(title = 'The purpose of the electric moped for the user by owning status')
plot.visual <- data %>% ggplot(aes(Visual.Appeal, fill = Owned)) + geom_histogram(binwidth = 1)
plot.reliability <- data %>% ggplot(aes(Reliability, fill = Owned)) + geom_histogram(binwidth = 1)
# install.packages('gridExtra')  # install once if needed
library(gridExtra)
grid.arrange(plot.visual, plot.reliability)
plot.visual_maintenance <- data %>% ggplot(aes(Maintenance.cost, Visual.Appeal, color = Owned)) + geom_jitter()
plot.visual_extra <- data %>% ggplot(aes(Extra.Features, Visual.Appeal, color = Owned)) + geom_jitter()
plot.visual_comfort <- data %>% ggplot(aes(Comfort, Visual.Appeal, color = Owned)) + geom_jitter()
plot.visual_value <- data %>% ggplot(aes(Value.for.Money, Visual.Appeal, color = Owned)) + geom_jitter()
plot.model_reliability <- data %>% ggplot(aes(Used.it.for, Reliability, fill = Owned)) + geom_violin()
plot.model_reliability
grid.arrange(plot.visual_maintenance, plot.visual_extra, plot.visual_comfort, plot.visual_value, ncol = 2)
Exploratory Analysis
Describe what you found in the exploratory analysis. In your description you should:
- Reference at least two different data visualizations you created above to demonstrate the characteristics of variables
- Reference at least one data visualization you created above to demonstrate the relationship between two or more variables
- Describe what your exploratory analysis has told you about the data
- Describe any changes you have made to the data to enable modeling
Write your description here
Bar chart
The reported purpose of the moped differs between owners and non-owners: roughly three quarters of owners report using it for commuting, while non-owners are split about evenly between commuting and leisure.
Histograms
For both visual appeal and reliability, a rating of 5 is the most common and a rating of 2 the least common.
Violin chart
People who do not own the moped appear more likely to give a high reliability rating, while ratings from owners are spread more evenly across the scale.
Scatter plots
There appear to be positive relationships between visual appeal and the other rating variables: a high visual appeal rating tends to go together with high ratings elsewhere. However, the variables of interest contained many missing values (replaced with 0), so these visualizations may be somewhat distorted.
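The positive association seen in the scatter plots can also be quantified rather than judged by eye. A minimal sketch on hypothetical rating vectors (Spearman's rank correlation suits the ordinal 0-5 scale):

```r
# Toy rating vectors standing in for two of the six rating columns
visual <- c(5, 4, 5, 2, 1, 3)
value  <- c(5, 5, 4, 2, 2, 3)

# Spearman correlation works on ranks, so it is robust to the ordinal scale
cor(visual, value, method = "spearman")
```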
# Model Fitting
# Choose and fit a baseline model
# Choose and fit a comparison model
# Start coding here...
# Create train and test dataset
# install.packages('caret')  # install once if needed
library(caret)
set.seed(1)
#create ID column
data$id <- 1:nrow(data)
#use 70% of dataset as training set and 30% as test set
train_set <- data %>% sample_frac(0.70)
test_set <- anti_join(data, train_set, by = 'id')
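# Note: sample_frac draws rows uniformly, so the class balance of Owned can
# drift between the train and test sets. A stratified split avoids this; a
# minimal base-R sketch (the outcome vector y here is hypothetical):

```r
set.seed(1)
# Hypothetical outcome vector standing in for data$Owned (70% "Yes", 30% "No")
y <- factor(rep(c("Yes", "No"), times = c(70, 30)))

# Sample 70% within each class so both sets keep the original class balance
train_idx <- unlist(lapply(split(seq_along(y), y),
                           function(i) sample(i, floor(0.7 * length(i)))))
table(y[train_idx])
```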
######## Decision tree #####
# Define cross validation for all upcoming algorithms
control <- trainControl(method = "cv", number = 10)
########### Tuning decision tree with cp ####
#### Be careful: tuning may take some time to complete ####
train_tree <- train(Owned ~ . - Owned.for - id,
method = "rpart",
data = train_set,
tuneGrid = data.frame(cp = seq(0.01, 0.1, length.out = 25)),
trControl = control)
#### Fitting best value into the tree ####
fit_tree <- train(Owned ~ . - Owned.for - id,
method = "rpart",
data = train_set,
tuneGrid = train_tree$bestTune,
trControl = control)
### KNN
train_knn <- train(Owned ~ . - Owned.for - id, method = "knn",
data = train_set,
tuneGrid = data.frame(k = seq(1,30,1)),
trControl = control)
## Fitting in best K
fit_knn <- train(Owned ~ . -Owned.for - id, method = "knn",
data = train_set,
tuneGrid = train_knn$bestTune,
trControl = control)
### RF
# install.packages("randomForest")  # install once if needed
library(randomForest)
fit_rf <- randomForest(Owned ~ . - Owned.for - id, data = train_set, ntree = 100)
Model Fitting
Describe your approach to the model fitting. In your description you should:
- Describe what type of machine learning problem you are working on
- Describe which method you selected for the baseline model and explain why you chose this model
- Describe which method you selected for the comparison model and explain why you chose this model
Write your description here
I chose three models: a decision tree, k-nearest neighbours (kNN), and a random forest.
I started with a decision tree because it is the fastest to fit and the least demanding of computational power. A decision tree is a flowchart-like structure in which each internal node represents a test on an attribute, each branch represents an outcome of that test, and each leaf node represents a class label (the decision reached after evaluating the attributes along the path). The paths from the root to the leaves represent classification rules. I built the tree with complexity parameter (cp) tuning from 0.01 to 0.1; after tuning, the best value is used in the final model. I also defined a 10-fold cross-validation scheme that is reused later for kNN.
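The flowchart idea above can be shown in a minimal sketch on the built-in iris data, using the `rpart` package (which the `caret` call above wraps); cp plays the same pruning role as in the tuned model:

```r
library(rpart)

# Fit a small classification tree; each split is a test on one attribute
tree <- rpart(Species ~ ., data = iris, method = "class",
              control = rpart.control(cp = 0.01))

# Follow each flower down the tree to a leaf to get its predicted class
pred <- predict(tree, iris, type = "class")
mean(pred == iris$Species)  # training accuracy
```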
The comparison model was the k-nearest neighbours algorithm, chosen because it takes a completely different approach to classification. kNN is a non-parametric method used for classification and regression. It assumes that similar observations exist in close proximity: in other words, similar things are near each other. It uses feature similarity to predict new data points, meaning a new point is assigned a class based on how closely it matches the points in the training set. I tuned the model over a sequence of K values with cross-validation, and after finding the best K I used it in the final model.
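The "feature similarity" idea can be sketched with the `class` package on the built-in iris data; each test point is classified by a vote of its nearest training neighbours:

```r
library(class)

set.seed(1)
idx <- sample(nrow(iris), 100)  # 100 training rows, 50 test rows

# Each test flower gets the majority class of its 5 nearest training neighbours
pred <- knn(train = iris[idx, 1:4], test = iris[-idx, 1:4],
            cl = iris$Species[idx], k = 5)
mean(pred == iris$Species[-idx])  # test accuracy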
Finally, I fitted a random forest to compare against the single decision tree. A random forest is a classifier built from many decision trees: to classify a new instance, each tree casts a classification vote, and the most voted prediction is taken as the result.
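The "most voted prediction" step at the heart of a random forest can be sketched in base R (the vote vector here is hypothetical):

```r
# One vote per tree; the forest's prediction is the majority class
votes <- c("Yes", "No", "Yes", "Yes", "No")
majority <- names(which.max(table(votes)))
majority  # "Yes", since 3 of the 5 trees voted for it
```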
# Model Evaluation
# Choose a metric and evaluate the performance of the two models
# Start coding here...
acc_tree <- confusionMatrix(predict(fit_tree, test_set),
test_set$Owned)$overall["Accuracy"]
# Writing the results into table
acc_results <- tibble(method = "Decision tree",
Accuracy = acc_tree)
#### Making prediction inside the confusion matrix ######
acc_knn <- confusionMatrix(predict(fit_knn, test_set),
test_set$Owned)$overall["Accuracy"]
# Writing the results into table
acc_results <- bind_rows(acc_results,
tibble(method = "k-nearest neighbors algorithm ",
Accuracy = acc_knn))
### Making the random forest prediction inside the confusion matrix
acc_rf <- confusionMatrix(predict(fit_rf, test_set),
test_set$Owned)$overall["Accuracy"]
## Writing the results
acc_results <- bind_rows(acc_results,
tibble(method = "Random Forest",
Accuracy = acc_rf))
acc_results
Model Evaluation
Explain what the results of your evaluation tell you. You should:
- Describe which metric you have selected to compare the models and why
- Explain what the outcome of this evaluation tells you about the performance of your models
- Identify, based on the evaluation, which you would consider to be the better performing approach
Write your description here
I chose accuracy as the evaluation metric because it is the most easily understood: it gives the proportion of reviews in the test set whose ownership status was predicted correctly. Based on the results, all three algorithms predict ownership status with an accuracy of around 84%. kNN performs best, at 85.5% accuracy, so I would select this approach for identifying which reviews come from people who never owned a moped.
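As a reminder of what the metric measures, accuracy can be computed directly from a confusion matrix; a minimal sketch with hypothetical counts:

```r
# Hypothetical 2x2 confusion matrix (rows = predicted, columns = actual)
cm <- matrix(c(50, 8, 10, 52), nrow = 2,
             dimnames = list(pred = c("No", "Yes"), truth = c("No", "Yes")))

# Accuracy = correct predictions (the diagonal) / all predictions
sum(diag(cm)) / sum(cm)  # 102 / 120 = 0.85
```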
✅ When you have finished...
- Publish your Workspace using the option on the left
- Check the published version of your report:
- Can you see everything you want us to grade?
- Are all the graphics visible?
- Review grading rubric, have you included everything that will be graded?
- Head back to the Certification Dashboard to submit your case study