
    Data Scientist Associate Case Study

    Company Background

    EMO is a manufacturer of motorcycles. The company successfully launched its first electric moped in India in 2019. The product team knows how valuable owner reviews are in making improvements to their mopeds.

    Unfortunately, they often receive reviews from people who have never owned the moped. They do not want to consider this feedback, so they would like a way to identify reviews from these people. They have obtained data on other mopeds where they know whether or not the reviewer owned the moped, and they believe this data is equivalent to their own reviews.

    Customer Question

    Your manager has asked you to answer the following:

    • Can you predict which reviews come from people who have never owned the moped before?

    Dataset

    The dataset contains reviews about other mopeds from a local website. The data you will use for this analysis can be accessed here: "data/moped.csv"

    Column Name       Criteria
    Used it for       Character, the purpose of the electric moped for the user, one of "Commuting", "Leisure".
    Owned for         Character, duration of ownership of the vehicle, one of "<= 6 months", "> 6 months", "Never Owned". Rows that indicate ownership should be combined into the category "Owned".
    Model name        Character, the name of the electric moped.
    Visual Appeal     Numeric, visual appeal rating (on a 5-point scale; replace missing values with 0).
    Reliability       Numeric, reliability rating (on a 5-point scale; replace missing values with 0).
    Extra Feature     Numeric, extra feature rating (on a 5-point scale; replace missing values with 0).
    Comfort           Numeric, comfort rating (on a 5-point scale; replace missing values with 0).
    Maintenance cost  Numeric, maintenance cost rating (on a 5-point scale; replace missing values with 0).
    Value for money   Numeric, value for money rating (on a 5-point scale; replace missing values with 0).

    Data Scientist Associate Case Study Submission

    Use this template to complete your analysis and write up your summary for submission.

    # Data Validation
    # Check all variables in the data against the criteria in the dataset above
    
    # Start coding here... 
    
    data <- read.csv("data/moped.csv")
    head(data)
    # Convert character variables to factors for easier summaries and modelling
    data$Used.it.for <- as.factor(data$Used.it.for)
    data$Owned.for <- as.factor(data$Owned.for)
    data$Model.Name <- as.factor(data$Model.Name)
    
    # Create a new variable indicating whether the moped was owned;
    # testing the ownership levels directly avoids depending on the
    # exact capitalisation of "Never Owned" in the raw data
    library(dplyr)
    data <- data %>%
      mutate(Owned = as.factor(if_else(Owned.for %in% c("<= 6 months", "> 6 months"),
                                       "Yes", "No")))
    # Replace missing rating values (columns 4-9) with 0, per the criteria
    data <- data %>% mutate(across(4:9, ~replace(., is.na(.), 0)))
    

    Data Validation

    Describe the validation tasks you completed and what you found. Have you made any changes to the data to enable further analysis? Remember to describe what you did for every column in the data.

    Write your description here

    The data contain the variables of interest. However, I made some changes to satisfy the criteria and make the data easier to use for analysis.

    Changes by variable:

    Used.it.for <- Converted to a factor for better readability and use during the analysis.

    Owned.for <- Also converted to a factor, like 'Used.it.for'. In addition, I created a new variable, 'Owned', that records ownership status based on the duration of ownership: "Never Owned" maps to 'No' and all other answers to 'Yes'.

    Model.Name <- Converted to a factor, like the first two variables.

    Visual.Appeal, Reliability, Extra.Features, Comfort, Maintenance.cost, Value.for.Money <- In these variables I replaced missing values with 0.
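    The per-column checks can be sketched in base R on a toy data frame (the column names are assumed to match the cleaned dataset; the values are made up for illustration):

```r
# Toy data frame standing in for the real moped data
toy <- data.frame(
  Owned.for     = c("<= 6 months", "> 6 months", "Never Owned"),
  Visual.Appeal = c(5, NA, 3)
)
# Replace missing ratings with 0, as the criteria require
toy$Visual.Appeal[is.na(toy$Visual.Appeal)] <- 0
# Validate: ratings must lie in 0-5, ownership must be a known level
stopifnot(all(toy$Visual.Appeal >= 0 & toy$Visual.Appeal <= 5))
stopifnot(all(toy$Owned.for %in% c("<= 6 months", "> 6 months", "Never Owned")))
```

    Running the same `stopifnot()` checks against each column of the real data makes any criteria violation fail loudly instead of passing silently.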

    # Exploratory Analysis
    # Explore the characteristics of the variables in the data
    
    # Start coding here... 
    library(ggplot2)
    summary(data)
    
    data %>% ggplot(aes(Owned, fill = Used.it.for)) +
      geom_bar() +
      labs(title = 'Purpose of the electric moped by ownership status')
    
    
    # Ratings are on a 0-5 scale, so a binwidth of 1 gives one bar per rating
    plot.visual <- data %>% ggplot(aes(Visual.Appeal, fill = Owned)) + geom_histogram(binwidth = 1)
    plot.reliability <- data %>% ggplot(aes(Reliability, fill = Owned)) + geom_histogram(binwidth = 1)
    # install.packages('gridExtra')  # uncomment on first run
    library(gridExtra)
    grid.arrange(plot.visual, plot.reliability)
    
    
    plot.visual_maintenance <- data %>% ggplot(aes(Maintenance.cost, Visual.Appeal, color = Owned)) + geom_jitter()
    plot.visual_features <- data %>% ggplot(aes(Extra.Features, Visual.Appeal, color = Owned)) + geom_jitter()
    plot.visual_comfort <- data %>% ggplot(aes(Comfort, Visual.Appeal, color = Owned)) + geom_jitter()
    plot.visual_value <- data %>% ggplot(aes(Value.for.Money, Visual.Appeal, color = Owned)) + geom_jitter()
    plot.usage_reliability <- data %>% ggplot(aes(Used.it.for, Reliability, fill = Owned)) + geom_violin()
    plot.usage_reliability
    
    grid.arrange(plot.visual_maintenance, plot.visual_features, plot.visual_comfort, plot.visual_value, ncol = 2)

    Exploratory Analysis

    Describe what you found in the exploratory analysis. In your description you should:

    • Reference at least two different data visualizations you created above to demonstrate the characteristics of variables
    • Reference at least one data visualization you created above to demonstrate the relationship between two or more variables
    • Describe what your exploratory analysis has told you about the data
    • Describe any changes you have made to the data to enable modeling

    Write your description here

    Bar chart: The reported purpose of the moped differs between people who own it and people who do not. Owners report commuting in roughly three quarters of cases, while non-owners split about fifty-fifty between commuting and leisure.

    Histograms: The distributions of reliability and visual appeal are both dominated by ratings of 5, with 2 the least common rating.

    Violin chart: People who do not own the moped tend to give higher reliability ratings, while ratings from owners are spread more widely across the scale.

    Scatter plots: There are visible patterns in the relationship between visual appeal and the other rating variables: a high visual appeal rating tends to go with high ratings elsewhere, suggesting a positive correlation. However, the many missing values (recoded to 0) in these variables may distort the picture somewhat.
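    One way to gauge how much the recoded zeros distort the relationship is to compare correlations with and without them. A sketch on toy vectors (hypothetical ratings, not the actual data):

```r
# Hypothetical ratings where 0 stands for a recoded missing value
visual  <- c(5, 4, 5, 0, 3, 5)
comfort <- c(4, 4, 5, 0, 2, 5)

cor_all     <- cor(visual, comfort)              # zeros included
keep        <- visual > 0 & comfort > 0
cor_nonzero <- cor(visual[keep], comfort[keep])  # former missings dropped

# Matched zeros inflate the correlation; comparing the two values
# shows how sensitive the relationship is to the recoding
c(cor_all, cor_nonzero)
```

    Running the same comparison on the real rating columns would show whether the positive correlation holds once the recoded missing values are set aside.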

    # Model Fitting
    # Choose and fit a baseline model
    # Choose and fit a comparison model
    
    # Start coding here... 
    # Create train and test dataset
    # install.packages('caret')  # uncomment on first run
    library(caret)
    
    
    set.seed(1)
    
    #create ID column
    data$id <- 1:nrow(data)
    
    #use 70% of dataset as training set and 30% as test set 
    train_set <- data %>% sample_frac(0.70)
    test_set  <- anti_join(data, train_set, by = 'id')
    
    ######## Decision tree ########
    # Define 10-fold cross-validation, reused for all upcoming algorithms
    control <- trainControl(method = "cv", number = 10)
    
    ######## Tuning the decision tree's complexity parameter (cp) ########
    # Be careful: tuning may take some time to complete
    
    
    
    train_tree <- train(Owned ~ . - Owned.for - id,
                        method = "rpart",
                        data = train_set,
                        tuneGrid = data.frame(cp = seq(0.01, 0.1, length.out = 25)),
                        trControl = control)
    
    #### Refit the tree with the best cp value found above ####
    fit_tree <- train(Owned ~ . - Owned.for - id,
                      method = "rpart",
                      data = train_set,
                      tuneGrid = train_tree$bestTune,
                      trControl = control)
    
    
    
    
    ### KNN
    train_knn <- train(Owned ~ . - Owned.for - id, method = "knn", 
                       data = train_set,  
                       tuneGrid = data.frame(k = seq(1,30,1)),
                       trControl = control)
    
    ## Refit kNN with the best k found above
    fit_knn <- train(Owned ~ . - Owned.for - id, method = "knn",
                     data = train_set,
                     tuneGrid = train_knn$bestTune,
                     trControl = control)
    
    
    
    ### RF
    # install.packages("randomForest")  # uncomment on first run
    library(randomForest)
    
    fit_rf <- randomForest(Owned ~ . - Owned.for - id, data = train_set, ntree = 100)
    
    
    
     

    Model Fitting

    Describe your approach to the model fitting. In your description you should:

    • Describe what type of machine learning problem you are working on
    • Describe which method you selected for the baseline model and explain why you chose this model
    • Describe which method you selected for the comparison model and explain why you chose this model

    Write your description here

    This is a supervised binary classification problem: predicting whether or not a review comes from someone who owned the moped. I have chosen three models: a decision tree, k-nearest neighbours (kNN), and a random forest.

    I started with a decision tree because it is fast to fit and computationally light. A decision tree is a flowchart-like structure in which each internal node tests an attribute, each branch represents an outcome of that test, and each leaf node holds a class label; the paths from root to leaf represent classification rules. I tuned the complexity parameter (cp) over values from 0.01 to 0.1 and refit the model with the best value found. I also defined 10-fold cross-validation, which is reused later for kNN.
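    The flowchart idea can be sketched as a single hand-written split in base R (a "stump" with a hypothetical threshold, not the fitted tree):

```r
# One decision-tree split: test an attribute against a threshold,
# then emit a class label at the leaf (threshold is made up)
predict_stump <- function(reliability) {
  if (reliability >= 4) "Yes" else "No"
}

predict_stump(5)  # "Yes"
predict_stump(2)  # "No"
```

    A fitted tree chains many such tests, and cp controls how many splits are worth keeping.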

    The comparison model is the k-nearest neighbours (kNN) algorithm, chosen because it takes a completely different approach to classification. kNN is a non-parametric method used for classification and regression. It assumes that similar observations lie close together and uses feature similarity to classify new data points: a new point is assigned the label that is most common among its k closest points in the training set. I tuned k over a sequence of values with cross-validation and refit the model with the best k.
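    The nearest-neighbour idea fits in a few lines of base R. This sketch uses toy one-dimensional values and k = 1; the real model uses k neighbours across all rating variables:

```r
# Toy training data: one feature, two labels (values are made up)
train_x <- c(1, 2, 8, 9)
train_y <- c("No", "No", "Yes", "Yes")

# Classify a new point by copying the label of its nearest neighbour
new_x   <- 7.5
nearest <- which.min(abs(train_x - new_x))
pred    <- train_y[nearest]  # "Yes"
```

    With k > 1, the prediction becomes a majority vote over the k smallest distances instead of a single nearest point.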

    Finally, I used a random forest to compare against the single decision tree. A random forest is an ensemble classifier built from many decision trees: to classify a new instance, each tree votes for a class, and the most-voted class is returned as the prediction.
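    The majority vote at the heart of a random forest can be sketched with hypothetical per-tree predictions (toy labels, not the fitted forest):

```r
# Hypothetical class votes from five trees in the ensemble
votes <- c("Yes", "No", "Yes", "Yes", "No")

# The forest's prediction is the most frequent vote
tab        <- table(votes)
prediction <- names(tab)[which.max(tab)]  # "Yes"
```

    Averaging over many trees grown on bootstrapped samples is what makes the forest more stable than any single tree.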

    # Model Evaluation
    # Choose a metric and evaluate the performance of the two models
    
    # Start coding here...
    
    acc_tree <- confusionMatrix(predict(fit_tree, test_set),
                                test_set$Owned)$overall["Accuracy"]
    
    # Write the decision-tree result into the results table
    acc_results <- tibble(method = "Decision tree",
                          Accuracy = acc_tree)
    
    #### Evaluate kNN on the test set ####
    acc_knn <- confusionMatrix(predict(fit_knn, test_set),
                               test_set$Owned)$overall["Accuracy"]
    
    # Append the kNN result to the results table
    acc_results <- bind_rows(acc_results,
                             tibble(method = "k-nearest neighbours",
                                    Accuracy = acc_knn))
    
    ### Evaluate the random forest on the test set
    acc_rf <- confusionMatrix(predict(fit_rf, test_set),
                              test_set$Owned)$overall["Accuracy"]
    
    ## Writing the results
    acc_results <- bind_rows(acc_results, 
                             tibble(method = "Random Forest", 
                                    Accuracy = acc_rf))
    acc_results

    Model Evaluation

    Explain what the results of your evaluation tell you. You should:

    • Describe which metric you have selected to compare the models and why
    • Explain what the outcome of this evaluation tells you about the performance of your models
    • Identify, based on the evaluation, which you would consider to be the better performing approach

    Write your description here

    I chose accuracy as the evaluation metric because it is the easiest to interpret: it is the proportion of reviews in the test set whose ownership status is predicted correctly. Based on the results, all three algorithms predict ownership status with an accuracy of around 84%. kNN performs best, at 85.5% accuracy, so I would select this approach for identifying reviews written by people who have never owned the moped.
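    The metric itself is simple to compute by hand; a sketch with hypothetical predicted and true labels:

```r
# Hypothetical predictions and true ownership labels
pred  <- c("Yes", "Yes", "No", "Yes", "No")
truth <- c("Yes", "No",  "No", "Yes", "No")

# Accuracy = share of positions where prediction matches truth
accuracy <- mean(pred == truth)  # 0.8
```

    This is the same "Accuracy" entry that `confusionMatrix()` reports, alongside per-class metrics that would matter more if the owned/not-owned classes were heavily imbalanced.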

    ✅ When you have finished...

    • Publish your Workspace using the option on the left
    • Check the published version of your report:
      • Can you see everything you want us to grade?
      • Are all the graphics visible?
    • Review grading rubric, have you included everything that will be graded?
    • Head back to the Certification Dashboard to submit your case study