Data Scientist Associate

Example Practical Exam Solution

You can find the project information that accompanies this example solution in the resource center, Practical Exam Resources.

Use this template to complete your analysis and write up your summary for submission.

Data Validation

Before cleaning, the dataset contains 200 rows and 9 columns, and some columns have missing values. I have validated all the columns against the criteria in the dataset table:

  • Region: Same as description without missing values, 10 Regions.
  • Place name: Same as description without missing values.
  • Rating: 2 missing values, so I removed the rows with a missing rating.
  • Enough Reviews: Same as description without missing values.
  • Price: Same as description without missing values, 3 categories.
  • Delivery option: Same as description without missing values.
  • Dine in option: 50+ missing values, so I replaced the missing values with 'False' and converted the column to the boolean data type.
  • Take out option: 50+ missing values, so I replaced the missing values with 'False' and converted the column to the boolean data type.

After the data validation, the dataset contains 198 rows and 9 columns.
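The checks summarised above can be reproduced with a few pandas calls. The sketch below is only an illustration: it uses a small made-up frame (the values are invented, not taken from coffee.csv) to show how the row count, the missing values per column, and the number of categories could be obtained.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the coffee data; all values are invented for illustration.
toy = pd.DataFrame({
    "Region": ["A", "B", "A", "C"],
    "Rating": [4.5, np.nan, 4.8, 3.9],
    "Price": ["$", "$$", "$$", "$$$"],
    "Dine in option": [True, np.nan, np.nan, True],
})

n_rows, n_cols = toy.shape             # dataset size
missing_per_column = toy.isna().sum()  # missing values in each column
n_regions = toy["Region"].nunique()    # number of distinct regions
n_prices = toy["Price"].nunique()      # number of price categories

print(n_rows, n_cols)
print(missing_per_column)
```

Running the same calls on the real `df` before and after cleaning gives the figures reported in this section.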

Original Dataset

# Data Validation
# Check all variables in the data against the criteria in the dataset above

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.style as style  # assumed module; the exported line read "import as style"
import seaborn as sns
import numpy as np
df = pd.read_csv('data/coffee.csv')

After removing the rows with missing values in the Rating column

df = df.dropna(subset=['Rating'])

After replacing missing values with 'False'

df['Dine in option'] = df['Dine in option'].fillna(False)
df['Takeout option'] = df['Takeout option'].fillna(False)
df['Dine in option'] = df['Dine in option'].astype('bool')
df['Takeout option'] = df['Takeout option'].astype('bool')
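The order of the two steps above matters. As a small sketch on a hypothetical column (not the real data): converting to bool before filling would turn each missing entry into True, because NaN is truthy, whereas filling first yields False for the missing entries.

```python
import numpy as np
import pandas as pd

# Hypothetical column with one missing entry, mimicking 'Dine in option'.
option = pd.Series([True, np.nan, True], dtype=object)

# Converting first: NaN is truthy, so the missing entry silently becomes True.
converted_first = option.astype(bool)

# Filling first, as in the cells above: the missing entry becomes False.
filled_first = option.fillna(False).astype(bool)

print(converted_first.tolist())
print(filled_first.tolist())
```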

Exploratory Analysis

Graph 1, the distribution of Rating, shows some outliers. Because the dataset is small, I decided not to remove them at this point. The Enough Reviews variable is the target variable. Graph 2 shows that the minority class (True, meaning the shop has enough reviews) makes up about 37% of the data. The class imbalance is therefore mild, so I decided not to address it at this point.
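As a rough sketch of how these two observations could be quantified, the example below computes the class shares of a target column and flags rating outliers with the common 1.5 × IQR rule. The data are invented, and the IQR rule is an assumption here; the source does not say how the outliers in Graph 1 were identified.

```python
import pandas as pd

# Illustrative target column; the real proportions come from the coffee data.
target = pd.Series([True, False, False, True, False, False, False, True])

# Share of each class; the smaller share is the minority-class proportion.
class_share = target.value_counts(normalize=True)
minority_share = class_share.min()

# Illustrative ratings with one unusually low value.
ratings = pd.Series([4.6, 4.7, 4.8, 4.5, 4.9, 4.7, 3.9])

# 1.5 * IQR rule (assumed): values far outside the middle 50% are flagged.
q1, q3 = ratings.quantile(0.25), ratings.quantile(0.75)
iqr = q3 - q1
outliers = ratings[(ratings < q1 - 1.5 * iqr) | (ratings > q3 + 1.5 * iqr)]

print(class_share)
print(outliers.tolist())
```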

The relationship between Rating and the target variable (Graph 3) suggests that Rating may be a good indicator for prediction. Similarly, the comparison across the service options (Graphs 4-6) suggests that the Dine in and Take out options are good indicators for prediction.

Finally, to enable model fitting, I have made the following changes:

  • Removed the Place name column, because its values are unique to each shop and so it is not useful as a feature.
  • Converted all the categorical variables into numeric variables.

Inspecting the Rating and Enough Reviews variables

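The code for this step is not shown in the source. A minimal sketch of how Graph 1 and Graph 2 might be drawn is given below; the sample data, the plot types, and the graph titles are all assumptions rather than the author's original cell.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented sample; the real cell would use df loaded from coffee.csv.
sample = pd.DataFrame({
    "Rating": [4.6, 4.7, 4.8, 4.5, 4.9, 3.9],
    "Enough Reviews": [True, False, False, True, False, False],
})

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Graph 1 (assumed form): histogram of the ratings.
ax1.hist(sample["Rating"], bins=5)
ax1.set(title="Graph 1 Distribution of Rating", xlabel="Rating", ylabel="Count")

# Graph 2 (assumed form): number of shops in each target class.
counts = sample["Enough Reviews"].value_counts().sort_index()
ax2.bar(counts.index.astype(str), counts.values)
ax2.set(title="Graph 2 Enough Reviews", xlabel="Enough Reviews", ylabel="Count")

fig.tight_layout()
```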

Inspecting the number of categories in categorical variables

#Inspect the categorical variables
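Only the comment for this cell survives in the source. One way such an inspection could look, sketched on invented rows that reuse the notebook's column names, is to count the distinct categories per column and the frequency of each category.

```python
import pandas as pd

# Invented rows; the column names follow the ones used in this notebook.
sample = pd.DataFrame({
    "Region": ["A", "B", "A", "C"],
    "Place type": ["Coffee shop", "Cafe", "Coffee shop", "Coffee shop"],
    "Price": ["$", "$$", "$$", "$$$"],
})

# Number of distinct categories in each categorical column.
category_counts = sample[["Region", "Place type", "Price"]].nunique()
print(category_counts)

# Frequency of each category within a single column.
price_freq = sample["Price"].value_counts()
print(price_freq)
```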

Inspecting the Relationships between Ratings, Options and Target Variable (Enough Reviews)

sns.boxplot(data=df, x='Enough Reviews',y='Rating',linewidth=0.8).set(title='Graph 3 Ratings vs Enough Reviews');
sns.boxplot(data=df, x='Enough Reviews',y='Rating',hue='Delivery option',linewidth=0.8).set(title='Graph 4 Ratings vs Enough Reviews in Delivery Option');
sns.boxplot(data=df, x='Enough Reviews',y='Rating',hue='Dine in option',linewidth=0.8).set(title='Graph 5 Ratings vs Enough Reviews in Dine in Option');
sns.boxplot(data=df, x='Enough Reviews',y='Rating',hue='Takeout option',linewidth=0.8).set(title='Graph 6 Ratings vs Enough Reviews in Takeout Option');

Make changes to enable modeling

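# Drop the Place name column; its values are unique per shop
df = df.drop(columns=['Place name'])
from sklearn import preprocessing
# Separate the features from the target
features = df.drop(columns='Enough Reviews')
# One-hot encode the categorical and boolean columns
X = pd.get_dummies(features, columns=['Place type','Price','Region','Delivery option','Dine in option','Takeout option'])
# Encode the target as 1 (enough reviews) or 0 (not enough reviews)
df['Enough Reviews'] = df['Enough Reviews'].replace({True:1,False:0})
y = df['Enough Reviews']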
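To make the effect of `pd.get_dummies` concrete, here is a small sketch on a made-up frame (the values are illustrative only). Each listed column is replaced by one indicator column per category, while columns that are not listed, such as Rating, pass through unchanged.

```python
import pandas as pd

# Made-up frame with one numeric, one categorical, and one boolean column.
toy = pd.DataFrame({
    "Rating": [4.6, 4.8, 4.1],
    "Price": ["$", "$$", "$"],
    "Delivery option": [True, False, True],
})

# One indicator column is created per category of each listed column.
encoded = pd.get_dummies(toy, columns=["Price", "Delivery option"])
print(encoded.columns.tolist())
```

Encoding a boolean column this way produces two complementary indicator columns, one for False and one for True.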

Model Fitting

Predicting whether a newly opened coffee shop can get over 450 reviews is a binary classification problem in machine learning. I am choosing the Logistic Regression model as the baseline model because it is efficient to train and easy to interpret. The comparison model I am choosing is the Decision Tree model because it works well with mixed data types and is less influenced by outliers.
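As a rough, self-contained sketch of this baseline-versus-comparison setup, the example below fits both model types on synthetic data. The data set, the 70/30 split, and the hyperparameters (`max_iter`, `max_depth`, `random_state`) are assumptions chosen for illustration, not the settings used in this solution.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded coffee features (198 rows, as after cleaning).
X_demo, y_demo = make_classification(n_samples=198, n_features=8, random_state=42)

# Stratify so both splits keep the same class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=42
)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=3, random_state=42),
}

# Fit each model on the training split and score it on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(name, round(scores[name], 3))
```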

# Import ML models and performance metrics
from sklearn.linear_model import LogisticRegression 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.model_selection import train_test_split 
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix