Sample Data Scientist Associate Solution (copy)

    Data Scientist Associate

    Example Practical Exam Solution

    You can find the project information that accompanies this example solution in the resource center, Practical Exam Resources.

    Use this template to complete your analysis and write up your summary for submission.

    Task 1

    The dataset contains 200 rows and 9 columns with missing values before cleaning. I have validated all the columns against the criteria in the dataset table:

    • Region: Matches the description, no missing values, 10 regions.
    • Place name: Matches the description, no missing values.
    • Rating: 2 missing values, replaced with 0.
    • Reviews: 2 missing values, replaced with the overall median number of reviews.
    • Price: Matches the description, no missing values, 3 categories.
    • Delivery option: Matches the description, no missing values.
    • Dine in option: 50+ missing values, replaced with False and converted to boolean data type.
    • Takeout option: 50+ missing values, replaced with False and converted to boolean data type.

    After the data validation, the dataset contains 200 rows and 9 columns.

    Original Dataset

    # Data Validation
    # Check all variables in the data against the criteria in the dataset above
    
    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib.style as style
    import seaborn as sns
    import numpy as np
    df = pd.read_csv('data/coffee.csv')
    df.info()

    Validate the categorical variables

    cat = ['Region','Place type','Price','Delivery option','Dine in option','Takeout option']
    for column in cat:
      print(df[column].value_counts())
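
    The write-up states that Region has 10 values and Price has 3 categories. As a minimal sketch of how those counts could be checked programmatically rather than by reading the value_counts() output, the helper below compares the number of distinct values per column with an expected number. The function name check_unique_counts and the toy data are hypothetical and are not part of the original solution or of coffee.csv.

```python
import pandas as pd

def check_unique_counts(df, expected):
    """Return {column: (observed, expected)} for every column whose number
    of distinct non-missing values differs from the expected count."""
    mismatches = {}
    for column, n_expected in expected.items():
        n_observed = df[column].nunique()  # nunique() ignores NaN by default
        if n_observed != n_expected:
            mismatches[column] = (n_observed, n_expected)
    return mismatches

# Toy stand-in for the real data (hypothetical values, not from coffee.csv).
toy = pd.DataFrame({
    "Region": [f"Region {i}" for i in range(1, 11)],
    "Price": ["$", "$$", "$$$", "$$", "$", "$$", "$$$", "$$", "$", "$$"],
})

# An empty dict means every column has the expected number of categories.
result = check_unique_counts(toy, {"Region": 10, "Price": 3})
print(result)
```

    On the real data, the same call would be made with df in place of toy.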

    Validate the numerical variables

    df.describe()

    Check the missing values in the columns

    df.isna().sum()
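
    The raw counts from isna().sum() can be easier to judge next to the share of rows they represent. The following is a small sketch, assuming a helper named missing_summary (hypothetical, not part of the original solution) and a toy frame with made-up values in place of coffee.csv.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Count and percentage of missing values per column, worst first."""
    counts = df.isna().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    return summary.sort_values("missing", ascending=False)

# Toy stand-in for the real data (hypothetical values, not from coffee.csv).
toy = pd.DataFrame({
    "Rating": [4.6, np.nan, 4.7, 4.5],
    "Reviews": [120.0, 85.0, np.nan, np.nan],
    "Price": ["$$", "$", "$$", "$$$"],
})

summary = missing_summary(toy)
print(summary)
```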

    Clean the Rating and Reviews columns

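    # Missing ratings are set to 0, as described in Task 1
    df['Rating'] = df['Rating'].fillna(0)
    # Missing review counts are filled with the median of the observed values
    median = df['Reviews'].median()
    df['Reviews'] = df['Reviews'].fillna(median)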
    df.info()

    Clean the Dine in option and Takeout option columns

    df['Dine in option'] = df['Dine in option'].fillna(False)
    df['Takeout option'] = df['Takeout option'].fillna(False)
    df['Dine in option'] = df['Dine in option'].astype('bool')
    df['Takeout option'] = df['Takeout option'].astype('bool')
    df.info()
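
    Task 1 claims that the cleaned dataset keeps its shape and has no missing values left. One way to confirm such claims is with explicit assertions instead of reading the df.info() output. The sketch below wraps the cleaning steps above in a hypothetical function named clean and runs it on a toy frame with made-up values, so it is an illustration under those assumptions, not the original solution's code.

```python
import numpy as np
import pandas as pd

def clean(df):
    """Apply the cleaning steps described in Task 1 to a copy of df."""
    out = df.copy()
    out["Rating"] = out["Rating"].fillna(0)
    out["Reviews"] = out["Reviews"].fillna(out["Reviews"].median())
    for column in ["Dine in option", "Takeout option"]:
        out[column] = out[column].fillna(False).astype("bool")
    return out

# Toy stand-in for the real data (hypothetical values, not from coffee.csv).
toy = pd.DataFrame({
    "Rating": [4.6, np.nan, 4.7, 4.5],
    "Reviews": [120.0, 85.0, np.nan, 40.0],
    "Dine in option": [True, np.nan, True, np.nan],
    "Takeout option": [np.nan, True, True, np.nan],
})

cleaned = clean(toy)

# Post-cleaning checks: same shape, no missing values, boolean option columns.
assert cleaned.shape == toy.shape
assert cleaned.isna().sum().sum() == 0
assert cleaned["Dine in option"].dtype == bool
assert cleaned["Takeout option"].dtype == bool
print(cleaned.dtypes)
```

    Working on a copy keeps the raw frame available for comparison, which is what makes the shape check possible.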

    Task 2

    Graph 1 (Count of Rating) shows that the most common rating is 4.6, followed by 4.7. The majority of stores have a rating higher than 4.5.
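
    The statements read off the graph can also be backed by numbers. The sketch below shows one way to compute the most common rating and the share of stores rated above 4.5. The ratings are made-up toy values, not the figures from coffee.csv, and the names counts, most_common and share_above are illustrative.

```python
import pandas as pd

# Toy ratings (hypothetical). The 0.0 stands for a rating that was missing
# and filled with 0 during cleaning, so it is counted as its own value.
ratings = pd.Series([4.6, 4.6, 4.6, 4.7, 4.7, 4.3, 4.8, 0.0], name="Rating")

counts = ratings.value_counts()        # number of stores per rating value
most_common = counts.idxmax()          # rating with the highest count
share_above = (ratings > 4.5).mean()   # fraction of stores rated above 4.5

print(counts)
print(most_common, share_above)
```

    Because missing ratings were replaced with 0 in Task 1, they are included in these counts as a rating of 0, which is worth keeping in mind when interpreting the distribution.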