
    Data Scientist Associate Practical Exam Submission

    Use this template to complete your analysis and write up your summary for submission.

    Task 1

    Original data

    • booking_id : Same as description, no missing values
    • months_as_member : Same as description, no missing values; the minimum is 1 month
    • weight : 20 missing values, which I replaced with the overall average weight. The minimum weight is 55.41 kg, not 40.00 kg.
    • days_before : No missing values. I removed the text "days" and converted the whole column to type "int"
    • day_of_week : No missing values. Some labels that should be identical differ, such as "Wednesday" and "Wed", so I shortened every label to its first 3 letters
    • time : Same as description, no missing values
    • category : 13 missing values recorded as "-", which I replaced with "unknown"
    • attended : Same as description, no missing values
    import pandas as pd 
    import matplotlib.pyplot as plt
    import numpy as np 
    import seaborn as sns
    df = pd.read_csv("fitness_class_2212.csv")
    print(df.info())
    print("The minimum number of months as a member:",min(df["months_as_member"]))
    # Replace missing values in the weight column only, then check the minimum weight
    df_mod = df.copy()
    df_mod["weight"] = df_mod["weight"].fillna(df_mod["weight"].mean())
    print(df_mod.info())
    print("The minimum weight is", df_mod["weight"].min())
    # Remove the text " days" from days_before and convert the column to type int
    df_mod["days_before"] = df_mod["days_before"].str.replace(" days", "", regex=False)
    df_mod["days_before"] = df_mod["days_before"].astype("int")
    print(df_mod.info())
    # Keep the first 3 letters of each value so the day labels are consistent
    df_mod["day_of_week"] = df_mod["day_of_week"].str[:3]
    # Replace "-" with "unknown"
    df_mod["category"] = df_mod["category"].replace("-", "unknown")
    print(df_mod.iloc[55])
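
As a quick check on the cleaning steps above, the same operations can be run on a few made-up rows and asserted against the expected result. This is only a sketch: the function name `clean_bookings` and the sample rows are hypothetical and are not taken from fitness_class_2212.csv.

```python
import pandas as pd

def clean_bookings(raw: pd.DataFrame) -> pd.DataFrame:
    """Apply the four cleaning steps described above to a copy of the data."""
    out = raw.copy()
    # Fill missing weights with the mean of the observed weights
    out["weight"] = out["weight"].fillna(out["weight"].mean())
    # Strip the literal text " days" and convert to integers
    out["days_before"] = (
        out["days_before"].str.replace(" days", "", regex=False).astype(int)
    )
    # Keep the first three letters so "Wednesday" and "Wed" match
    out["day_of_week"] = out["day_of_week"].str[:3]
    # Treat "-" as an unknown category
    out["category"] = out["category"].replace("-", "unknown")
    return out

# Hypothetical rows that exercise each cleaning rule
raw = pd.DataFrame({
    "weight": [80.0, None, 60.0],
    "days_before": ["8", "2 days", "14"],
    "day_of_week": ["Wednesday", "Wed", "Fri."],
    "category": ["HIIT", "-", "Yoga"],
})
cleaned = clean_bookings(raw)
print(cleaned)
```

Wrapping the steps in one function makes it easy to rerun the same cleaning on the full file and to confirm that no column still contains missing values afterwards.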

    Task 2

    The bookings are not balanced across the two outcomes: 1046 bookings were not attended, compared with 454 that were attended.

    #a
    df_mod.select_dtypes(['object','bool']).nunique()

    Among the categorical columns, day_of_week has the most distinct values.

    # Count attended (1) and not attended (0) bookings
    attendence = df_mod[df_mod["attended"]==1]["attended"].count()
    absence = df_mod[df_mod["attended"]==0]["attended"].count()
    print("Total bookings:", len(df_mod))
    print("Attendance:", attendence)
    print("Absence:", absence)
    # Bar chart comparing the two outcomes
    x = ["Attendance", "Absence"]
    y = [attendence, absence]
    c = ["orange", "green"]
    plt.bar(x, y, color=c, alpha=0.5)
    plt.title("Number of attended and not attended bookings")
    plt.xlabel("attended")
    plt.ylabel("Number of bookings")
    plt.show()
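
The imbalance can also be expressed as shares rather than raw counts. The sketch below rebuilds an `attended` column from the two counts reported above (454 and 1046) instead of reading the CSV, so the Series here is a stand-in for `df_mod["attended"]`.

```python
import pandas as pd

# Stand-in for df_mod["attended"], rebuilt from the counts reported above
attended = pd.Series([1] * 454 + [0] * 1046, name="attended")

# Share of each outcome among all bookings
shares = attended.value_counts(normalize=True)
print(shares.round(3))
```

Roughly 30% of bookings were attended and 70% were not, which is worth keeping in mind when choosing evaluation metrics for a later model.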

    Task 3

    The distribution of months as a member is right-skewed. Most values fall between 8 and 19 months, and the median is 12 months. There are still many outliers above 40 months.
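
To make the outlier statement concrete, the usual box plot rule flags values above Q3 + 1.5 × IQR. The sketch below assumes that the 8 and 19 month figures are the first and third quartiles, which the text does not state explicitly; the helper `iqr_upper_fence` and the sample values are hypothetical and chosen only to reproduce those quartiles.

```python
import pandas as pd

def iqr_upper_fence(values: pd.Series, k: float = 1.5) -> float:
    """Return Q3 + k * IQR, the upper limit used by a standard box plot."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    return q3 + k * (q3 - q1)

# Hypothetical sample with Q1 = 8, median = 12, Q3 = 19
sample = pd.Series([1, 8, 8, 12, 12, 19, 19, 45, 60])

fence = iqr_upper_fence(sample)
outliers = sample[sample > fence]
print("Upper fence:", fence)
print("Outliers:", outliers.tolist())
```

Under that assumption the upper fence is 19 + 1.5 × 11 = 35.5 months. On the real data, the same function can be applied to `df_mod["months_as_member"]`.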