Data Scientist Associate Practical Exam Submission
Project Background
- GoalZone is a fitness club chain in Canada.
- GoalZone offers a range of fitness classes with two capacities: 25 and 15.
- Some classes are always fully booked. Fully booked classes often have a low attendance rate.
Project Goal
- GoalZone wants to increase the number of spaces available for classes.
- They want to do this by predicting whether the member will attend the class or not.
- If they can predict a member will not attend the class, they can make another space available.
Classification ML Algorithms Used
- Logistic Regression
- XGBoost with SMOTE
Dataset
The dataset contains attendance information for the classes scheduled so far this year. The data used for this analysis can be accessed here: fitness_class
| Column Name | Criteria |
|---|---|
| booking_id | Nominal. The unique identifier of the booking. Missing values are not possible due to the database structure. |
| months_as_member | Discrete. The number of months as this fitness club member, minimum 1 month. Replace missing values with the overall average month. |
| weight | Continuous. The member's weight in kg, rounded to 2 decimal places. The minimum possible value is 40.00 kg. Replace missing values with the overall average weight. |
| days_before | Discrete. The number of days before the class the member registered, minimum 1 day. Replace missing values with 0. |
| day_of_week | Ordinal. The day of the week of the class. One of “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat” or “Sun”. Replace missing values with “unknown”. |
| time | Ordinal. The time of day of the class. Either “AM” or “PM”. Replace missing values with “unknown”. |
| category | Nominal. The category of the fitness class. One of “Yoga”, “Aqua”, “Strength”, “HIIT”, or “Cycling”. Replace missing values with “unknown”. |
| attended | Nominal. Whether the member attended the class (1) or not (0). Missing values should be removed. |
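The criteria in the table can be turned into a quick automated check. The following is a minimal sketch, not the notebook's own validation code: `check_criteria` and the toy rows are hypothetical, and only the column names and allowed values come from the table.

```python
import pandas as pd

# Allowed values taken from the criteria table above.
VALID_DAYS = {"Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"}
VALID_TIMES = {"AM", "PM"}
VALID_CATEGORIES = {"Yoga", "Aqua", "Strength", "HIIT", "Cycling"}


def check_criteria(df: pd.DataFrame) -> dict:
    """Count the rows that violate each column's criteria (hypothetical helper)."""
    # Coerce to numbers so that text such as '2 days' becomes NaN and is counted.
    months = pd.to_numeric(df["months_as_member"], errors="coerce")
    weight = pd.to_numeric(df["weight"], errors="coerce")
    days = pd.to_numeric(df["days_before"], errors="coerce")
    ids = df["booking_id"]
    return {
        # Duplicated or missing identifiers.
        "booking_id": int(ids.duplicated().sum() + ids.isna().sum()),
        # NaN compares as False, so missing values are counted as violations too.
        "months_as_member": int((~(months >= 1)).sum()),
        "weight": int((~(weight >= 40)).sum()),
        "days_before": int((~(days >= 1)).sum()),
        "day_of_week": int((~df["day_of_week"].isin(VALID_DAYS)).sum()),
        "time": int((~df["time"].isin(VALID_TIMES)).sum()),
        "category": int((~df["category"].isin(VALID_CATEGORIES)).sum()),
        "attended": int((~df["attended"].isin([0, 1])).sum()),
    }


# Toy rows (made up for illustration) with one problem per affected column.
toy = pd.DataFrame({
    "booking_id": [1, 2, 3],
    "months_as_member": [17, 10, 16],
    "weight": [79.56, None, 74.53],
    "days_before": ["8", "2 days", "14"],
    "day_of_week": ["Wed", "Monday", "Sun"],
    "time": ["PM", "AM", "AM"],
    "category": ["Strength", "HIIT", "-"],
    "attended": [0, 0, 1],
})
report = check_criteria(toy)
print(report)
```

A non-zero count only flags a column for inspection; it does not distinguish a missing value from a badly formatted one.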
Task 1: Data Validation and Visualization
The dataset contains 1500 rows and 8 columns, with missing values before cleaning. I have validated the columns against the criteria in the dataset table.
- booking_id: same as description, without missing values, 1500 ids.
- months_as_member: same as description, without missing values.
- weight: same as description; 20 missing values replaced with the overall average weight.
- days_before: removed the trailing ' days' text from 25 values and changed the column datatype to integer.
- day_of_week: the column has inconsistent naming ('Wednesday' vs 'Wed', 'Monday' vs 'Mon', 'Fri.' vs 'Fri'). I edited the values to use the “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat” or “Sun” format.
- time: same as description, without missing values.
- category: converted "-" values to 'unknown'.
- attended: same as description, without missing values.
After data validation, the dataset still contains 1500 rows and 8 columns. After scaling the features and selecting the most informative ones with sklearn's mutual_info_classif, the following 8 columns were chosen to build the model (1500 rows and 8 columns):
'Log_months_as_member', 'LogScaled_weight', 'MinMax_Scaled_weight', 'category_unknown', 'MinMax_Scaled_days_before', 'category_Aqua', 'category_Yoga', 'time'
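The column names above suggest log and min-max transforms. The sketch below shows one way such columns could be derived; the exact transforms behind these names are not shown in this section, so the natural log, the plain min-max formula, and the helper name `add_scaled_columns` are all assumptions. The mutual_info_classif ranking step is not reproduced here.

```python
import numpy as np
import pandas as pd


def add_scaled_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Add log and min-max scaled copies of the numeric columns (hypothetical helper)."""
    out = df.copy()
    # Natural log (an assumption); defined because the criteria give
    # months_as_member >= 1 and weight >= 40.
    out["Log_months_as_member"] = np.log(out["months_as_member"])
    out["LogScaled_weight"] = np.log(out["weight"])
    # Min-max scaling maps each column onto [0, 1].
    # Assumes the column is not constant, otherwise this divides by zero.
    for col in ["weight", "days_before"]:
        lo, hi = out[col].min(), out[col].max()
        out[f"MinMax_Scaled_{col}"] = (out[col] - lo) / (hi - lo)
    return out


# Toy values (made up for illustration).
toy = pd.DataFrame({
    "months_as_member": [1, 10, 100],
    "weight": [40.0, 60.0, 80.0],
    "days_before": [1, 8, 15],
})
scaled = add_scaled_columns(toy)
print(scaled.round(3))
```

A log transform compresses long right tails, while min-max scaling puts columns with different units on a common range.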
Answer
Answer 1. For every column in the data:
a. State whether the values match the description given in the table above.
Not all columns follow the description in the table.
- days_before has object datatype because some values have a trailing ' days'.
- day_of_week has inconsistent naming ('Wednesday' vs 'Wed', 'Monday' vs 'Mon', 'Fri.' vs 'Fri').
- category has an unexpected "-" character as a value.
- weight has NA values, but the other values look accurate.
b. State the number of missing values in the column.
- The weight column has 20 missing values.
- The category column has 13 "-" values. These could be missing or incorrect entries.
c. Describe what you did to make values match the description if they did not match.
- days_before: removed the trailing ' days' string and converted the column to integer type.
- day_of_week: converted 'Wednesday' to 'Wed', 'Monday' to 'Mon', and 'Fri.' to 'Fri'.
- category: converted the 13 "-" entries to 'unknown'.
- weight: replaced the missing values with the overall average weight.
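The four fixes described in this answer can be sketched in pandas as follows. This is an illustration under stated assumptions rather than the notebook's actual cleaning code: `clean_bookings`, `DAY_FIXES`, and the toy rows are hypothetical, and rounding the fill value to 2 decimal places is an assumption based on the weight criteria.

```python
import pandas as pd

# The three inconsistent spellings reported for day_of_week.
DAY_FIXES = {"Monday": "Mon", "Wednesday": "Wed", "Fri.": "Fri"}


def clean_bookings(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the four fixes described above (hypothetical helper)."""
    out = df.copy()
    # Strip the trailing ' days' text, then cast to integer.
    # Assumes days_before has no missing values, as reported above.
    out["days_before"] = (
        out["days_before"]
        .astype(str)
        .str.replace(" days", "", regex=False)
        .str.strip()
        .astype(int)
    )
    # Map long or punctuated day names to the three-letter format.
    out["day_of_week"] = out["day_of_week"].replace(DAY_FIXES)
    # Treat the '-' placeholder as an unknown category.
    out["category"] = out["category"].replace("-", "unknown")
    # Fill missing weights with the overall mean, rounded to 2 decimals (assumption).
    out["weight"] = out["weight"].fillna(round(out["weight"].mean(), 2))
    return out


# Toy rows (made up for illustration).
toy = pd.DataFrame({
    "days_before": ["8", "2 days", "14"],
    "day_of_week": ["Wednesday", "Mon", "Fri."],
    "category": ["Strength", "-", "HIIT"],
    "weight": [80.0, None, 70.0],
})
cleaned = clean_bookings(toy)
print(cleaned)
```

Working on a copy keeps the raw frame available for comparison, in the same spirit as keeping `master_df` separate from `df` when loading the data.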
Setting style
%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th{
border:1px black solid !important;
color: black !important;
}
</style>
Data Validation
Load modules and data, then look at the first few observations
# Load modules
import warnings
warnings.filterwarnings("ignore")
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib_inline
import matplotlib.style as style
import seaborn as sns
# load data
master_df = pd.read_csv("fitness_class_2212.csv")
df = master_df.copy()
df.head()
Before editing: looking at non-null values and data types
# df.info() prints its summary directly and returns None, so no print() is needed
df.info()
Checking descriptive stats for numeric columns
df.describe()
Checking for missing values
print("Missing values")
print("-"*23)
print(df.isnull().sum())
print("-"*23)
print(f"Total missing values: {df.isnull().sum().sum()}")
Looking at value counts for categorical columns