Predicting Fitness Class Attendance

Data Scientist Associate Practical Exam Submission

Project Background

GoalZone is a fitness club chain in Canada.
GoalZone offers a range of fitness classes in two capacities - 25 and 15.
Some classes are always fully booked. Fully booked classes often have a low attendance rate.

Project Goal

GoalZone wants to increase the number of spaces available for classes.
They want to do this by predicting whether the member will attend the class or not.
If they can predict a member will not attend the class, they can make another space available.

Classification ML Algorithms Used

Logistic Regression
XGBoost with SMOTE

Dataset

The dataset contains the attendance information for the classes scheduled so far this year. The data used for this analysis is the fitness_class dataset, loaded below from fitness_class_2212.csv.

Column Name	Criteria
booking_id	Nominal. The unique identifier of the booking. Missing values are not possible due to the database structure.
months_as_member	Discrete. The number of months as a member of this fitness club, minimum 1 month. Replace missing values with the overall average month.
weight	Continuous. The member's weight in kg, rounded to 2 decimal places. The minimum possible value is 40.00 kg. Replace missing values with the overall average weight.
days_before	Discrete. The number of days before the class the member registered, minimum 1 day. Replace missing values with 0.
day_of_week	Ordinal. The day of the week of the class. One of “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat” or “Sun”. Replace missing values with “unknown”.
time	Ordinal. The time of day of the class. Either “AM” or “PM”. Replace missing values with “unknown”.
category	Nominal. The category of the fitness class. One of “Yoga”, “Aqua”, “Strength”, “HIIT”, or “Cycling”. Replace missing values with “unknown”.
attended	Nominal. Whether the member attended the class (1) or not (0). Missing values should be removed.

Task 1: Data Validation and Visualization

Before cleaning, the dataset contains 1500 rows and 8 columns, with some missing values. I validated each column against the criteria in the dataset table.

booking_id: matches the description, no missing values, 1500 ids.
months_as_member: matches the description, no missing values.
weight: matches the description; 20 missing values were replaced with the overall average weight.
days_before: removed the trailing ' days' from 25 values and converted the column to integer datatype.
day_of_week: the column has inconsistent naming ('Wednesday' vs 'Wed', 'Monday' vs 'Mon', 'Fri.' vs 'Fri'). I standardized the values to the “Mon”, “Tue”, “Wed”, “Thu”, “Fri”, “Sat” or “Sun” format.
time: matches the description, no missing values.
category: converted "-" values to 'unknown'.
attended: matches the description, no missing values.

After data validation, the dataset still contains 1500 rows and 8 columns. After scaling the features and selecting the most informative ones with sklearn's mutual_info_classif, the following 8 columns were chosen to build the model (1500 rows and 8 columns):

'Log_months_as_member', 'LogScaled_weight', 'MinMax_Scaled_weight', 'category_unknown', 'MinMax_Scaled_days_before', 'category_Aqua', 'category_Yoga', 'time'
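The scaling and feature-scoring step described above could look roughly like the sketch below. It is only an illustration under stated assumptions: the data is synthetic (not the GoalZone dataset), the helper `min_max` and the toy target are hypothetical, and the exact transformations used in the notebook may differ. Only `mutual_info_classif` and the feature-name pattern come from the text above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

# Synthetic stand-in for the cleaned data (hypothetical values, not the real dataset)
rng = np.random.default_rng(0)
n = 200
demo = pd.DataFrame({
    "months_as_member": rng.integers(1, 60, size=n),
    "weight": rng.uniform(40, 120, size=n).round(2),
    "days_before": rng.integers(1, 30, size=n),
})
# Toy target: in this fake data, longer memberships attend more often
attended = (demo["months_as_member"] + rng.normal(0, 10, size=n) > 30).astype(int)


def min_max(s: pd.Series) -> pd.Series:
    """Rescale a numeric column to the [0, 1] range (hypothetical helper)."""
    return (s - s.min()) / (s.max() - s.min())


# Assumed transformations, named after the columns listed above
features = pd.DataFrame({
    "Log_months_as_member": np.log(demo["months_as_member"]),
    "MinMax_Scaled_weight": min_max(demo["weight"]),
    "MinMax_Scaled_days_before": min_max(demo["days_before"]),
})

# Estimate the mutual information between each feature and the target
scores = mutual_info_classif(features, attended, random_state=0)
mi = pd.Series(scores, index=features.columns).sort_values(ascending=False)
print(mi)
```

Features with higher scores share more information with the target, so keeping the top-ranked ones is one plausible way to arrive at a shortlist like the one above.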

`Answer`

1. For every column in the data:

a. State whether the values match the description given in the table above.

Not all columns match the description in the table.
days_before has object datatype because some values have a trailing ' days'.
day_of_week has inconsistent naming ('Wednesday' vs 'Wed', 'Monday' vs 'Mon', 'Fri.' vs 'Fri').
category has an unexpected "-" character as a value.
weight has NA values, but the other values look accurate.

b. State the number of missing values in the column.

weight has 20 missing values.
category has 13 "-" values. These could be missing or incorrect entries.

c. Describe what you did to make values match the description if they did not match.

days_before: removed the trailing ' days' string and converted the column to integer type.
day_of_week: converted 'Wednesday' to 'Wed', 'Monday' to 'Mon' and 'Fri.' to 'Fri'.
category: converted the 13 "-" entries to 'unknown'.
weight: replaced the missing values with the overall average weight.
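The four fixes listed above can be sketched in pandas as follows. This is a minimal illustration on a tiny made-up sample (the frame `raw` and its values are hypothetical), not necessarily the exact code used in the notebook.

```python
import numpy as np
import pandas as pd

# Tiny hypothetical sample reproducing the problems described above
raw = pd.DataFrame({
    "days_before": ["8", "2 days", "14", "10 days"],
    "day_of_week": ["Wed", "Wednesday", "Fri.", "Monday"],
    "category": ["Yoga", "-", "HIIT", "Aqua"],
    "weight": [79.56, np.nan, 74.53, 86.12],
})

clean = raw.copy()

# days_before: strip the trailing ' days' text, then cast to integer
clean["days_before"] = (
    clean["days_before"]
    .str.replace(" days", "", regex=False)
    .str.strip()
    .astype(int)
)

# day_of_week: map the inconsistent spellings to the three-letter format
day_fixes = {"Wednesday": "Wed", "Monday": "Mon", "Fri.": "Fri"}
clean["day_of_week"] = clean["day_of_week"].replace(day_fixes)

# category: treat "-" as a missing entry and label it 'unknown'
clean["category"] = clean["category"].replace("-", "unknown")

# weight: fill missing values with the overall average weight
clean["weight"] = clean["weight"].fillna(clean["weight"].mean())

print(clean)
```

Mapping only the spellings actually observed keeps the day_of_week fix explicit; any other unexpected value would stay visible in a later value-count check instead of being silently altered.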

Setting style

%%HTML
<style type="text/css">
table.dataframe td, table.dataframe th{
    border:1px black solid !important;
    color: black !important;
}
</style>

Data Validation

Loading modules and data, and looking at the first few observations

# Load modules 
import warnings
warnings.filterwarnings("ignore")
import math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib_inline
import matplotlib.style as style
import seaborn as sns


# load data
master_df = pd.read_csv("fitness_class_2212.csv")

df = master_df.copy()
df.head()

Before Editing: Looking at non-null values and data types

# df.info() prints its summary directly and returns None, so no print() is needed
df.info()

Checking descriptive stats for numeric columns

df.describe()

Checking for missing values

print("Missing values")
print("-"*23)
print(df.isnull().sum())
print("-"*23)
print(f"Total missing values:{df.isnull().sum().sum()}")

Looking at value counts for categorical columns
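One way to inspect the categorical columns is sketched below. It is a hedged example on a small made-up frame (`sample` and its values are hypothetical); in the notebook the same loop would presumably run on `df`, and the choice of columns is an assumption based on the dataset table.

```python
import pandas as pd

# Hypothetical mini-sample; in the notebook this would be the loaded `df`
sample = pd.DataFrame({
    "day_of_week": ["Wed", "Wednesday", "Fri.", "Wed", "Mon"],
    "time": ["AM", "PM", "AM", "AM", "PM"],
    "category": ["Yoga", "-", "HIIT", "Yoga", "Aqua"],
})

# Columns assumed to be categorical, based on the dataset table
categorical_cols = ["day_of_week", "time", "category"]

# dropna=False also counts missing values, if there are any
counts = {col: sample[col].value_counts(dropna=False) for col in categorical_cols}

for col, vc in counts.items():
    print(f"--- {col} ---")
    print(vc)
```

Listing every distinct value with its frequency is what surfaces inconsistencies such as 'Wednesday' vs 'Wed' or the "-" category.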
