Kristian Aagaard/

Certification - Data Scientist Associate - Fitness Class (copy)


Data Scientist Associate Practical Exam Submission

Use this template to complete your analysis and write up your summary for submission.

Task 1

Before cleaning, the dataset contains 1500 rows and 8 columns, with missing values and several entries that do not follow the stated criteria. All columns are validated against the criteria in the dataset table:

  • booking_id: is as described, no missing values.
  • months_as_member: is as described, no missing values.
  • weight: has 20 missing values. The missing values are replaced by the overall average weight.
  • days_before: 25 rows have the string 'days' appended to the number. The string is removed and the column is converted to int dtype.
  • day_of_week: has 6 rows that don't follow the described categories. These are replaced so that they follow the description.
  • time: is as described, no missing values.
  • category: has 13 rows that don't follow the described categories. These are replaced with 'unknown', according to the dataset description.
  • attended: is as described, no missing values.

After the data validation, the dataset contains 1500 rows and 8 columns.

# Import numpy under its usual alias
import numpy as np
# Import Pandas under its usual alias
import pandas as pd

# Read in the fitness_class CSV as a DataFrame
gz_df = pd.read_csv('fitness_class_2212.csv')

# Printing the first five rows of the DataFrame
print(gz_df.head())

# Printing the data type of the columns
print(gz_df.dtypes)

# Replacing missing values in the weight column
gz_df['weight'] = gz_df['weight'].fillna(gz_df['weight'].mean())

# Removing the ' days' suffix from the days_before column and converting to int
gz_df['days_before'] = gz_df['days_before'].str.replace(' days', '', regex=False)
gz_df['days_before'] = gz_df['days_before'].astype('int')

# Replacing wrong values in the day_of_week column
gz_df['day_of_week'] = gz_df['day_of_week'].str.replace('Monday', 'Mon', regex=False)
gz_df['day_of_week'] = gz_df['day_of_week'].str.replace('Wednesday', 'Wed', regex=False)
# Strip leading and trailing periods from the day abbreviations
gz_df['day_of_week'] = gz_df['day_of_week'].str.strip('.')

# Replacing wrong values in the category column
gz_df['category'] = gz_df['category'].replace('-', 'unknown')
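
The cleaning rules described in Task 1 can be checked with a few assertions. The following is a minimal sketch that applies the same steps to a small hypothetical sample (the `raw` DataFrame and the `valid_days` list are assumptions for illustration, not the real data or the official dataset table) and then verifies the result.

```python
import pandas as pd

# Hypothetical sample mimicking the issues described in Task 1 (not the real data)
raw = pd.DataFrame({
    "weight": [80.0, None, 70.0],
    "days_before": ["8", "12 days", "4"],
    "day_of_week": ["Monday", "Fri.", "Wed"],
    "category": ["HIIT", "-", "Yoga"],
})

# Apply the same cleaning steps as in the report
clean = raw.copy()
clean["weight"] = clean["weight"].fillna(clean["weight"].mean())
clean["days_before"] = (
    clean["days_before"].str.replace(" days", "", regex=False).astype(int)
)
clean["day_of_week"] = (
    clean["day_of_week"]
    .str.replace("Monday", "Mon", regex=False)
    .str.replace("Wednesday", "Wed", regex=False)
    .str.strip(".")
)
clean["category"] = clean["category"].replace("-", "unknown")

# Allowed day labels are assumed here, not copied from the dataset table
valid_days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Each check is True when the cleaned sample satisfies the rule
checks = {
    "no_missing_values": bool(clean.notna().all().all()),
    "days_before_is_int": pd.api.types.is_integer_dtype(clean["days_before"]),
    "day_of_week_valid": bool(clean["day_of_week"].isin(valid_days).all()),
    "no_dash_category": bool((clean["category"] != "-").all()),
}
print(checks)
```

Running the same checks on gz_df after cleaning would confirm that no column still violates the criteria.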

Task 2

Graph 1 Category Attended shows that HIIT is the most attended category with 213 participants, followed by Cycling with 110 participants and Strength with 62 participants. HIIT and Cycling are by far the two most popular categories. HIIT alone accounts for 213 of the 385 attendances across these three categories, roughly 55%. This shows that the observations are imbalanced across the categories.

# Import matplotlib under its usual alias
import matplotlib.pyplot as plt

# Count attended bookings per category (attended is 0 or 1, so the sum is the count)
category = gz_df.groupby('category')['attended'].sum()
ax = category.plot(kind='bar', color='tab:blue', title='Graph 1 Category Attended')
ax.set_ylabel('Attended bookings')
plt.show()
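
The share of attendances per category can also be computed directly instead of being read off the graph. The sketch below uses a small hypothetical `bookings` table (not the real data) to show the calculation; the column names `share_of_attended` and `attendance_rate` are introduced here for illustration.

```python
import pandas as pd

# Hypothetical bookings (not the real data): attended is 1 if the member showed up
bookings = pd.DataFrame({
    "category": ["HIIT", "HIIT", "HIIT", "Cycling", "Cycling", "Yoga"],
    "attended": [1, 1, 0, 1, 0, 1],
})

# Attended bookings and total bookings per category
summary = bookings.groupby("category")["attended"].agg(attended="sum", bookings="count")

# Share of all attendances that falls in each category
summary["share_of_attended"] = summary["attended"] / summary["attended"].sum()

# Fraction of bookings in each category that were actually attended
summary["attendance_rate"] = summary["attended"] / summary["bookings"]

print(summary.sort_values("attended", ascending=False))
```

The share shows how imbalanced the attendances are across categories, while the attendance rate separates popularity from how reliably members show up.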

Task 3

Graph 2-1 Months as Member shows that the distribution is right-skewed: the majority of the data is concentrated on the left side of the graph, with some large outliers on the right side. This is also confirmed by the kernel density estimate, shown by the blue line. Because the data is skewed, a log transformation has been applied. The result is shown in Graph 2-2, where the data resembles a normal distribution much more closely.

# Import seaborn under its usual alias
import seaborn as sns

# Histogram of the raw variable with a kernel density estimate
sns.histplot(data=gz_df, x='months_as_member', bins=35, kde=True)
plt.title('Graph 2-1 Months as Member')
plt.xlabel('Months as Member')
plt.show()

# Log-transform the variable and plot it on a separate figure
gz_df['log_months_as_member'] = np.log(gz_df['months_as_member'])
sns.histplot(data=gz_df, x='log_months_as_member', bins=35, kde=True)
plt.title('Graph 2-2 Months as Member (Log Transformation)')
plt.xlabel('Log of Months as Member')
plt.show()
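
The effect of the log transformation can also be expressed as a number by comparing the sample skewness before and after. This is a minimal sketch on synthetic right-skewed data (the lognormal sample is an assumption standing in for months_as_member, not the real column).

```python
import numpy as np
import pandas as pd

# Synthetic right-skewed sample standing in for months_as_member (assumption).
# np.ceil keeps every value at 1 or above, so the logarithm is defined.
rng = np.random.default_rng(0)
months = pd.Series(np.ceil(rng.lognormal(mean=2.5, sigma=0.6, size=1000)))

# Sample skewness: positive values indicate a longer right tail
skew_raw = months.skew()
skew_log = np.log(months).skew()

print(f"skewness before: {skew_raw:.2f}, after log: {skew_log:.2f}")
```

A skewness close to zero after the transformation supports the visual impression that the transformed data is more symmetric.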

Task 4

Graph 3-1 Attendance vs Months as Member shows that the most active members are those who have been members for 10 to 30 months. Beyond that, activity declines the longer a person has been a member. This is consistent with Graph 2-1 Months as Member, which shows that the highest concentration of members is between 5 and 13 months. After 13 months the number of members falls steadily until about 40 months of membership, beyond which there are very few members.

Graph 3-2 Attendance vs Months as Member shows that the members who do not attend their classes have been members for about the same length of time as the majority of the members who do attend. The graph also shows that everyone who has been a member for more than 60 months attended the classes they signed up for.

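# Number of attended bookings for each value of months_as_member
attendance = gz_df.groupby('months_as_member')['attended'].sum().reset_index()
# Drop the first three rows (the three shortest membership lengths)
attendance = attendance.iloc[3:, :]
sns.relplot(data=attendance, x='months_as_member', y='attended')
plt.title('Graph 3-1 Attendance vs Months as Member')
plt.xlabel('Months as Member')
plt.ylabel('Attended bookings')

# Each booking plotted by membership length and attendance (0 or 1)
sns.relplot(data=gz_df, x='months_as_member', y='attended', hue='attended', palette='Blues')
plt.title('Graph 3-2 Attendance vs Months as Member')
plt.xlabel('Months as Member')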
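
Because the number of members differs a lot between membership lengths, raw attendance counts partly reflect how many members there are. One way to separate the two is to look at the attendance rate per membership bin. The sketch below uses hypothetical data and illustrative bin edges (both are assumptions, not part of the original analysis).

```python
import pandas as pd

# Hypothetical bookings (not the real data)
bookings = pd.DataFrame({
    "months_as_member": [3, 8, 12, 15, 22, 28, 35, 47, 65, 80],
    "attended": [0, 0, 0, 1, 0, 1, 1, 1, 1, 1],
})

# Bin edges are illustrative assumptions; intervals are closed on the right
bins = [0, 12, 24, 36, 60, 200]
bookings["tenure_bin"] = pd.cut(bookings["months_as_member"], bins=bins)

# Mean of a 0/1 column is the attendance rate; count is the number of bookings
rate = bookings.groupby("tenure_bin", observed=True)["attended"].agg(["mean", "count"])
print(rate)
```

Reporting the count next to the rate makes it clear when a high rate rests on only a handful of bookings.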

Task 5

In this assignment we are asked to predict whether a member will attend the class they signed up for. In other words, we are predicting a binary outcome, so this is a classification problem.

# Import ML models and performance metrics
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn import svm 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
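
The imports above suggest a workflow of splitting the data, scaling the features, fitting a k-nearest neighbours and a support vector classifier, and comparing accuracy. The following is a minimal sketch of that workflow on synthetic data; the features, the target, and all hyperparameters are assumptions for illustration and not necessarily the setup used in this submission.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the features and the binary 'attended' target (assumption)
rng = np.random.default_rng(42)
X = rng.normal(size=(400, 3))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# Hold out a stratified test set so both classes keep their proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Scaling lives inside each pipeline, so it is fitted on the training data only
models = {
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "svm": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
}

# Fit each model and record its accuracy on the held-out test set
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))

print(scores)
```

Both models are distance-based, which is why the features are standardised before fitting.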
