Tanis Anderson

DSA Practical


Data Science Associate

Task 1

This dataset has 1,500 rows with 8 columns. Before cleaning, some columns contain either missing values or inconsistent data entries that don't comply with the descriptions in the dataset table:

  • booking_id: Same as the description.
  • months_as_member: Same as the description with no missing values.
  • weight: Contains 20 missing values, which were replaced with the average weight.
  • days_before: Values are stored as strings, and some entries contain the word "days" after the number. The strings were reduced to the numeric value and then converted to the integer data type.
  • day_of_week: Some values contain the full day name; these were converted to the abbreviated name and all values were stripped of any other characters (some values had an extra period).
  • time: Same as the description with no missing values.
  • category: Contains values that don't match the description ("-") which were replaced with the value "Unknown".
  • attended: Same as the description with no missing values.

Original dataset

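#Loading necessary libraries for analysis
#(assumed: tidyverse, which provides read_csv, the pipe, and the stringr/tidyr helpers used below)
library(tidyverse)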

#Importing the dataset and viewing the structure
df <- read_csv("fitness_class_2212.csv", show_col_types = FALSE)

Finding number of missing values for each column
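
One way to count missing values per column is sketched below in base R. The data frame `toy` and its values are hypothetical stand-ins, not the booking data; the same call on `df` would give the per-column counts for the real dataset.

```r
# Hypothetical data frame with a few gaps (not the real booking data)
toy <- data.frame(
  booking_id = 1:4,
  weight = c(79.6, NA, 74.5, NA),
  category = c("HIIT", "Yoga", NA, "Aqua")
)

# is.na() marks each missing cell; colSums() counts the marks per column
na_counts <- colSums(is.na(toy))
print(na_counts)
```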


Examining categorical variables

cat_vars <- c("day_of_week", "time", "category", "attended")

for(x in cat_vars){
	print(df %>% count(.data[[x]]))
}

Examining numeric variables

summary(df %>% select(!all_of(cat_vars)))

#Taking a closer look at days_before, since it is stored as a character column
df %>%
	filter(nchar(days_before) > 5)

Cleaning the categorical variables

#Cleaning the day_of_week variable, starting by stripping away extra periods
df$day_of_week <- str_replace_all(df$day_of_week, "\\.", "")

#Converting long day names to their abbreviated version
df <- df %>% 
	mutate(day_of_week = case_when(day_of_week == "Monday" ~ "Mon",
								   day_of_week == "Wednesday" ~ "Wed",
								   TRUE ~ day_of_week))
print(count(df, day_of_week))

#Cleaning the category variable by replacing the hyphens ( - ) with "Unknown"
df$category <- str_replace(df$category, "-", "Unknown")
print(count(df, category))

Cleaning numeric variables

#Replacing all NA values in the weight column with the average weight
df$weight <- replace_na(df$weight, mean(df$weight, na.rm = TRUE))
df$weight <- round(df$weight, 2)

#Stripping away all non-numeric values in the days_before column then converting it to class integer
df$days_before <- as.integer(str_replace_all(df$days_before, "[^0-9]", ""))
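
As a quick sanity check of the `[^0-9]` pattern, the sketch below applies the same idea to a few made-up entries. It uses base R's `gsub` in place of `str_replace_all`, and `raw_days` is a hypothetical vector, not data from the file.

```r
# Hypothetical raw entries mimicking the mixed formats described for days_before
raw_days <- c("8", "12 days", "3 days", "10")

# Remove every character that is not a digit, then convert to integer
clean_days <- as.integer(gsub("[^0-9]", "", raw_days))

# Confirm the conversion produced no NA values and the expected type
stopifnot(!anyNA(clean_days), is.integer(clean_days))
print(clean_days)
```

Note that this pattern would also drop a minus sign or a decimal point, so it is only suitable for columns that hold non-negative whole numbers.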

Task 2

From Graph 1 we see that the HIIT category has the highest number of attended bookings, with Cycling having the second most. Observations are not balanced; the total number attended differs across categories.

#Loading the ggplot2 library for graphing
library(ggplot2)

ggplot(df, aes(category, attended)) +
	geom_col() +
	labs(y = "Total Attended",  title = "Graph 1  Attendance by Category")
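
The bar heights in Graph 1 are the sums of attended within each category, and the same totals can be tabulated directly. The sketch below uses base R on a small hypothetical data frame (`toy` and its rows are invented for illustration).

```r
# Hypothetical bookings; attended is 1 for attended and 0 otherwise
toy <- data.frame(
  category = c("HIIT", "HIIT", "Cycling", "Yoga", "HIIT", "Cycling", "HIIT"),
  attended = c(1, 0, 1, 0, 1, 1, 1)
)

# Sum attended within each category: the totals that the stacked columns add up to
totals <- aggregate(attended ~ category, data = toy, FUN = sum)

# Sort so the category with the most attended bookings comes first
totals <- totals[order(-totals$attended), ]
print(totals)
```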

Task 3

As shown in Graph 2-1, the distribution of the months_as_member variable is right (positively) skewed. Outliers give the distribution a long tail, with the biggest outlier at around 150 months. If we take the log of the months_as_member variable, the distribution becomes very close to a normal distribution, as shown in Graph 2-2.

#Examining the distribution of the months_as_member variable
ggplot(df, aes(months_as_member)) +
  geom_histogram(aes(y = after_stat(density)), fill = "white", color = "black") +
  geom_density(color = "black", linewidth = 0.8) +
  labs(x = "Months as Member", title = "Graph 2-1 Distribution of Months as Member")
#Performing a log transformation on months_as_member to see how the distribution changes once the outliers are compressed
ggplot(df, aes(log(months_as_member))) +
  geom_histogram(aes(y = after_stat(density)), fill = "white", color = "black") +
  geom_density(color = "black", linewidth = 0.8) +
  labs(x = "Months as Member (Log transformed)", title = "Graph 2-2 Distribution of Months as Member (Log Transformed)")
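
The skew seen in the graphs can also be put into a number. The sketch below is a minimal illustration, not part of the original analysis: `skewness` is a helper defined here (moment-based definition), and `months_toy` is a hypothetical vector with one long-tail value.

```r
# Moment-based sample skewness: third central moment over the 1.5 power
# of the second central moment (other definitions add a small-sample factor)
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Hypothetical membership lengths in months, with one long-tail value
months_toy <- c(1, 2, 3, 4, 5, 6, 8, 10, 12, 15, 20, 30, 60, 148)

# Positive values indicate a right skew; values near 0 indicate symmetry
skew_raw <- skewness(months_toy)
skew_log <- skewness(log(months_toy))
print(c(raw = skew_raw, log = skew_log))
```

On data shaped like this, the log transformation pulls the skewness much closer to zero, which is the numeric counterpart of the change from Graph 2-1 to Graph 2-2.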

Task 4

The median number of months as a member is higher for those who attended the class than for those who didn't. Most of those who did not attend had been members for roughly 1 to 14 months, while those who attended had most often been members for more than 14 months. As shown in Graph 3-1, over three-quarters of the attended bookings came from members whose tenure exceeded the median tenure of those who didn't attend. To improve interpretability, the biggest outlier has been removed in Graph 3-2.

#Creating a boxplot of the months_as_member variable, grouped by the attended variable (0 and 1)
ggplot(df, aes(attended, months_as_member, group = attended)) +
  geom_boxplot() +
  labs(y = "Months as member", title = "Graph 3-1 Boxplot between Months as Member and Attended")
#Filter out the biggest outlier to improve readability
ggplot(df %>% filter(months_as_member < 140), aes(attended, months_as_member, group = attended)) +
  geom_boxplot() +
  labs(y = "Months as member", title = "Graph 3-2 Boxplot between Months as Member and Attended (Outlier Removed)")
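
The comparisons read off the boxplots can be checked numerically. The sketch below uses base R on a hypothetical data frame (`toy` and its values are invented); it computes the median per group and the share of attended bookings whose tenure exceeds the non-attended median.

```r
# Hypothetical bookings (not the real data)
toy <- data.frame(
  attended = c(0, 0, 0, 0, 1, 1, 1, 1),
  months_as_member = c(3, 8, 10, 14, 12, 18, 25, 40)
)

# Median tenure for each attendance group; the result is named "0" and "1"
medians <- tapply(toy$months_as_member, toy$attended, median)

# Share of attended bookings with tenure above the non-attended median
attended_months <- toy$months_as_member[toy$attended == 1]
share_above <- mean(attended_months > medians[["0"]])

print(medians)
print(share_above)
```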

Task 5

Since the business wants to predict whether someone will attend (1) or not attend (0) a class, this is a binary classification problem, for which logistic regression is a natural choice.
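
To make the problem type concrete, the sketch below fits a logistic regression with base R's `glm` on a small hypothetical data frame. The data, the single predictor, and the 0.5 cut-off are all assumptions chosen for illustration, not results from the booking data.

```r
# Hypothetical bookings with overlapping groups (not the real data)
toy <- data.frame(
  months_as_member = c(2, 4, 5, 7, 9, 11, 13, 15, 18, 22, 25, 30),
  attended         = c(0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1)
)

# family = binomial makes glm() fit a logistic regression
fit <- glm(attended ~ months_as_member, data = toy, family = binomial)

# Predicted attendance probabilities for two hypothetical members
new_members <- data.frame(months_as_member = c(3, 24))
prob <- predict(fit, newdata = new_members, type = "response")

# Turn probabilities into class labels with an assumed 0.5 cut-off
pred <- as.integer(prob > 0.5)
print(data.frame(new_members, prob = prob, pred = pred))
```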