Recipe Site Traffic


Load required packages

# Load required packages
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringr)
library(readr)
library(rsample)
library(rpart)
library(reshape2)
library(caret)

# Set seed
set.seed(1)

# Read the data and store them in a data frame
df <- read.csv('https://s3.amazonaws.com/talent-assets.datacamp.com/recipe_site_traffic_2212.csv')
df

Data Validation

The dataset contains 947 rows and 8 columns with missing values before cleaning. Below is the summary regarding all the validation steps that were followed for each column (variable) separately.

recipe: The values of this column are integers, as they should be, since they correspond to the recipe ID. All values are in range (i.e., no negative values) and there are no missing values, so the column was left untouched.
calories: This column is correctly numeric, since it measures the number of calories in a serving of the recipe. There are 52 missing values. Because the boxplots for each recipe category show outliers, I replaced the missing values with the median of the corresponding category. Looking at the summary statistics, all values are in range (i.e., no negative values).
carbohydrate: This column is correctly numeric, since it measures the amount of carbohydrate in a serving of the recipe (in grams). There are 52 missing values. Because the boxplots for each recipe category show outliers, I replaced the missing values with the median of the corresponding category. Looking at the summary statistics, all values are in range (i.e., no negative values).
sugar: This column is correctly numeric, since it measures the amount of sugar in a serving of the recipe (in grams). There are 52 missing values. Because the boxplots for each recipe category show outliers, I replaced the missing values with the median of the corresponding category. Looking at the summary statistics, all values are in range (i.e., no negative values).
protein: This column is correctly numeric, since it measures the amount of protein in a serving of the recipe (in grams). There are 52 missing values. Because the boxplots for each recipe category show outliers, I replaced the missing values with the median of the corresponding category. Looking at the summary statistics, all values are in range (i.e., no negative values).
category: The values are of class character, so they need to be converted to factor. After the conversion, the factor levels show one extra category, “Chicken Breast”, which shouldn’t be present. Every instance of this value was replaced with the “Chicken” value. There are no missing values.
servings: The values are of class character, so they need to be converted to a numeric type, specifically integer. Converting them to integer directly would introduce 3 NAs by coercion. Filtering for strings longer than one character reveals 3 cases in which the number of servings is followed by the phrase “as a snack”. The parse_number() function from the readr package removes this phrase and keeps only the numeric part, which is then converted to integer. The column has no missing values and its value range is correct (i.e., no negative values).
high_traffic: The column contains 373 missing values. These correspond to site traffic that is not high, so they were replaced with the string “Low”. Afterwards, the column was converted to factor.
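
The servings cleanup described above can be illustrated on a few made-up strings. This is only a minimal sketch: the vector `servings_raw` is hypothetical and merely mimics the format described for the column, and it assumes the readr package is installed.

```r
library(readr)

# Hypothetical raw values mimicking the format described above (not the real data)
servings_raw <- c("4", "6", "4 as a snack", "1")

# Direct coercion turns the annotated value into NA (and emits a warning)
direct <- suppressWarnings(as.integer(servings_raw))

# parse_number() keeps only the numeric part, which is then converted to integer
parsed <- as.integer(parse_number(servings_raw))

print(direct)
print(parsed)
```

Comparing `direct` with `parsed` shows why the text is stripped before the conversion rather than after it.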

For the replacement of the missing values in the variables calories, carbohydrate, sugar and protein, since each measurement corresponds to one serving, I grouped the data frame only by category and applied the replace_na() and median() functions across all four variables.

The cleaned dataset, after the validation of the data and replacement of the missing values (all of them correspond to the same recipes), contains 947 observations.
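
The statement that all missing nutrient values belong to the same recipes could be checked along the following lines. This is a sketch on hypothetical toy data (the data frame `toy` and the helper names are illustrative, not part of the original analysis); the same expression would be applied to `df` before imputation.

```r
# Hypothetical toy data: NA appears either in all four nutrient columns or in none
toy <- data.frame(
  calories     = c(120, NA, 300, NA),
  carbohydrate = c(10,  NA, 45,  NA),
  sugar        = c(2,   NA, 12,  NA),
  protein      = c(8,   NA, 20,  NA)
)

nutrient_cols <- c("calories", "carbohydrate", "sugar", "protein")

# Count the missing nutrient values in each row
na_per_row <- rowSums(is.na(toy[, nutrient_cols]))

# TRUE when every row is either complete or missing all four values
same_rows <- all(na_per_row %in% c(0, length(nutrient_cols)))

print(table(na_per_row))
print(same_rows)
```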

# Check the structure of the original data frame
str(df)

# Column "recipe"
str(df$recipe) # Values are integers, as they should be, since this is the recipe ID
summary(df$recipe) # Values are in range (no negative values)
sum(is.na(df$recipe)) # There are no missing values

# Column "calories"
str(df$calories) # Values are numeric, as they should be
summary(df$calories) # Values are in range (no negative values)
sum(is.na(df$calories)) # There are 52 missing values

# Boxplot of calories by category
ggplot(df, aes(x=calories, y=category, fill = category)) +
  geom_boxplot() +
  labs(x="Calories per serving", 
       y="Category", 
       title = "Boxplot of calories by category",
       fill="Category")

# There are outliers in some categories, thus the missing values will be replaced in a later step with the median of calories based on the category.

# Column "carbohydrate"
str(df$carbohydrate) # Values are numeric, as they should be
summary(df$carbohydrate) # Values are in range (no negative values)
sum(is.na(df$carbohydrate)) # There are 52 missing values

# Boxplot of carbohydrates by category
ggplot(df, aes(x=carbohydrate, y=category, fill = category)) +
  geom_boxplot() +
  labs(x="Carbohydrates per serving (in grams)", 
       y="Category", 
       title = "Boxplot of carbohydrates by category",
       fill="Category")

# There are outliers in all categories, thus the missing values will be replaced in a later step with the median of carbohydrates based on the category.

# Column "sugar"
str(df$sugar) # Values are numeric, as they should be
summary(df$sugar) # Values are in range (no negative values)
sum(is.na(df$sugar)) # There are 52 missing values

# Boxplot of sugar by category
ggplot(df, aes(x=sugar, y=category, fill = category)) +
  geom_boxplot() +
  labs(x="Sugar per serving (in grams)", 
       y="Category", 
       title = "Boxplot of sugar by category",
       fill="Category")

# There are outliers in all categories, thus the missing values will be replaced in a later step with the median of sugar based on the category.

# Column "protein"
str(df$protein) # Values are numeric, as they should be
summary(df$protein) # Values are in range (no negative values)
sum(is.na(df$protein)) # There are 52 missing values

# Boxplot of protein by category
ggplot(df, aes(x=protein, y=category, fill = category)) +
  geom_boxplot() +
  labs(x="Protein per serving (in grams)", 
       y="Category", 
       title = "Boxplot of protein by category",
       fill="Category")

# There are outliers in some categories, thus the missing values will be replaced in a later step with the median of protein based on the category.

# Column "category"
glimpse(df$category) # Values are character and need to be converted to factor
sum(is.na(df$category)) # No missing values
df$category <- factor(df$category)
levels(df$category) # Need to replace "Chicken Breast" with the value "Chicken"
df <- df %>% mutate(category = replace(category, category == "Chicken Breast", "Chicken"))
df$category <- factor(df$category)
levels(df$category) # 10 categories in total as stated in the Data information table

# Column "servings"
glimpse(df$servings) # Values should be numeric based on the Data information, instead of character

df %>% filter(str_length(servings)>1) # In 3 cases there are values with more than one character.
df$servings <- as.integer(parse_number(df$servings)) # Keep only number from string and convert to integer

sum(is.na(df$servings)) # No missing values
summary(df$servings) # Values are in range (no negative values)

# Column "high_traffic"
glimpse(df$high_traffic)
sum(is.na(df$high_traffic)) # Has 373 missing values that correspond to low traffic

# Replace all missing values with the value "Low" to indicate not "High" traffic to the site
df <- df %>% 
  mutate(high_traffic = ifelse(is.na(high_traffic), "Low", high_traffic))

df$high_traffic <- factor(df$high_traffic) # Convert to factor
str(df$high_traffic)
sum(is.na(df$high_traffic)) # No missing values

# Now all the missing values that were found in previous steps will be replaced in the data frame with the median of the variable in each category.
df_clean <- df %>%
  group_by(category) %>%
  mutate(across(c(calories, carbohydrate, sugar, protein),
                ~replace_na(.x, median(.x, na.rm = TRUE)))) %>%
  ungroup()

str(df_clean) # 947 observations

Exploratory Analysis

Figure 1 depicts the bar chart of the variable high_traffic. It appears that 535 recipes generated high traffic to the website when shown, compared to 360 which generated low traffic. This distribution indicates that the observations are not fully balanced across the two groups (the High/Low ratio is roughly 3 to 2). However, this imbalance is mild, so no further action will be taken to address it.
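
The class balance behind a bar chart like this can also be summarised numerically. The sketch below uses a hypothetical label vector `traffic` as a stand-in for `df_clean$high_traffic`; the counts it produces are illustrative only.

```r
# Hypothetical labels standing in for df_clean$high_traffic
traffic <- factor(c("High", "High", "High", "Low", "Low"))

# Absolute counts and relative shares of each class
counts <- table(traffic)
shares <- prop.table(counts)

# High/Low ratio as a single imbalance indicator
ratio <- counts[["High"]] / counts[["Low"]]

print(counts)
print(round(shares, 2))
print(ratio)
```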

Figure 2 depicts the histogram of the variable calories. The distribution is right-skewed, with one distinct peak around 200-250 calories. Additionally, a few recipes have extremely high values (over 1800 calories), which are clearly outliers. Overall, most of the recipes are low in calories, with the exception of a handful that are very high in calories.

Figure 3 depicts the percent stacked bar plot of the variable category by the variable high_traffic. Recipes in the categories “Vegetable”, “Potato” and “Pork” generate extremely high traffic to the site, followed by recipes in the categories “One Dish Meal”, “Meat”, “Lunch/Snacks” and “Dessert”. These 7 categories can be described as popular. In contrast, recipes in the categories “Chicken”, “Breakfast” and “Beverages” don’t generate high enough traffic and are therefore less popular. This indicates that the category of the recipe has a significant impact on the site traffic.
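
The percentages shown in a percent stacked bar plot correspond to row-wise proportions of a category by traffic table. A minimal sketch on hypothetical toy data (the values in `toy` are made up and do not reproduce the real shares):

```r
# Hypothetical toy data standing in for df_clean
toy <- data.frame(
  category     = c("Potato", "Potato", "Potato",
                   "Beverages", "Beverages", "Beverages", "Beverages"),
  high_traffic = c("High", "High", "High",
                   "Low", "Low", "Low", "High")
)

# Cross-tabulate category against traffic level
tab <- table(toy$category, toy$high_traffic)

# Row-wise proportions: share of High/Low within each category
row_share <- prop.table(tab, margin = 1)

# Categories ordered by their share of high-traffic recipes
high_share <- sort(row_share[, "High"], decreasing = TRUE)
print(round(high_share, 2))
```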

Figures 4-8 present violin plots of all numeric variables (calories, carbohydrate, sugar, protein and servings) by the variable high_traffic. Comparing the violin plots for high and low site traffic for each variable, the two distributions are nearly indistinguishable in most cases. This leads us to the conclusion that these 5 variables don't play a significant role in the site traffic.
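
A numeric complement to the violin plots is to compare group medians, which are robust to the outliers noted earlier. The sketch below is an assumption about how such a check could look, not part of the original analysis; `toy` is hypothetical data and only two of the five variables are shown.

```r
# Hypothetical toy data standing in for df_clean
toy <- data.frame(
  high_traffic = c("High", "High", "High", "Low", "Low", "Low"),
  calories     = c(200, 250, 900, 210, 260, 850),
  protein      = c(10, 20, 30, 12, 18, 33)
)

# Median of each numeric variable within each traffic group
group_medians <- aggregate(cbind(calories, protein) ~ high_traffic,
                           data = toy, FUN = median)
print(group_medians)
```

Similar medians in both groups would be consistent with the visual impression that the distributions overlap.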

Figure 9 shows that the correlations among the 5 numeric variables are very weak, since all the values are close to 0 (the highest correlation coefficient is 0.18). This is convenient, since we won't have problems with multicollinearity and can use all the numeric variables as features for the model development.
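
The largest pairwise correlation can be read off a correlation matrix directly. The sketch below uses randomly generated placeholder data in `toy` and the default Pearson method of `cor()`; which method the original figure used is not stated, so that choice is an assumption.

```r
set.seed(1)

# Hypothetical placeholder data with the same five column names
toy <- data.frame(
  calories     = rnorm(50),
  carbohydrate = rnorm(50),
  sugar        = rnorm(50),
  protein      = rnorm(50),
  servings     = rnorm(50)
)

# Pairwise correlation matrix (Pearson by default)
corr <- cor(toy)

# Largest absolute correlation, ignoring the diagonal
max_abs <- max(abs(corr[upper.tri(corr)]))

print(round(corr, 2))
print(max_abs)
```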

