
    🧁 Recipe Site Traffic

    You can find my presentation of this project (video + PowerPoint) here. It is a non-technical presentation for the Product Manager from Tasty Bytes who requested this analysis.

    1. 📖 Introduction

    Tasty Bytes was founded in 2020 in the midst of the Covid pandemic. It started as a search engine for recipes, helping people find ways to use up the limited supplies they had at home. Now, over two years on, it is a fully fledged business.

    2. 🎯 Goals for this project

    At the moment, the team chooses their favorite recipe from a selection and displays it on the home page. They have noticed that traffic to the rest of the website goes up by as much as 40% when they pick a popular recipe, but they don’t know how to decide whether a recipe will be popular. More traffic means more subscriptions, so this is really important to the company.

    The product manager from Tasty Bytes specifically requested to:

    🟢 Predict which recipes will lead to high traffic.

    🟢 Correctly predict high traffic recipes 80% of the time.

    3. 💾 The dataset

    The product manager from Tasty Bytes provided data for each recipe, as well as whether there was high traffic when the recipe was featured on the home page.

    The dataset contains:

    • recipe: Numeric, unique identifier of recipe
    • calories: Numeric, number of calories
    • carbohydrate: Numeric, amount of carbohydrates in grams
    • sugar: Numeric, amount of sugar in grams
    • protein: Numeric, amount of protein in grams
    • category: Character, type of recipe. Recipes are listed in one of ten possible groupings:
      • Lunch/Snacks,
      • Beverages,
      • Potato,
      • Vegetable,
      • Meat,
      • Chicken,
      • Pork,
      • Dessert,
      • Breakfast,
      • One Dish Meal
    • servings: Numeric, number of servings for the recipe
    • high_traffic: Character, if the traffic to the site was high when this recipe was shown, this is marked with “High”.

    4. 🚀 Analysis plan

    ✅ In the first step, we performed an Exploratory Data Analysis to uncover any valuable insights related to:

    • The interactions between the recipe characteristics;
    • The interactions between the recipe characteristics and the high traffic generated on the site when a particular recipe is displayed.

    ✅ Next, we also conducted statistical significance tests to explore the relationship between the variables:

    • First, we performed target-feature tests to determine whether there is a connection between the target feature, high_traffic, and the other recipe features.
    • Second, we performed feature-feature tests to assess any relationships between the features themselves.
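    As a rough sketch of how such tests can be run with scipy, here is a toy example on an invented mini-sample (the DataFrame contents are illustrative stand-ins, not the actual recipes data):

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency, mannwhitneyu

# Hypothetical mini-sample standing in for the cleaned recipes data
df = pd.DataFrame({
    "category": ["Potato", "Potato", "Beverages", "Beverages", "Meat", "Meat"] * 10,
    "calories": np.random.default_rng(42).uniform(50, 900, 60),
    "high_traffic": ["High", "High", "Low", "Low", "High", "Low"] * 10,
})

# Target-feature test for a categorical feature: chi-square test of
# independence on the category x high_traffic contingency table
table = pd.crosstab(df["category"], df["high_traffic"])
chi2, p_cat, dof, _ = chi2_contingency(table)

# Target-feature test for a numeric feature: Mann-Whitney U comparing
# calories between the High and Low traffic groups (no normality assumed)
high = df.loc[df["high_traffic"] == "High", "calories"]
low = df.loc[df["high_traffic"] == "Low", "calories"]
stat, p_cal = mannwhitneyu(high, low)
```

    The same pattern extends to feature-feature tests: chi-square for pairs of categorical features, Mann-Whitney (or Spearman correlation) for numeric ones.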

    ✅ We then put the two candidate models through a series of steps to fit and evaluate them, including:

    • We split the dataset into training and testing sets (80:20 ratio), stratifying by the class proportions of the target variable.

    • The categorical variable was encoded using the one-hot technique and, where necessary, the input features were standardised to have a mean of 0 and a standard deviation of 1.

    • We trained the two machine learning algorithms for this classification problem.

    • We measured the models’ performance using accuracy, recall and precision.

    • We tuned the hyperparameters for the Gradient Boosting model using 5-fold stratified cross-validation.

    • Finally, we assessed feature importances and made a final selection of the features.

    The classifiers used are:

    • Logistic Regression
    • Gradient Boosting
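    The modelling steps above can be sketched roughly as follows. The data here is synthetic, the parameter grid is illustrative, and StandardScaler stands in for the mean-0/sd-1 standardisation described above:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the cleaned recipes data
rng = np.random.default_rng(0)
n = 200
X = pd.DataFrame({
    "calories": rng.uniform(50, 900, n),
    "protein": rng.uniform(0, 60, n),
    "category": rng.choice(["Potato", "Beverages", "Dessert"], n),
})
y = (X["calories"] > 400).astype(int)  # toy stand-in for high_traffic

# 80:20 split, stratified on the class proportions of the target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# One-hot encode the categorical column; standardise the numeric ones
preprocess = ColumnTransformer([
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["category"]),
    ("num", StandardScaler(), ["calories", "protein"]),
])

# Fit both classifiers and collect accuracy, precision and recall
scores = {}
for model in (LogisticRegression(max_iter=1000), GradientBoostingClassifier()):
    pipe = Pipeline([("prep", preprocess), ("model", model)])
    pipe.fit(X_train, y_train)
    pred = pipe.predict(X_test)
    scores[type(model).__name__] = (
        accuracy_score(y_test, pred),
        precision_score(y_test, pred),
        recall_score(y_test, pred),
    )

# Hyperparameter tuning for Gradient Boosting with 5-fold stratified CV
grid = GridSearchCV(
    Pipeline([("prep", preprocess), ("model", GradientBoostingClassifier())]),
    param_grid={"model__n_estimators": [50, 100], "model__max_depth": [2, 3]},
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
    scoring="precision",
)
grid.fit(X_train, y_train)
```

    Wrapping the preprocessing and the model in one Pipeline keeps the encoder and scaler fitted only on the training folds, which avoids leakage during cross-validation.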

    ✅ After this analysis, we provided some recommendations on how to increase the traffic on this site based on our results.

    5. 📚 Import packages

    import pandas as pd
    import matplotlib.pyplot as plt
    import seaborn as sns
    import missingno as msno
    import numpy as np
    
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, precision_score, recall_score, make_scorer
    
    from sklearn.preprocessing import OneHotEncoder, RobustScaler
    from sklearn.compose import ColumnTransformer
    from sklearn.linear_model import LogisticRegression
    from sklearn.ensemble import GradientBoostingClassifier
    from sklearn.model_selection import StratifiedKFold, GridSearchCV
    from sklearn.pipeline import make_pipeline, Pipeline
    
    from scipy.stats import shapiro, chi2_contingency, mannwhitneyu
    

    6. 📕 Read the dataset

    The dataset contains information that is categorized into 8 variables for a total of 947 recipes.

    # load the dataset and take a first look
    recipes = pd.read_csv('./recipe_site_traffic_2212.csv')
    recipes.head()
    recipes.shape

    7. 🧹 Data cleaning and validation

    In this step, we examined, validated and cleaned each column in the data. We performed the following tasks:

    1. Checked for any missing values in the columns and either eliminated or replaced as necessary;
    2. Checked the data types for each variable and changed them as necessary;
    3. Checked if the dataset had negative values for numeric columns and eliminated them;
    4. Checked for duplicates;
    5. Checked for categorical data problems: Can we find any inconsistencies in the values of the category column?
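    A minimal sketch of those validation checks, on a hypothetical mini-sample (the values and the inconsistent "Chicken Breast" label are invented for illustration):

```python
import pandas as pd

# Hypothetical mini-sample standing in for the raw recipes data
recipes = pd.DataFrame({
    "recipe": [1, 2, 3, 3],
    "calories": [120.0, -5.0, 210.0, 210.0],
    "category": ["Potato", "Chicken", "Chicken Breast", "Chicken Breast"],
    "servings": ["4", "6", "2", "2"],
})

# 1. missing values per column
recipes.isna().sum()

# 2. fix data types (servings arrived as text)
recipes["servings"] = pd.to_numeric(recipes["servings"], errors="coerce")

# 3. drop rows with negative values in numeric columns
recipes = recipes[recipes["calories"] >= 0]

# 4. drop exact duplicates
recipes = recipes.drop_duplicates()

# 5. harmonise inconsistent category labels (hypothetical example)
recipes["category"] = recipes["category"].replace({"Chicken Breast": "Chicken"})
```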

    7.1. Missing values

    We observed 52 missing values in each of the columns calories, carbohydrate, sugar and protein. Visualizing the missingness with the missingno library showed that, in almost all cases, the null values appeared in all four columns for a particular recipe. We decided to eliminate these rows from the dataset since they represented a small percentage of the total of 947 recipes.

    We also noticed 373 null values in the high_traffic column. According to the dataset description, the High label is assigned to recipes that received high traffic when shown on the site. We can therefore assume that any recipe without the High label had low traffic, and we will replace the null values with the label Low.

    After cleaning and validating data, the dataset contains 895 recipes.

    # how many missing values are in the dataset?
    recipes.isna().sum()
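    The two cleaning decisions above can be sketched on a toy sample (the column names match the dataset, but the values are invented):

```python
import numpy as np
import pandas as pd

# Hypothetical mini-sample: one recipe with all four nutrition columns
# missing, two recipes without the "High" traffic label
recipes = pd.DataFrame({
    "recipe": [1, 2, 3],
    "calories": [120.0, np.nan, 340.0],
    "carbohydrate": [10.0, np.nan, 35.0],
    "sugar": [2.0, np.nan, 12.0],
    "protein": [8.0, np.nan, 20.0],
    "high_traffic": ["High", None, None],
})

nutrition = ["calories", "carbohydrate", "sugar", "protein"]

# Drop recipes whose nutrition values are missing (a small share of the data)
recipes = recipes.dropna(subset=nutrition)

# Recipes without the "High" label are assumed to have had low traffic
recipes["high_traffic"] = recipes["high_traffic"].fillna("Low")
```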