Food Delivery App
A food delivery app has just hired you as a data analyst. It coordinates orders from different restaurants to customers in New York City. It has been in operation for only a month and needs more visibility into its business.
The founder would like to know what insights you can extract from the data. For example:
- Are there many repeat customers?
- Do repeat customers like to try different cuisines, or do they have favorite restaurant types?
- Is there a relationship between how long it takes to deliver a meal and the customer's rating?
They would also like to know your recommendations based on what you find. What does the data suggest their next steps should be?
Source of dataset.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("food_order.csv")
df
The dataset has 1,898 rows and 9 columns. Each column needs a closer look to confirm that the data is clean.
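A first pass at checking cleanliness could look something like the sketch below. It runs on a tiny made-up table rather than the real file, so the values are purely illustrative; only the column names are borrowed from the dataset.

```python
import pandas as pd

# Tiny made-up stand-in for the orders table (illustrative values only).
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 3],
    "customer_id": [10, 11, 10, 10],
    "delivery_time": [20, None, 30, 30],
})

# Data type of each column.
dtypes = sample.dtypes
# Number of missing values per column.
missing_per_column = sample.isna().sum()
# Number of rows that exactly repeat an earlier row.
duplicate_rows = int(sample.duplicated().sum())

print(dtypes)
print(missing_per_column)
print("Duplicate rows:", duplicate_rows)
```

The same three checks can be run on df directly once the file is loaded.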
print('There are', df.customer_id.nunique(),
      'unique customers - this will be examined in more detail later.',
      'As expected, there are', df.order_id.nunique(),
      "unique order IDs - one for each row.")
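One possible way to look at repeat customers is to count orders per customer_id. The sketch below uses made-up IDs, and the variable names (orders_per_customer, repeat_customers, repeat_share) are hypothetical, not part of the original analysis.

```python
import pandas as pd

# Made-up orders: customer 10 ordered three times, 11 twice, 12 once.
sample = pd.DataFrame({
    "order_id": [1, 2, 3, 4, 5, 6],
    "customer_id": [10, 11, 10, 12, 10, 11],
})

# Number of orders placed by each customer.
orders_per_customer = sample["customer_id"].value_counts()
# Customers with more than one order.
repeat_customers = orders_per_customer[orders_per_customer > 1]
# Share of all customers who ordered more than once.
repeat_share = len(repeat_customers) / len(orders_per_customer)

print(repeat_customers)
print(f"{repeat_share:.0%} of customers ordered more than once")
```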
Several of these restaurant names contain garbled characters or stray annotations such as "(closed)". These should be fixed:
# Restaurant names as they appear in the data, paired with their cleaned versions
badNames = ['Big Wong Restaurant \x8c_¤¾Ñ¼', 'Empanada Mama (closed)',
            'Chipotle Mexican Grill $1.99 Delivery', "Joe's Shanghai \x8e_À\x8eü£¾÷´",
            'Dirty Bird To Go (archived)', 'CafÌ© China']
goodNames = ['Big Wong Restaurant', 'Empanada Mama', 'Chipotle Mexican Grill',
             "Joe's Shanghai", 'Dirty Bird To Go', 'Cafe China']

# Replace each bad name with its corrected counterpart (exact matches only)
df['restaurant_name'] = df['restaurant_name'].replace(dict(zip(badNames, goodNames)))

# Standardize names: trim whitespace and capitalize
# (note that capitalize() lowercases every character after the first)
df['restaurant_name'] = df['restaurant_name'].str.strip().str.capitalize()

# Check the restaurant names again to verify the changes
df.restaurant_name.unique()
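Rather than spotting bad names by eye, candidates for review could be flagged automatically. The sketch below is one rough heuristic, not the method used above: it marks names that contain non-ASCII characters, a parenthetical note, or a dollar sign. A legitimate name with accented characters would also be flagged, so the result is a list to review by hand.

```python
import pandas as pd

# A few names taken from the lists above, some clean and some not.
names = pd.Series([
    "Big Wong Restaurant \x8c_¤¾Ñ¼",
    "Empanada Mama (closed)",
    "Chipotle Mexican Grill $1.99 Delivery",
    "Dirty Bird To Go",
    "Cafe China",
])

# True where the name contains any character outside the ASCII range.
has_non_ascii = names.map(lambda s: not s.isascii())
# True where the name contains a parenthetical note or a dollar sign.
has_annotation = names.str.contains(r"\(.*\)|\$", regex=True)

# Names worth a manual look.
suspect = names[has_non_ascii | has_annotation]
print(suspect)
```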
This looks normal.
This appears to be reasonable as well.
The day_of_the_week column also appears to be fine.
The rating column would be better stored as a nullable integer, with non-numeric entries converted to NA.
# Convert ratings to numbers; non-numeric entries become NA in a nullable integer column
df.rating = pd.to_numeric(df.rating, errors='coerce').astype('Int64')
print(df.rating.dtype, df.rating.unique())
print(df.rating.isna().sum())
print(df.rating.describe())
There is an unfortunately high number of NA values in this column. This can be examined in more detail later.
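To put a number on how many ratings are missing, the share of NA values can be computed alongside the count. The sketch below uses a made-up series in which "not rated" is a hypothetical placeholder for a non-numeric entry; the real file may use a different label.

```python
import pandas as pd

# Made-up raw ratings; "not rated" is a hypothetical non-numeric placeholder.
raw = pd.Series(["5", "not rated", "4", "not rated", "3"], name="rating")

# Same conversion idea as above: non-numeric entries become NA.
rating = pd.to_numeric(raw, errors="coerce").astype("Int64")

# Count and share of missing ratings.
missing_count = int(rating.isna().sum())
missing_share = float(rating.isna().mean())
# The mean skips NA values by default.
mean_rating = float(rating.mean())

print(f"{missing_count} of {len(rating)} ratings are missing ({missing_share:.0%})")
print("Mean of the available ratings:", mean_rating)
```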
It would also be helpful to have another calculated column: total_time_from_order.
df['total_time_from_order'] = df.delivery_time + df.food_preparation_time
df.total_time_from_order.describe()
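With the total time in place, a first look at the founder's question about time and ratings might go along these lines. This is only a sketch on made-up numbers: it drops unrated orders, compares the average total time per rating, and computes a Pearson correlation, which is one of several reasonable choices for an ordinal rating.

```python
import pandas as pd

# Made-up orders; one rating is missing, as in the real column.
sample = pd.DataFrame({
    "food_preparation_time": [20, 25, 30, 35, 22],
    "delivery_time": [15, 20, 25, 30, 18],
    "rating": pd.array([5, 4, None, 3, 5], dtype="Int64"),
})
sample["total_time_from_order"] = (
    sample["delivery_time"] + sample["food_preparation_time"]
)

# Keep only orders that were actually rated.
rated = sample.dropna(subset=["rating"])

# Average total time for each rating value.
mean_time_by_rating = rated.groupby("rating")["total_time_from_order"].mean()

# Pearson correlation between total time and rating.
correlation = rated["total_time_from_order"].corr(rated["rating"].astype(float))

print(mean_time_by_rating)
print("Correlation between total time and rating:", round(correlation, 2))
```

A negative correlation here would mean that longer waits tend to go with lower ratings, but with so many ratings missing in the real data, any such result should be read cautiously.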