Beta
Retail Store Sales
Metadata
calendar: contains information about the dates on which the products are sold
- date: The date in a "yyyy-mm-dd" format
- wm_yr_wk: The id of the week the date belongs to
- weekday: The type of the day (Saturday, Sunday, ... , Friday)
- wday: The id of the weekday, starting from Saturday
- month: The month of the date
- year: The year of the date
- d: The number of date from 2011-01-29 to 2016-06-19
- event_name_1: If the date includes an event, the name of this event
- event_type_1: If the date includes an event, the type of this event
- event_name_2: If the date includes an event, the name of this event
- event_type_2: If the date includes an event, the type of this event
- snap_CA: A binary variable (0 or 1) indicating the stores of CA allow snap purchases on the examined date
- snap_TX: A binary variable (0 or 1) indicating the stores of TX allow snap purchases on the examined date
- snap_WI: A binary variable (0 or 1) indicating the stores of WI allow snap purchases on the examined date SNAP stands for Supplemental Nutrition Assistance Program
sell_prices: contains information about the price of the products sold per store and date
- store_id: The id of the store where the product is sold
- item_id: The id of the product
- wm_yr_wk: The id of the week the date belong to
- sell_price: The price of the products
sales: contains the historical daily unit sales data per product and store
- id: Indicate the item_id and store_id are evaluated
- item_id: The id of the product
- dept_id: The id of the department the product belong to
- cat_id: The id of the category the product belongs to
- store_id: The id of the store where the product is sold
- state_id: The state where the store is located
- d_1,d_2,...,d_1941: The number of units sold at day ..., starting from 2011-01-29
Step 1: Check the data
I have checked the dataset by using Excel and Power BI as the lists below.
- Typos/data entry errors
- Outliers
- Data types for all columns So in the Python notebook, I check the number of rows and columns and some missing values.
import numpy as np
import pandas as pd
#Import the sales.csv, sell_prices.csv, and calendar.csv file
sales_table = pd.read_csv("sales.csv")
price_table = pd.read_csv("sell_prices.csv")
calendar_table = pd.read_csv("calendar.csv")
print(price_table)
print(calendar_table)
print(sales_table)
print(sales_table.head(4))
print(sales_table.info())
#Many Nulls in event_name_1, event_name_2, event_type_1, and event_type_2 columns
print(calendar_table.info())
print(price_table.info())
Step 2: Explore Data
I explore data by questioning the dataset to find insight.
2.1 What is the store's maximum number of sales by sum every day (from 2011-01-29 to 2016-05-22)?
The CA_3 store has the maximum number of sales by sum every day at 11,363,540 units, and this also affects the sum of sales in CA state, which is higher than in other states.
import matplotlib.pyplot as plt
import pandas as pd
sales = pd.read_csv("sales.csv")
# Group by "store_id" and sum the "d_1" to "d_1941" columns
grouped_sales = sales.groupby("store_id").sum()
# Indicate x and y for plotting scatter chart
x_point = grouped_sales.index
y_point = grouped_sales.loc[:, "d_1":"d_1941"].sum(axis=1)
# Plot bar chart
plt.bar(range(len(x_point)), y_point)
plt.xlabel("Store ID")
plt.ylabel("The sum of sales every day")
# Adjust x-axis labels
plt.xticks(range(len(x_point)), x_point, fontsize=6) # Adjust fontsize to make x-axis labels smaller
plt.show()
print(grouped_sales)
print(y_point)
2.2 What is the department's maximum number of sales by sum every day (from 2011-01-29 to 2016-05-22)?
The FOOD_3 department has the maximum number of sales by sum every day at 32,937,002 units, and this also affects the sum of sales in items in the department, which is higher than in other items.
import matplotlib.pyplot as plt
import pandas as pd
sales = pd.read_csv("sales.csv")
# Group by "dept_id" and sum the "d_1" to "d_1941" columns
grouped_sales = sales.groupby("dept_id").sum()
# Indicate x and y for plotting scatter chart
x_point = grouped_sales.index
y_point = grouped_sales.loc[:, "d_1":"d_1941"].sum(axis=1)
# Plot bar chart
plt.bar(range(len(x_point)), y_point)
plt.xlabel("Department ID")
plt.ylabel("The sum of sale every day")
# Adjust x-axis labels
plt.xticks(range(len(x_point)), x_point, fontsize=6) # Adjust fontsize to make x-axis labels smaller
plt.show()
print(grouped_sales)
print(y_point)
2.3 If the FOODS_3 department has the maximum sale unit value, when is the maximum frequency people in three states like buying food?
After I randomized the results, almost every day people like buying the food, which probably matches when the government assists the people with SNAP.
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
sales = pd.read_csv("sales.csv")
def find_max_sum(grouped_sales):
# Group by "dept_id" and sum the "d_1" to "d_1941" columns
grouped_sales = sales.groupby("dept_id").sum()
# Calculate the sum of each column
sum_per_column = grouped_sales.loc[grouped_sales.index == "FOODS_3", "d_1":"d_1941"].sum()
# Find the maximum value and its corresponding column name
max_sum = sum_per_column.nlargest(30)
max_column = sum_per_column.idxmax()
return max_column, max_sum
max_column, max_sum = find_max_sum(sales)
max_column, max_sum
import pandas as pd
calendar = pd.read_csv("calendar.csv")
# Find the rows where d is equal to d_954, d_1934, d_961, and d_1136
matching_rows = calendar[calendar["d"].isin(["d_954", "d_1934", "d_961", "d_1136", "d_948", "d_960", "d_953","d_940","d_982","d_1892","d_1884","d_1864","d_1164","d_981","d_1617","d_947","d_1898","d_1010","d_1252","d_1261","d_492","d_499","d_1835","d_946","d_435","d_1079","d_1177","d_1941","d_1080","d_926"])]
# Select the d and SNAP_CA, SNAP_TX, SNAP_WI columns from the matching rows
matching_values = matching_rows[["d", "snap_CA", "snap_TX", "snap_WI"]]
matching_values
From the dataset, The dates that the government assists people in each state are
- CA State 1st–10th of every month
- TX State 1st–15th of every month
- WI State 2nd–15th of every month