Skip to content

Retail Store Sales

Metadata

calendar: contains information about the dates on which the products are sold

  • date: The date in a "yyyy-mm-dd" format
  • wm_yr_wk: The id of the week the date belongs to
  • weekday: The type of the day (Saturday, Sunday, ... , Friday)
  • wday: The id of the weekday, starting from Saturday
  • month: The month of the date
  • year: The year of the date
  • d: The number of date from 2011-01-29 to 2016-06-19
  • event_name_1: If the date includes an event, the name of this event
  • event_type_1: If the date includes an event, the type of this event
  • event_name_2: If the date includes an event, the name of this event
  • event_type_2: If the date includes an event, the type of this event
  • snap_CA: A binary variable (0 or 1) indicating the stores of CA allow snap purchases on the examined date
  • snap_TX: A binary variable (0 or 1) indicating the stores of TX allow snap purchases on the examined date
  • snap_WI: A binary variable (0 or 1) indicating the stores of WI allow snap purchases on the examined date SNAP stands for Supplemental Nutrition Assistance Program

sell_prices: contains information about the price of the products sold per store and date

  • store_id: The id of the store where the product is sold
  • item_id: The id of the product
  • wm_yr_wk: The id of the week the date belong to
  • sell_price: The price of the products

sales: contains the historical daily unit sales data per product and store

  • id: Indicate the item_id and store_id are evaluated
  • item_id: The id of the product
  • dept_id: The id of the department the product belong to
  • cat_id: The id of the category the product belongs to
  • store_id: The id of the store where the product is sold
  • state_id: The state where the store is located
  • d_1,d_2,...,d_1941: The number of units sold at day ..., starting from 2011-01-29

image.png

Step 1: Check the data

I have checked the dataset by using Excel and Power BI as the lists below.

  • Typos/data entry errors
  • Outliers
  • Data types for all columns So in the Python notebook, I check the number of rows and columns and some missing values.

import numpy as np
import pandas as pd
#Import the sales.csv, sell_prices.csv, and calendar.csv file
sales_table = pd.read_csv("sales.csv")
price_table = pd.read_csv("sell_prices.csv")
calendar_table = pd.read_csv("calendar.csv")
print(price_table)
print(calendar_table)
print(sales_table)
print(sales_table.head(4))
print(sales_table.info()) 
#Many Nulls in event_name_1, event_name_2, event_type_1, and event_type_2 columns
print(calendar_table.info())
print(price_table.info())

Step 2: Explore Data

I explore data by questioning the dataset to find insight.

2.1 What is the store's maximum number of sales by sum every day (from 2011-01-29 to 2016-05-22)?

The CA_3 store has the maximum number of sales by sum every day at 11,363,540 units, and this also affects the sum of sales in CA state, which is higher than in other states.

import matplotlib.pyplot as plt
import pandas as pd

sales = pd.read_csv("sales.csv")

# Group by "store_id" and sum the "d_1" to "d_1941" columns
grouped_sales = sales.groupby("store_id").sum()

# Indicate x and y for plotting scatter chart
x_point = grouped_sales.index
y_point = grouped_sales.loc[:, "d_1":"d_1941"].sum(axis=1)

# Plot bar chart
plt.bar(range(len(x_point)), y_point)
plt.xlabel("Store ID")
plt.ylabel("The sum of sales every day")

# Adjust x-axis labels
plt.xticks(range(len(x_point)), x_point, fontsize=6)  # Adjust fontsize to make x-axis labels smaller

plt.show()

print(grouped_sales)
print(y_point)

2.2 What is the department's maximum number of sales by sum every day (from 2011-01-29 to 2016-05-22)?

The FOOD_3 department has the maximum number of sales by sum every day at 32,937,002 units, and this also affects the sum of sales in items in the department, which is higher than in other items.

import matplotlib.pyplot as plt
import pandas as pd

sales = pd.read_csv("sales.csv")

# Group by "dept_id" and sum the "d_1" to "d_1941" columns
grouped_sales = sales.groupby("dept_id").sum()

# Indicate x and y for plotting scatter chart
x_point = grouped_sales.index
y_point = grouped_sales.loc[:, "d_1":"d_1941"].sum(axis=1)

# Plot bar chart
plt.bar(range(len(x_point)), y_point)
plt.xlabel("Department ID")
plt.ylabel("The sum of sale every day")

# Adjust x-axis labels
plt.xticks(range(len(x_point)), x_point, fontsize=6)  # Adjust fontsize to make x-axis labels smaller

plt.show()

print(grouped_sales)
print(y_point)

2.3 If the FOODS_3 department has the maximum sale unit value, when is the maximum frequency people in three states like buying food?

After I randomized the results, almost every day people like buying the food, which probably matches when the government assists the people with SNAP.

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

sales = pd.read_csv("sales.csv")

def find_max_sum(grouped_sales):
    # Group by "dept_id" and sum the "d_1" to "d_1941" columns
    grouped_sales = sales.groupby("dept_id").sum()

    # Calculate the sum of each column
    sum_per_column = grouped_sales.loc[grouped_sales.index == "FOODS_3", "d_1":"d_1941"].sum()

    # Find the maximum value and its corresponding column name
    max_sum = sum_per_column.nlargest(30)
    max_column = sum_per_column.idxmax()

    return max_column, max_sum

max_column, max_sum = find_max_sum(sales)
max_column, max_sum
import pandas as pd

calendar = pd.read_csv("calendar.csv")

# Find the rows where d is equal to d_954, d_1934, d_961, and d_1136
matching_rows = calendar[calendar["d"].isin(["d_954", "d_1934", "d_961", "d_1136", "d_948", "d_960", "d_953","d_940","d_982","d_1892","d_1884","d_1864","d_1164","d_981","d_1617","d_947","d_1898","d_1010","d_1252","d_1261","d_492","d_499","d_1835","d_946","d_435","d_1079","d_1177","d_1941","d_1080","d_926"])]

# Select the d and SNAP_CA, SNAP_TX, SNAP_WI columns from the matching rows
matching_values = matching_rows[["d", "snap_CA", "snap_TX", "snap_WI"]]

matching_values

From the dataset, The dates that the government assists people in each state are

  • CA State 1st–10th of every month
  • TX State 1st–15th of every month
  • WI State 2nd–15th of every month