Improving Retail Business
  • AI Chat
  • Code
  • Report
  • Beta
    Spinner

    Retail Store Sales

    Metadata

    calendar: contains information about the dates on which the products are sold

    • date: The date in a "yyyy-mm-dd" format
    • wm_yr_wk: The id of the week the date belongs to
    • weekday: The type of the day (Saturday, Sunday, ... , Friday)
    • wday: The id of the weekday, starting from Saturday
    • month: The month of the date
    • year: The year of the date
    • d: The number of date from 2011-01-29 to 2016-06-19
    • event_name_1: If the date includes an event, the name of this event
    • event_type_1: If the date includes an event, the type of this event
    • event_name_2: If the date includes an event, the name of this event
    • event_type_2: If the date includes an event, the type of this event
    • snap_CA: A binary variable (0 or 1) indicating the stores of CA allow snap purchases on the examined date
    • snap_TX: A binary variable (0 or 1) indicating the stores of TX allow snap purchases on the examined date
    • snap_WI: A binary variable (0 or 1) indicating the stores of WI allow snap purchases on the examined date SNAP stands for Supplemental Nutrition Assistance Program

    sell_prices: contains information about the price of the products sold per store and date

    • store_id: The id of the store where the product is sold
    • item_id: The id of the product
    • wm_yr_wk: The id of the week the date belong to
    • sell_price: The price of the products

    sales: contains the historical daily unit sales data per product and store

    • id: Indicate the item_id and store_id are evaluated
    • item_id: The id of the product
    • dept_id: The id of the department the product belong to
    • cat_id: The id of the category the product belongs to
    • store_id: The id of the store where the product is sold
    • state_id: The state where the store is located
    • d_1,d_2,...,d_1941: The number of units sold at day ..., starting from 2011-01-29

    image.png

    Step 1: Check the data

    I have checked the dataset by using Excel and Power BI as the lists below.

    • Typos/data entry errors
    • Outliers
    • Data types for all columns So in the Python notebook, I check the number of rows and columns and some missing values.
    
    import numpy as np
    import pandas as pd
    #Import the sales.csv, sell_prices.csv, and calendar.csv file
    sales_table = pd.read_csv("sales.csv")
    price_table = pd.read_csv("sell_prices.csv")
    calendar_table = pd.read_csv("calendar.csv")
    print(price_table)
    print(calendar_table)
    print(sales_table)
    print(sales_table.head(4))
    print(sales_table.info()) 
    #Many Nulls in event_name_1, event_name_2, event_type_1, and event_type_2 columns
    print(calendar_table.info())
    print(price_table.info())

    Step 2: Explore Data

    I explore data by questioning the dataset to find insight.

    2.1 What is the store's maximum number of sales by sum every day (from 2011-01-29 to 2016-05-22)?

    The CA_3 store has the maximum number of sales by sum every day at 11,363,540 units, and this also affects the sum of sales in CA state, which is higher than in other states.

    import matplotlib.pyplot as plt
    import pandas as pd
    
    sales = pd.read_csv("sales.csv")
    
    # Group by "store_id" and sum the "d_1" to "d_1941" columns
    grouped_sales = sales.groupby("store_id").sum()
    
    # Indicate x and y for plotting scatter chart
    x_point = grouped_sales.index
    y_point = grouped_sales.loc[:, "d_1":"d_1941"].sum(axis=1)
    
    # Plot bar chart
    plt.bar(range(len(x_point)), y_point)
    plt.xlabel("Store ID")
    plt.ylabel("The sum of sales every day")
    
    # Adjust x-axis labels
    plt.xticks(range(len(x_point)), x_point, fontsize=6)  # Adjust fontsize to make x-axis labels smaller
    
    plt.show()
    
    print(grouped_sales)
    print(y_point)

    2.2 What is the department's maximum number of sales by sum every day (from 2011-01-29 to 2016-05-22)?

    The FOOD_3 department has the maximum number of sales by sum every day at 32,937,002 units, and this also affects the sum of sales in items in the department, which is higher than in other items.

    import matplotlib.pyplot as plt
    import pandas as pd
    
    sales = pd.read_csv("sales.csv")
    
    # Group by "dept_id" and sum the "d_1" to "d_1941" columns
    grouped_sales = sales.groupby("dept_id").sum()
    
    # Indicate x and y for plotting scatter chart
    x_point = grouped_sales.index
    y_point = grouped_sales.loc[:, "d_1":"d_1941"].sum(axis=1)
    
    # Plot bar chart
    plt.bar(range(len(x_point)), y_point)
    plt.xlabel("Department ID")
    plt.ylabel("The sum of sale every day")
    
    # Adjust x-axis labels
    plt.xticks(range(len(x_point)), x_point, fontsize=6)  # Adjust fontsize to make x-axis labels smaller
    
    plt.show()
    
    print(grouped_sales)
    print(y_point)

    2.3 If the FOODS_3 department has the maximum sale unit value, when is the maximum frequency people in three states like buying food?

    After I randomized the results, almost every day people like buying the food, which probably matches when the government assists the people with SNAP.

    import matplotlib.pyplot as plt
    import numpy as np
    import pandas as pd
    
    sales = pd.read_csv("sales.csv")
    
    def find_max_sum(grouped_sales):
        # Group by "dept_id" and sum the "d_1" to "d_1941" columns
        grouped_sales = sales.groupby("dept_id").sum()
    
        # Calculate the sum of each column
        sum_per_column = grouped_sales.loc[grouped_sales.index == "FOODS_3", "d_1":"d_1941"].sum()
    
        # Find the maximum value and its corresponding column name
        max_sum = sum_per_column.nlargest(30)
        max_column = sum_per_column.idxmax()
    
        return max_column, max_sum
    
    max_column, max_sum = find_max_sum(sales)
    max_column, max_sum
    import pandas as pd
    
    calendar = pd.read_csv("calendar.csv")
    
    # Find the rows where d is equal to d_954, d_1934, d_961, and d_1136
    matching_rows = calendar[calendar["d"].isin(["d_954", "d_1934", "d_961", "d_1136", "d_948", "d_960", "d_953","d_940","d_982","d_1892","d_1884","d_1864","d_1164","d_981","d_1617","d_947","d_1898","d_1010","d_1252","d_1261","d_492","d_499","d_1835","d_946","d_435","d_1079","d_1177","d_1941","d_1080","d_926"])]
    
    # Select the d and SNAP_CA, SNAP_TX, SNAP_WI columns from the matching rows
    matching_values = matching_rows[["d", "snap_CA", "snap_TX", "snap_WI"]]
    
    matching_values

    From the dataset, The dates that the government assists people in each state are

    • CA State 1st–10th of every month
    • TX State 1st–15th of every month
    • WI State 2nd–15th of every month