Reporting on sales data
📖 Background
You work in the accounting department of a company that sells motorcycle parts. The company operates three warehouses in a large metropolitan area.
You’ve recently learned data manipulation and plotting, and suggest helping your colleague analyze past sales data. Your colleague wants to capture sales by payment method. She also needs to know the average unit price for each product line.
💾 The data
The team assembled the following file:
The sales data has the following fields:
- "date" - The date, from June to August 2021.
- "warehouse" - The company operates three warehouses: North, Central, and West.
- "client_type" - There are two types of customers: Retail and Wholesale.
- "product_line" - Type of products purchased.
- "quantity" - How many items were purchased.
- "unit_price" - Price per item sold.
- "total" - Total sale = quantity * unit_price.
- "payment" - How the client paid: Cash, Credit card, Transfer.
💪 Challenge
Create a report to answer your colleague's questions. Include:
- What are the total sales for each payment method?
- What is the average unit price for each product line?
- Create plots to visualize findings for questions 1 and 2.
- [Optional] Investigate further (e.g., average purchase value by client type, total purchase value by product line, etc.)
- Summarize your findings.
load libraries
library(dplyr)
library(ggplot2)
load the dataset
df <- readr::read_csv('./data/sales_data.csv')
head(df)
Data cleaning
Missing values
sapply(df,function(x) sum(is.na(x)))
There are no missing values. So no need to worry.
Duplicates
sum(duplicated(df))
And also, there are no duplicates. We've unique records.
Feature engineering and data preprocessing
-Extract the month from the date and create a new variable called, 'Month' for future works.
df$Month=as.integer(substr(df$date,6,7))
for(i in 1:length(df$Month)){
if(df$Month[i]==6){
df$Month[i]='June'
}else if(df$Month[i]==7){
df$Month[i]='July'
}else{
df$Month[i]='August'
}
}
Analysis
Total sales for each payment method
-Let's check Total sales for each payment method
t1=df %>%
group_by(payment) %>%
summarize(total_sales = sum(total))
t1
The highest total sales (159642.3 dollars) were paid through the payment method 'transfer'. Most of them are not using cash. Using the following plot, you can get a better understanding of that.
qplot(t1$payment,
t1$total_sales,
geom="col",
xlab="Payment Method",
ylab="Total Sales")
‌
‌