# import necessary packages
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns
sns.set()
%matplotlib inline
from datetime import datetime as dt
import plotly.express as px
# load the dataset
df=pd.read_csv('office_supplies.csv')
df.head()
Customer Question
The management would like you to answer the following:
● Are there products that do not sell as well in some locations?
● Are there any other patterns over time in each region that you can find in the data
ASSESSING AND CLEANING
df.info()
df.describe()
# CHECKING FOR DUPLICATE VALUES
df.duplicated().any()
df[df.duplicated(keep=False)]
ASSESSMENT REPORT
- column names cotains spaces and capital letters which should be corrected for easier accessing
- The Profit column which has a lot of null values (1993) which signifies some unknown error
- The 'Order Date' column is in string type instead of date
- There is a duplicated row in the dataset
CLEANING BASED ON ASSESSMENT REPORT
ISSUE 1
# ACTION: replace spaces from all column names with underscore and format them as all lower case
#CODE :
df.columns = df.columns.str.lower().str.replace(' ','_')
#TEST
display(df.columns.str.contains(' ').any(), df.columns)
ISSSUE 2
Dealing with null values in the Price column:
There are 2 choices: fill with zero or drop them
if i drop these rows, it will reduce the dataset and maybe orders that answers the questions im asking will be missing. though the question of this analysis is about sales of products in some regions, its safe to assume that there might be a correlation between low sales and 'null' profit so dropping these null profits might just mean dropping some important orders .
if i fill with 0 theres a chance that ill be giving the wrong information as these null values might indicate that information about profits made hasn't been received yet and that doesnt necessarily mean a zero profit transaction
the best option is to fill rows with null values in the Price column with 0 and not include the Price column too much in my analysis
df[df.profit.isna() & df.discount >0]
# ACTION: fill rows with null values in the Price column with 0
#CODE:
df.profit.fillna(value=0,inplace=True)
# TEST:
df.profit.isnull().any()