Isaac Awotwe
‌
‌
‌
‌
‌
‌
‌
‌
‌
‌
‌
‌
‌
‌
Sign up
Beta
Spinner

Where should a drinks company run promotions?

📖 Background

Your company owns a chain of stores across Russia that sell a variety of alcoholic drinks. The company recently ran a drinks promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. The marketing team would like to target 10 other regions that have similar buying habits to Saint Petersburg where they would expect the promotion to be similarly successful.

The data

The marketing team has sourced you with historical sales volumes per capita for several different drinks types.

  • "year" - year (1998-2016)
  • "region" - name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city and a single autonomous oblast
  • "wine" - sale of wine in litres by year per capita
  • "beer" - sale of beer in litres by year per capita
  • "vodka" - sale of vodka in litres by year per capita
  • "champagne" - sale of champagne in litres by year per capita
  • "brandy" - sale of brandy in litres by year per capita
import pandas as pd
df = pd.read_csv(r'./data/russian_alcohol_consumption.csv')
df.head()

💪 Competition challenge

  1. Recommend 10 additional regions they should select for the promotion.
  2. Tell the story that supports your recommendations.

1. Data Cleaning and Validation

1.1 Data Type Check

Do the columns have the correct data types?

print(df.dtypes)

The results above show that all 7 columns have the data types we would expect.

1.2 Data Ranges Check

We want to make sure that the two discrete columns - "year" and "region" - contain the correct range of values.

print("The unique entries in the \"year\" column:\n", df.year.unique())
print("number of unique entries in the \"year\" column: ", df.year.nunique())

There are 19 unique entries in the "year" column and they all look appropriate.

Next, we will count the number of unique entries in the "region" column as is.
Subsequently, we will apply some text formatting - convert the strings to lower case and also remove any leading and trailing spaces - and then count again.
The intent is to ensure that all the entries in the "region" column are indeed unique.

print("number of unique values in the 'region' columnn without text formatting: ", df.region.nunique())

#check to make sure there are no deplicate regions
striped=df[["region"]]
striped["region"]=striped["region"].str.strip()
print("number of unique values in the 'region' columnn after changing texts to lower case and removing leading and/or trailing spaces: ", striped.region.str.lower().nunique())

The results above is a way of gaining some confidence that the values in the "region" column are indeed unique. Notice that even after removing any leading and trailing spaces (this does not mean there are any, but it is a quality assurance step) and converting all strings to lower case, the count of unique values is still 85.
Next, we will inspect the values in the region column by printing them out.

print(df.region.unique())

1.3 Data Structure Check

‌
‌
‌
  • AI Chat
  • Code