Data Analysis on Alcoholic Drinks in Russia for Creating Promotional strategy.
Alcohol consumption in Russia remains among the highest in the world. According to a 2011 report by the World Health Organization, which makes it the best place to start a beverage business. People love their drinks and the company which owns a chain of stores across Russia that sell a variety of alcoholic drinks wants to invest in marketing campaigns. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. In this project, we are going to analyze our data, fix missing values, visualize data, train the clustering model, and finally visualize our results.
Our main goal is to find the next ten locations similar to Saint Petersburg and where the wine promotion campaign became successful.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px
import os
import warnings
from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering
from sklearn.preprocessing import LabelEncoder
from scipy.cluster import hierarchy
from scipy.spatial import distance_matrix
warnings.filterwarnings('ignore')
os.listdir("data/")
np.random.seed(2021)
Data
The marketing team has sourced you with historical sales volumes per capita for several different drinks types.
- "year" - year (1998-2016)
- "region" - the name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city, and a single autonomous oblast
- "wine" - sale of wine in liters by year per capita
- "beer" - sale of beer in liters by year per capita
- "vodka" - sale of vodka in liters by year per capita
- "champagne" - sale of champagne in liters by year per capita
- "brandy" - sale of brandy in liters by year per capita
Loading Dataset
We have used pandas to load the .csv
dataset and its pretty small dataset containing yearly (1998-2016) alcohol consumption (beer, champagne, brandy, wine, vodka) per region (85).
df = pd.read_csv('./data/russian_alcohol_consumption.csv')
df.head()
There is 1615 sample which is logical as we have 19 years of data and 85 regions. The beer is leading the game as the mean value of beer is 51.3 liters by year per capita and the second-highest is vodka 11.81 liters by year per capita which is now even close to beer. This means people prefer beer as a go-to beverage. The beer also has the highest standard deviation which means that its demand is not stable and can fluctuate with time whereas champagne and brandy are a pretty safe bet if you want to start a low-risk business with the lowest standard deviation.
df.describe()
We can explore all the unique regions within the dataset.
list(df['region'].unique())
Correlation
There is a high correlation between champagne and brandy which makes it even better. If you promote champagne there will be an increase in the sale of brandy and champagne, which makes it win-win situation.
dataplot = sns.heatmap(df.corr(), cmap="YlGnBu", annot=True)
plt.show()
Missing Values
We will be using pandas dataframe background_gradient
to display several missing values and percentages. It seems like all drinks columns have missing values and the highest is brandy with 66.
NA = pd.DataFrame(
data=[
df.isna().sum().tolist(),
[
"{:.2f}".format(i) + "%"
for i in (df.isna().sum() / df.shape[0] * 100).tolist()
],
],
columns=df.columns,
index=["NA Count", "NA Percent"],
).T.sort_values(by="NA Count", ascending=False)
NA.style.background_gradient(cmap="seismic", subset=["NA Count"])
We are going to use the fillna
function and method pad
to fill missing values with previous values in a column. As we can see that there are no missing values in our dataset.
df.fillna(method="pad", inplace=True)
df.isna().sum()
Geo Location
For geolocation, we need to get coordinates to display stat on the map. For that we need:
- geopy -> Nominatim
- creating user agent to connect to geopy server.
- create
lat
andlon
function to extract latitude and longitude using the name of the place. - value count the
region
column, resetting index, and then renaming columns. - applying both functions on to
geo['region']
- export the file as "russian_geo.csv"
This process takes 5 minutes to run so we are just going to save the results in a .csv file and later merge it with our main database.
import time from geopy.geocoders import Nominatim app = Nominatim(user_agent="russia") def lat(x): time.sleep(1) try: return app.geocode(x).raw["lat"] except: return lat(x) def lon(x): time.sleep(1) try: return app.geocode(x).raw["lon"] except: return lon(x) geo = df['region'].value_counts() geo = geo.to_frame().reset_index().rename(columns={"region":"count","index":"region"}) geo['lat'] = geo['region'].apply(lat) geo['lon'] = geo['region'].apply(lon) geo.drop('count',axis=1).to_csv("data/russian_geo.csv",index=False)
- loading geo location dataset.
- merging it with main dataset.
- group by "region" and mean.
- sort values by "beer" descending.
We are going to use df_geo dataset to plot the total alcohol consumption on the Plotly map.
geo = pd.read_csv("data/russian_geo.csv")
df_geo = (
pd.merge(df, geo, on="region")
.groupby("region")
.mean()
.reset_index()
.sort_values("beer", ascending=False)
)
Map
In this section, we are going to plot the total alcohol consumption per region on the map.
- we have created a new column name "total" which sums up all the drinks columns.
- we have also created test columns that contain a caption, that will be displayed on the map.
- we are dividing our dataset into three categories based, first contains the top 10, the second contains 11th to 21st and third contains the rest of the regions. The ranking is based on Alcohol consumption per region.
- We are going to use Plotly sample code for plotting bubble maps.
The top 10 and 11-21 have no paterens. They are all over the place but we can see the pattern in the rest of the categories. You can also explore the different regions by hovering your mouse over and zooming out to observe more regions.
⚠⚠Plotly Map is working perfectly in Workspace but the published version has Javascript Bug.⚠⚠
df_geo["total"] = (
df_geo["wine"]
+ df_geo["brandy"]
+ df_geo["vodka"]
+ df_geo["beer"]
+ df_geo["champagne"]
)
df_geo["text"] = (
df_geo["region"] + "<br> Alcohol consumption:" + (df_geo["total"]).astype(str)
)
# limits = [(0,99),(100,299),(300,499),(500,999),(1000,3000)]
limits = [(0, 10), (11, 21), (22, 84)]
colors = ["royalblue", "crimson", "lightseagreen"]
scale = 500
fig = go.Figure()
for i in range(len(limits)):
lim = limits[i]
df_sub = df_geo[lim[0] : lim[1]]
fig.add_trace(
go.Scattergeo(
lon=df_sub["lon"],
lat=df_sub["lat"],
text=df_sub["text"],
marker=dict(
size=df_sub["total"],
color=colors[i],
line_color="rgb(40,40,40)",
line_width=0.5,
sizemode="area",
),
name="Top {0} - {1}".format(lim[0] + 1, lim[1]),
)
)
fig.update_layout(
title_text="Alcohol Consumption in Russia",
showlegend=True,
legend_title="Beer 1 liter/percapita",
legend_title_font_size=14,
geo=dict(
scope="world",
landcolor="rgb(217, 217, 217)",
lonaxis=dict(range=[10.6985, 100.06269]),
lataxis=dict(range=[33.72197, 80.7293]),
),
)
fig.show()
If you are not able to see the map, this is how it looks like👇
Alcohol Consumption Past Trend
In this section, we are going to explore different types of drinks and their consumption trend over the past 19 years.
As we can see beer consumption has risen with time up till 2007, then it became steady and its declining since 2011. Wine, champagne, and brandy consumption are lower than beer but they are steady. The vodka demand has increased up till 2002 and it's declining slowly and steadily with time.
The safest bet is to launch a campaign on either brandy or champagne but wine can be profitable as it has more consumption per capita and it is stable