Workspace
Abid Ali Awan/

Data Analysis on Alcoholic Drinks in Russia for Creating Promotional strategy.

0
Beta
Spinner

Data Analysis on Alcoholic Drinks in Russia for Creating Promotional strategy.

Alcohol consumption in Russia remains among the highest in the world. According to a 2011 report by the World Health Organization, which makes it the best place to start a beverage business. People love their drinks and the company which owns a chain of stores across Russia that sell a variety of alcoholic drinks wants to invest in marketing campaigns. The company recently ran a wine promotion in Saint Petersburg that was very successful. Due to the cost to the business, it isn’t possible to run the promotion in all regions. In this project, we are going to analyze our data, fix missing values, visualize data, train the clustering model, and finally visualize our results.

Our main goal is to find the next ten locations similar to Saint Petersburg and where the wine promotion campaign became successful.

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

import os
import warnings

from sklearn.cluster import KMeans
from sklearn.cluster import AgglomerativeClustering 

from sklearn.preprocessing import LabelEncoder

from scipy.cluster import hierarchy 
from scipy.spatial import distance_matrix 

warnings.filterwarnings('ignore')
os.listdir("data/")
np.random.seed(2021)

Data

The marketing team has sourced you with historical sales volumes per capita for several different drinks types.

  • "year" - year (1998-2016)
  • "region" - the name of a federal subject of Russia. It could be oblast, republic, krai, autonomous okrug, federal city, and a single autonomous oblast
  • "wine" - sale of wine in liters by year per capita
  • "beer" - sale of beer in liters by year per capita
  • "vodka" - sale of vodka in liters by year per capita
  • "champagne" - sale of champagne in liters by year per capita
  • "brandy" - sale of brandy in liters by year per capita

Loading Dataset

We have used pandas to load the .csv dataset and its pretty small dataset containing yearly (1998-2016) alcohol consumption (beer, champagne, brandy, wine, vodka) per region (85).

df = pd.read_csv('./data/russian_alcohol_consumption.csv')
df.head()

There is 1615 sample which is logical as we have 19 years of data and 85 regions. The beer is leading the game as the mean value of beer is 51.3 liters by year per capita and the second-highest is vodka 11.81 liters by year per capita which is now even close to beer. This means people prefer beer as a go-to beverage. The beer also has the highest standard deviation which means that its demand is not stable and can fluctuate with time whereas champagne and brandy are a pretty safe bet if you want to start a low-risk business with the lowest standard deviation.

df.describe()

We can explore all the unique regions within the dataset.

list(df['region'].unique())

Correlation

There is a high correlation between champagne and brandy which makes it even better. If you promote champagne there will be an increase in the sale of brandy and champagne, which makes it win-win situation.

dataplot = sns.heatmap(df.corr(), cmap="YlGnBu", annot=True)
plt.show()

Missing Values

We will be using pandas dataframe background_gradient to display several missing values and percentages. It seems like all drinks columns have missing values and the highest is brandy with 66.

NA = pd.DataFrame(
    data=[
        df.isna().sum().tolist(),
        [
            "{:.2f}".format(i) + "%"
            for i in (df.isna().sum() / df.shape[0] * 100).tolist()
        ],
    ],
    columns=df.columns,
    index=["NA Count", "NA Percent"],
).T.sort_values(by="NA Count", ascending=False)
NA.style.background_gradient(cmap="seismic", subset=["NA Count"])

We are going to use the fillna function and method pad to fill missing values with previous values in a column. As we can see that there are no missing values in our dataset.

df.fillna(method="pad", inplace=True)
df.isna().sum()