Aleksey Schukin

Global Wine Markets 2015

📖 Background

With the end-of-year holidays approaching, many people like to relax or celebrate with a glass of wine. That makes wine an important industry in many countries, and understanding this market matters to the livelihood of many people.

You work at a multinational consumer goods organization that is considering entering the wine production industry. Managers at your company would like to understand the market better before making a decision.

💾 The data

This dataset is a subset of the University of Adelaide's Annual Database of Global Wine Markets.

The dataset consists of a single CSV file, data/wine.csv.

Each row in the dataset represents the wine market in one country. There are 34 metrics for the wine industry covering both the production and consumption sides of the market.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

wine = pd.read_csv("data/wine.csv")
# Give the third column a short, plot-friendly name
wine.rename(columns={wine.columns[2]: 'Vine Area'}, inplace=True)

We will start with a few standard checks of the data. Let's see how many values are missing and look at the data types.

def percent_hbar(df, old_threshold=None):
    """Plot the percentage of missing values per column as horizontal bars.

    Returns the Axes and the mean missing percentage, which is used as the threshold.
    """
    percent_of_nulls = (df.isnull().sum() / len(df) * 100).sort_values().round(2)
    threshold = percent_of_nulls.mean()
    ax = percent_of_nulls.plot(kind='barh', figsize=(20, 16),
                               title='% of NaN (from {} lines)'.format(len(df)),
                               color='#86bf91', legend=False, fontsize=17)
    ax.set_xlabel('% of NaN')
    # Bars sit at integer y positions 0, 1, 2, ...; label every non-zero bar
    # and mark columns above the mean share of NaN in red.
    for i, pct in enumerate(percent_of_nulls):
        if pct > 0:
            color = 'red' if pct > threshold else 'blue'
            ax.text(pct + 0.1, i, '{}%'.format(pct), color=color, va='center',
                    fontweight='bold', fontsize='large')
    if old_threshold is not None:
        # Previous threshold in red, current threshold in green
        ax.axvline(x=old_threshold, linewidth=1, color='r', linestyle='--')
        ax.text(old_threshold + 0.3, .10, '{0:.2%}'.format(old_threshold / 100), color='r', fontweight='bold', fontsize='large')
        ax.axvline(x=threshold, linewidth=1, color='green', linestyle='--')
        ax.text(threshold + 0.3, .7, '{0:.2%}'.format(threshold / 100), color='green', fontweight='bold', fontsize='large')
    else:
        # No previous threshold to compare against: draw the current one in red
        ax.axvline(x=threshold, linewidth=1, color='r', linestyle='--')
        ax.text(threshold + 0.3, .7, '{0:.2%}'.format(threshold / 100), color='r', fontweight='bold', fontsize='large')
    return ax, threshold

plot, threshold = percent_hbar(wine)

# For every column, collect the number of unique values and the values themselves
variables = pd.DataFrame(columns=['Variable', 'Number of unique values', 'Values'])

for i, var in enumerate(wine.columns):
    variables.loc[i] = [var, wine[var].nunique(), wine[var].unique().tolist()]
variables.set_index('Variable', inplace=True)
variables
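
The data types mentioned earlier can be checked next to the share of missing values in a single table. The following is a minimal sketch and not part of the original notebook: `summarize_columns` is a hypothetical helper, and the small `demo` frame (its column names and values) is invented for illustration and only stands in for `wine`.

```python
import numpy as np
import pandas as pd


def summarize_columns(df):
    """Return one row per column: dtype, % of missing values, number of unique values."""
    summary = pd.DataFrame({
        'dtype': df.dtypes.astype(str),
        'pct_missing': (df.isnull().mean() * 100).round(2),
        'n_unique': df.nunique(),  # NaN is not counted as a value
    })
    # Columns with the most missing data come first
    return summary.sort_values('pct_missing', ascending=False)


# Invented stand-in data; in the notebook, pass `wine` instead of `demo`.
demo = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D'],
    'Vine Area': [10.0, np.nan, 30.0, 40.0],
    'Wine produced': [1.0, 2.0, np.nan, np.nan],
})
summary = summarize_columns(demo)
print(summary)
```

Calling `summarize_columns(wine)` in the notebook would give the same three columns for the real dataset, which makes it easy to spot numeric metrics that were read in as text.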

Let's look at the correlations between the numeric variables.

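plt.figure(figsize=(30, 22))
# Correlate numeric columns only; text columns cannot be correlated
corr = wine.corr(numeric_only=True)
# Mask the upper triangle (including the diagonal) so each pair is shown once
mask = np.triu(np.ones_like(corr, dtype=bool))
heatmap = sns.heatmap(corr, mask=mask, vmin=-1, vmax=1, annot=True, cmap='Blues')
heatmap.set_title('Triangle Correlation Heatmap', fontdict={'fontsize':14}, pad=18);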
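
With more than thirty metrics, the annotated heatmap is hard to read cell by cell, so it can help to list the strongest pairs as a table. The sketch below is one possible way to do that and is not part of the original notebook: `top_correlations` is a hypothetical helper, the `demo` frame is invented data standing in for `wine`, and the `numeric_only` argument assumes pandas 1.5 or newer.

```python
from itertools import combinations

import pandas as pd


def top_correlations(df, n=5):
    """Return the n variable pairs with the largest absolute Pearson correlation."""
    corr = df.corr(numeric_only=True)
    # combinations() yields every unordered pair of columns exactly once
    rows = [(a, b, corr.loc[a, b]) for a, b in combinations(corr.columns, 2)]
    pairs = pd.DataFrame(rows, columns=['var_1', 'var_2', 'r']).dropna(subset=['r'])
    # Rank by strength of the relationship, regardless of sign
    order = pairs['r'].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(n).reset_index(drop=True)


# Invented stand-in data; in the notebook, pass `wine` instead of `demo`.
demo = pd.DataFrame({
    'Country': ['A', 'B', 'C', 'D', 'E'],
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 6, 8, 10],  # exactly 2 * x
    'z': [5, 3, 4, 1, 2],   # tends to fall as x rises
})
top = top_correlations(demo, n=3)
print(top)
```

Ranking by absolute value keeps strong negative relationships in view, which the sequential `Blues` colormap tends to hide.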