Flu Shot Learning
Load training data
import pandas as pd
training_features = pd.read_csv('flu-training-features.csv')
training_labels = pd.read_csv('flu-training-labels.csv')
test_features = pd.read_csv('flu-test-features.csv')
training_features.head(10)
training_labels.head(10)
test_features.head(10)
training_features.info()
The main observations from the info() output are:
- A high proportion of null values in the health_insurance, employment_industry and employment_occupation columns.
- Several categorical features are stored with the object dtype and will need to be transformed into numeric values before a machine learning algorithm can use them. As shown below, the number of categories in each varies between 2 and 23.
training_features.describe(include=[object])
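The observations above can be quantified directly rather than read off the info() output. The following is a minimal sketch on a made-up toy frame: toy_features and its values are illustrative assumptions, not the real survey data, and only the column names are borrowed from it. The same two expressions should apply unchanged to training_features.

```python
import pandas as pd

# Toy stand-in for training_features (hypothetical values, real column names).
toy_features = pd.DataFrame({
    'respondent_id': [0, 1, 2, 3],
    'health_insurance': [1.0, None, None, 0.0],
    'employment_industry': ['a', None, None, None],
    'age_group': ['18 - 34 Years', '65+ Years', '65+ Years', '35 - 44 Years'],
})

# Fraction of missing values per column, largest first.
null_fraction = toy_features.isna().mean().sort_values(ascending=False)

# Number of distinct categories in each non-numeric column (NaN is not counted).
n_categories = toy_features.select_dtypes(exclude='number').nunique()

print(null_fraction)
print(n_categories)
```

Sorting the null fractions puts the sparsest columns at the top, which makes it easier to decide which ones need imputation or a dedicated "missing" category.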
Before doing any analysis we must first merge the training features and labels.
training_table = training_features.merge(training_labels, on='respondent_id')
training_table.head()
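The merge above relies on each respondent_id appearing exactly once in both tables. One way to check that, sketched here on hypothetical toy frames (toy_features and toy_labels are assumptions for illustration), is the validate argument of merge, which raises a MergeError if the key is duplicated on either side.

```python
import pandas as pd

# Toy stand-ins for the features and labels tables (made-up values).
toy_features = pd.DataFrame({
    'respondent_id': [0, 1, 2],
    'age_group': ['a', 'b', 'a'],
})
toy_labels = pd.DataFrame({
    'respondent_id': [0, 1, 2],
    'h1n1_vaccine': [0, 1, 0],
    'seasonal_vaccine': [1, 1, 0],
})

# validate='one_to_one' raises pandas.errors.MergeError if respondent_id
# is duplicated in either frame.
toy_table = toy_features.merge(
    toy_labels, on='respondent_id', how='inner', validate='one_to_one'
)

# With an inner join, a row count equal to the features table means
# no respondent was dropped for lack of a label.
rows_preserved = len(toy_table) == len(toy_features)
print(rows_preserved)
```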
Exploratory Data Analysis
With a large number of features, and potentially a lot of collinearity among them, it is useful to perform EDA to identify variables that can be combined and those that appear to have the strongest relationship with vaccine uptake.
First, we'll look at the proportion of people who get either the H1N1 or seasonal vaccine.
import matplotlib.pyplot as plt
%matplotlib inline
fig, ax = plt.subplots(2, 1, sharex='all')
n_obs = training_labels.shape[0]
training_labels['h1n1_vaccine'].value_counts().div(n_obs).plot.barh(title='Proportion of H1N1 Vaccine', ax=ax[0])
ax[0].set_ylabel('h1n1_vaccine')
training_labels['seasonal_vaccine'].value_counts().div(n_obs).plot.barh(title='Proportion of Seasonal Vaccine', ax=ax[1])
ax[1].set_ylabel('seasonal_vaccine')
fig.tight_layout()
These graphs show that while the proportion of people getting the seasonal vaccine is close to 50%, only about 20% receive the H1N1 vaccine.
Below, we use the crosstab function to view the percentage of people who received each combination of the two vaccines (one, both, or neither), along with the overall totals for each vaccine.
print(round((pd.crosstab(training_labels['h1n1_vaccine'], training_labels['seasonal_vaccine'],
margins=True, rownames=['h1n1_vaccine'], colnames=['seasonal_vaccine'], normalize=True) * 100), 2))
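To make the layout of that table easier to read, here is a small sketch of the same crosstab call on ten made-up respondents (toy_labels and its values are assumptions, not the survey data). With normalize=True every cell is a share of all respondents, and margins=True adds an 'All' row and column holding the per-vaccine totals.

```python
import pandas as pd

# Ten hypothetical respondents: 4 got neither vaccine, 3 seasonal only,
# 1 H1N1 only, and 2 got both.
toy_labels = pd.DataFrame({
    'h1n1_vaccine':     [0, 0, 0, 0, 0, 1, 1, 1, 0, 0],
    'seasonal_vaccine': [0, 0, 0, 0, 1, 1, 1, 0, 1, 1],
})

# Joint percentages over all respondents, with marginal totals in 'All'.
joint_pct = pd.crosstab(
    toy_labels['h1n1_vaccine'],
    toy_labels['seasonal_vaccine'],
    margins=True,
    normalize=True,
).mul(100).round(2)

print(joint_pct)
```

Passing normalize='index' instead would express each row as a share of that row, which answers a different question: among people who did (or did not) get the H1N1 vaccine, what fraction got the seasonal one.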