Flu Shot Learning

    Load training data

    import pandas as pd
    
    training_features = pd.read_csv('flu-training-features.csv')
    training_labels = pd.read_csv('flu-training-labels.csv')
    test_features = pd.read_csv('flu-test-features.csv')
    
    training_features.head(10)
    training_labels.head(10)
    test_features.head(10)
    training_features.info()

    The main observations from the info are:

    • A high proportion of null values in the health_insurance, employment_industry and employment_occupation columns.
    • Several categorical features are stored as objects and will need to be transformed into numeric values before a machine learning algorithm can use them.
      • As shown below, the number of categories in each varies between 2 and 23.
    training_features.describe(include=[object])
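    The two observations above can also be checked numerically rather than read off the info output. The following is a minimal sketch on a made-up stand-in for training_features (toy_features and its values are hypothetical; only the column names health_insurance and employment_industry come from the data described above).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for training_features; the values are made up.
toy_features = pd.DataFrame({
    'respondent_id': [0, 1, 2, 3],
    'health_insurance': [1.0, np.nan, np.nan, 0.0],
    'employment_industry': pd.Series(['a', None, None, None], dtype=object),
    'age_group': pd.Series(['young', 'old', 'old', 'middle'], dtype=object),
})

# Fraction of missing values per column, largest first.
null_fraction = toy_features.isna().mean().sort_values(ascending=False)
print(null_fraction)

# Number of distinct categories in each object-typed column (missing values are ignored).
category_counts = toy_features.select_dtypes(include=['object']).nunique()
print(category_counts)
```

    Applied to the real table, the same two expressions would rank columns by missingness and list the category count for each object column.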

    Before doing any analysis we must first merge the training features and labels.

    training_table = training_features.merge(training_labels, on='respondent_id')
    
    training_table.head()
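    Because merge performs an inner join by default, rows whose respondent_id appears in only one table are dropped silently. A possible sanity check, sketched here on hypothetical miniature tables (toy_features, toy_labels and example_feature are invented for illustration), uses the validate argument of merge and compares row counts.

```python
import pandas as pd

# Hypothetical miniature versions of the two tables; values are made up.
toy_features = pd.DataFrame({
    'respondent_id': [0, 1, 2],
    'example_feature': [1.0, 3.0, 2.0],
})
toy_labels = pd.DataFrame({
    'respondent_id': [0, 1, 2],
    'h1n1_vaccine': [0, 0, 1],
    'seasonal_vaccine': [0, 1, 1],
})

# validate='one_to_one' raises pandas.errors.MergeError if respondent_id
# is duplicated in either table.
toy_table = toy_features.merge(toy_labels, on='respondent_id', validate='one_to_one')

# An inner merge drops unmatched ids without warning, so compare row counts.
assert len(toy_table) == len(toy_features) == len(toy_labels)
print(toy_table.shape)
```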

    Exploratory Data Analysis

    With a large number of features, and potentially a lot of collinearity between them, it is useful to perform EDA to identify variables that can be amalgamated and those that appear to have the strongest relationship with vaccine uptake.
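    One way to look for both things is a correlation matrix. This is only a sketch under stated assumptions: the frame toy, its column names and the 0.9 threshold are all hypothetical, and Pearson correlation is just one of several measures that could be used here.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric features; column names and values are made up.
toy = pd.DataFrame({
    'feature_a': [1, 2, 3, 4, 5, 6],
    'feature_b': [2, 4, 6, 8, 10, 13],  # almost a multiple of feature_a
    'feature_c': [3, 1, 2, 3, 1, 2],
    'label': [0, 0, 0, 1, 1, 1],
})

features = toy.drop(columns='label')
corr = features.corr()

# Keep the strict upper triangle so that each pair appears only once.
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
collinear_pairs = pairs[pairs.abs() > 0.9]  # 0.9 is an arbitrary threshold
print(collinear_pairs)

# Absolute correlation of each feature with the label, strongest first.
label_corr = features.corrwith(toy['label']).abs().sort_values(ascending=False)
print(label_corr)
```

    Pairs that show up in collinear_pairs are candidates for amalgamation, while label_corr gives a rough first ranking of features against the target.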

    First, we'll look at the proportion of people who get either the H1N1 or seasonal vaccine.

    import matplotlib.pyplot as plt
    %matplotlib inline
    fig, ax = plt.subplots(2, 1, sharex='all')
    
    n_obs = training_labels.shape[0]
    
    training_labels['h1n1_vaccine'].value_counts().div(n_obs).plot.barh(title='Proportion of H1N1 Vaccine', ax=ax[0])
    ax[0].set_ylabel('h1n1_vaccine')
    
    training_labels['seasonal_vaccine'].value_counts().div(n_obs).plot.barh(title='Proportion of Seasonal Vaccine', ax=ax[1])
    ax[1].set_ylabel('seasonal_vaccine')
    
    fig.tight_layout()

    These graphs show that while close to 50% of respondents received the seasonal vaccine, only about 20% received the H1N1 vaccine.
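    The same proportions can be obtained as numbers instead of being read from the bars. A minimal sketch on made-up labels (toy_labels is hypothetical, so the printed values are not those of the real data):

```python
import pandas as pd

# Hypothetical labels; the values are made up.
toy_labels = pd.DataFrame({
    'h1n1_vaccine': [0, 0, 0, 0, 1],
    'seasonal_vaccine': [0, 1, 0, 1, 1],
})

# normalize=True returns fractions instead of counts.
h1n1_share = toy_labels['h1n1_vaccine'].value_counts(normalize=True)
print(h1n1_share)

# For a 0/1 column, the mean equals the share of ones.
uptake = toy_labels[['h1n1_vaccine', 'seasonal_vaccine']].mean()
print(uptake)
```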

    Below, we use the crosstab function to view the percentage of people who received only one of the vaccines, both, or neither.

    print(round((pd.crosstab(training_labels['h1n1_vaccine'], training_labels['seasonal_vaccine'], 
                       margins=True, rownames=['h1n1_vaccine'], colnames=['seasonal_vaccine'], normalize=True) * 100), 2))
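    To make the layout of that table easier to read, here is the same call on a small made-up sample (toy_labels is hypothetical). With normalize=True every cell is a share of all respondents, and the All margins hold the totals for each vaccine. The second table, using normalize='index', is an addition for illustration: it gives the seasonal uptake within each H1N1 group.

```python
import pandas as pd

# Hypothetical labels for eight respondents; the values are made up.
toy_labels = pd.DataFrame({
    'h1n1_vaccine': [0, 0, 0, 0, 1, 1, 0, 0],
    'seasonal_vaccine': [0, 1, 0, 1, 1, 1, 0, 0],
})

# Joint percentages: each cell is a share of all respondents.
joint = pd.crosstab(toy_labels['h1n1_vaccine'], toy_labels['seasonal_vaccine'],
                    margins=True, normalize=True).mul(100).round(2)
print(joint)

# Row-wise percentages: seasonal uptake within each H1N1 group.
by_h1n1 = pd.crosstab(toy_labels['h1n1_vaccine'], toy_labels['seasonal_vaccine'],
                      normalize='index').mul(100).round(2)
print(by_h1n1)
```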