[CodeAlong] Visualizations in Python

    Visualization in Python

    One of the best ways to improve your data visualization skills is to replicate great visualizations you come across. In this live code-along, we will recreate two New York Times graphics in Python so that you can take your data visualization skills to the next level.

    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib.patches import Rectangle
    %config InlineBackend.figure_format = 'retina'
    import seaborn as sns
    import numpy as np

    ⚾ Strikeouts in Baseball

    The first visualization we will try to replicate is a sports piece published by the New York Times in 2012. It illustrates how strikeouts were on the rise, showing strikeouts per game for each team alongside the average for the whole league. Read the original article to get more context, and analyze the visualization carefully before attempting to replicate it.

    The data for this visualization comes from an excellent database compiled by Sean Lahman that contains complete batting and pitching statistics from 1871 to 2020, plus fielding statistics, standings, team stats, managerial records, post-season data, and more.

    teams = pd.read_csv('teams.csv')[['yearID', 'franchID', 'name', 'G', 'SO']]
    teams.head()

    # Precompute data
    # Compute team-level strikeouts per game (SOG)
    team_sog = (
      teams
        .query('yearID >= 1900')
        .assign(SOG = lambda d: d.SO / d.G)
    )
    
    # Filter SOG for the Boston Red Sox (matching on `name` only keeps seasons played under that exact name)
    red_sox_sog = (
      team_sog
        .query('name == "Boston Red Sox"')
    )
    
    # Compute the league-average SOG per year (mean of the team-level values)
    league_sog = (
      team_sog
        .groupby('yearID', as_index=False)
        .agg(SOG = ('SOG', 'mean'))
    )
    league_sog.head()
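
    The league line above averages the per-team ratios. A pooled ratio (total strikeouts divided by total games) is another reasonable definition, and the two can differ when teams play different numbers of games. The sketch below compares them on a small hypothetical table; the team names and numbers are made up for illustration and are not taken from the Lahman data.

```python
import pandas as pd

# Hypothetical two-team season (illustrative values only)
toy = pd.DataFrame({
    "yearID": [2000, 2000],
    "name": ["Team A", "Team B"],
    "G": [100, 50],
    "SO": [500, 400],
})

# Team-level strikeouts per game, as in the code-along
toy = toy.assign(SOG=lambda d: d.SO / d.G)

per_year = (
    toy
    .groupby("yearID", as_index=False)
    .agg(
        mean_of_team_sog=("SOG", "mean"),  # what the code-along computes
        total_so=("SO", "sum"),
        total_g=("G", "sum"),
    )
    # Pooled alternative: all strikeouts over all team-games
    .assign(pooled_sog=lambda d: d.total_so / d.total_g)
)
print(per_year)
```

    When every team plays about the same number of games in a season, the two columns should be close, so either choice is likely fine for this chart.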
    # Plot data
    # Plot setup
    plt.rcParams['figure.figsize'] = (16, 6)
    plt.style.use('fivethirtyeight')
    
    # Add a scatter plot layer for all SOGs.
    plt.scatter(
        team_sog['yearID'],
        team_sog['SOG'],
        color = 'gray',
        alpha = 0.2
    )
    
    # Add a line plot layer for a specific team (Boston Red Sox, Orange Line)
    plt.plot(
        red_sox_sog['yearID'],
        red_sox_sog['SOG'],
        color = 'orange',
        marker = 'o'
    )
    
    # Add a line plot layer for the entire league (Blue Line)
    plt.plot(
        league_sog['yearID'],
        league_sog['SOG'],
        color = 'steelblue',
        marker = 'o'
    )
    
    # Change axis limits and draw a baseline at y = 0
    plt.ylim(-0.1, 10)
    plt.axhline(y=0, color='black')
    
    # Add text annotation layer (the US entered World War I in 1917)
    plt.annotate(
        "US Enters World War I",
        xytext=(1902, 1),
        xy=(1917, 4),
        arrowprops=dict(
          color='black',
          arrowstyle='-'
        )
    )
    
    # Increase tick label size
    plt.tick_params(axis = 'both', which = 'major', labelsize = 16)
    
    # Add title, subtitle etc.
    plt.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .7)
    plt.text(x=1885, y=12, s = "Strikeouts on the Rise", fontsize=24);
    plt.text(x=1885, y=11.2, s = "There were more strikeouts in 2020 than at any other time in major league history.", fontsize=16);
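
    The title and subtitle above are positioned in data coordinates (year 1885, 12 strikeouts per game), so they move whenever the axis limits change. One alternative, sketched here on hypothetical values, is to place the text in axes-fraction coordinates with `transform=ax.transAxes`, where (0, 0) is the bottom-left corner of the axes and (1, 1) the top-right. The specific offsets 1.18 and 1.06 are assumptions chosen for this small figure.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot([1900, 1960, 2020], [3, 5, 8])  # hypothetical values, not real data
ax.set_ylim(-0.1, 10)

# y > 1 places the text above the top edge of the axes
title = ax.text(0, 1.18, "Strikeouts on the Rise",
                transform=ax.transAxes, fontsize=14)
subtitle = ax.text(0, 1.06, "Positioned relative to the axes, not the data",
                   transform=ax.transAxes, fontsize=10)
```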

    🦠 COVID Cases by State

    The second visualization we will try to replicate is also from the New York Times. It was published on March 21, 2020 and shows the spread of COVID by state. Read the original article to get a better understanding.

    You will need two datasets to replicate this plot. The first, provided by the New York Times, is a time series of cumulative COVID case counts by state and date. The second maps each state to x-y coordinates on a grid; use it to place each state's panel in the right cell.
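
    The placement idea can be sketched with a hypothetical four-state coordinate table rather than the real mapping. The column names `abbrev`, `x` and `y` and the zero-based coordinates are assumptions for this example. Each row selects one cell of a Matplotlib GridSpec, with `y` as the row (0 at the top) and `x` as the column.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs anywhere
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical grid positions (illustrative, not the real state mapping)
toy_coords = pd.DataFrame({
    "abbrev": ["AK", "ME", "WA", "NY"],
    "x": [0, 3, 0, 2],
    "y": [0, 0, 1, 1],
})

fig = plt.figure(figsize=(6, 3))
gs = fig.add_gridspec(
    nrows=int(toy_coords["y"].max()) + 1,
    ncols=int(toy_coords["x"].max()) + 1,
)

panels = {}
for row in toy_coords.to_dict(orient="records"):
    # gs[row, column] picks the grid cell for this state
    ax = fig.add_subplot(gs[int(row["y"]), int(row["x"])])
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(row["abbrev"], fontsize=9)
    panels[row["abbrev"]] = ax
```

    Cells without a matching row simply stay empty, which is what gives the grid its rough map-like shape.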

    # COVID Cases by State
    covid_cases = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
    covid_cases.head()
    covid_cases_start_dates = (
      covid_cases
        .assign(date = lambda d: pd.to_datetime(d.date))
        .groupby('state', as_index=False)
        [['date']]
        .agg(date_min = ('date', 'min'))
        .assign(date_min_all = lambda d: d.date_min.min())
        .assign(first_day = lambda d: (d.date_min - d.date_min_all) / np.timedelta64(1, 'D'))
    )
    covid_cases_start_dates.head()
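
    The `cases` column in the New York Times state file is a running total, so a chart of new cases per day needs a per-state difference first. A minimal sketch on hypothetical rows in the same `date`, `state`, `cases` layout; the numbers are made up and the `new_cases` column name is an assumption.

```python
import pandas as pd

# Hypothetical cumulative counts (illustrative values only)
toy_cases = pd.DataFrame({
    "date": ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-02", "2020-03-03"],
    "state": ["Washington"] * 3 + ["New York"] * 2,
    "cases": [2, 5, 9, 1, 4],
})

daily = (
    toy_cases
    .assign(date=lambda d: pd.to_datetime(d.date))
    .sort_values(["state", "date"])
    .assign(
        # Difference within each state; the first row of a state has no
        # predecessor, so it falls back to that day's cumulative count
        new_cases=lambda d: d.groupby("state")["cases"].diff().fillna(d["cases"])
    )
)
print(daily)
```

    With real data, later corrections to the cumulative totals can produce negative differences, so some clipping or smoothing may be needed before plotting.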
    # Grid Coordinates for States
    # Source: https://github.com/hrbrmstr/statebins/blob/master/R/aaa.R
    state_coords = pd.read_csv('state_coords.csv').query('abbrev != "NYC"')
    state_coords.head()
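    # 'seaborn' was renamed to 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
    plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')
    plt.rcParams['figure.figsize'] = (20, 20)
    fig = plt.figure()
    # Use GridSpec for customising layout
    gs = fig.add_gridspec(nrows=13, ncols=13)
    # Parse dates once so every panel shares the same time axis
    covid_cases['date'] = pd.to_datetime(covid_cases['date'])
    max_cases = covid_cases.cases.max()
    for state in state_coords.to_dict(orient='records'):
        ax = fig.add_subplot(gs[state['y'], state['x']])
        ax.xaxis.set_visible(False)
        ax.yaxis.set_visible(False)
        state_name = state["state"]
        d = (
          covid_cases
            .query('state == @state_name')
        )
        # Days between the first US case and this state's first case
        first_day = (
            covid_cases_start_dates
              .query('state == @state_name')
              .first_day
        )
        ax.plot(d['date'], d['cases'], linewidth=1.2)
        ax.set_xlim(covid_cases['date'].min(), covid_cases['date'].max())
        ax.set_ylim(-1, max_cases)
        # Position the label in axes coordinates so it stays in the top-left corner
        ax.text(x=0.05, y=0.8, s=state['abbrev'], transform=ax.transAxes, fontweight='bold', fontsize='large')
    
    # The `cases` column is cumulative, so the title says so
    plt.suptitle("Cumulative Cases by State", fontsize=24, fontweight=2)
    fig.tight_layout()
    plt.show()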