Ramnath Vaidyanathan

[CodeAlong] Visualizations in Python

Visualization in Python

One of the best ways to improve your data visualization skills is to replicate great visualizations you come across. In this live code-along, we will recreate two visualizations using Python so that you can take your data visualization skills to the next level.

import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle
%config InlineBackend.figure_format = 'retina'
import seaborn as sns
import numpy as np

⚾ Strikeouts in Baseball

The first visualization we will try to replicate is a sports piece published by the New York Times in 2012. It is a beautiful visualization illustrating how strikeouts were on the rise. It shows strikeouts per game for each team as well as the average strikeouts per game across the whole league. Read the original article for more context, and study the visualization carefully before attempting to replicate it.

The data for this visualization comes from an excellent database compiled by Sean Lahman that contains complete batting and pitching statistics from 1871 to 2020, plus fielding statistics, standings, team stats, managerial records, post-season data, and more.

teams = pd.read_csv('teams.csv')[['yearID', 'franchID', 'name', 'G', 'SO']] 
teams.head()
# Precompute data
# Compute team level SOG
team_sog = (
  teams
    .query('yearID >= 1900')
    .assign(SOG = lambda d: d.SO / d.G)
)

# Compute SOG for Boston Red Sox
red_sox_sog = (
  team_sog
    .query('name == "Boston Red Sox"')
)

# Compute Avg SOG for the league
league_sog = (
  team_sog
    .groupby('yearID', as_index=False)
    .agg(SOG = ('SOG', 'mean'))
)
league_sog.head()
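The league line above is the unweighted mean of the team-level values. The sketch below uses made-up numbers (the `toy` table is hypothetical and only borrows the column names from `teams.csv`) to show that calculation next to one alternative, total strikeouts divided by total games, which weights each team by the games it played. Which of the two the original article used is not stated here, so treat the comparison as an illustration only.

```python
import pandas as pd

# Toy data with the same column names as the teams table (values are made up).
toy = pd.DataFrame({
    "yearID": [2000, 2000, 2001, 2001],
    "name": ["A", "B", "A", "B"],
    "G": [100, 50, 100, 100],
    "SO": [500, 400, 600, 700],
})

# Team-level strikeouts per game.
toy_sog = toy.assign(SOG=lambda d: d.SO / d.G)

# League average as the unweighted mean of the team-level SOG values.
mean_of_teams = toy_sog.groupby("yearID", as_index=False).agg(SOG=("SOG", "mean"))

# Alternative: pooled strikeouts over pooled games, weighting teams by games played.
totals = toy.groupby("yearID", as_index=False)[["SO", "G"]].sum()
pooled = totals.assign(SOG=lambda d: d.SO / d.G)[["yearID", "SOG"]]

print(mean_of_teams)
print(pooled)
```

The two numbers agree when every team plays the same number of games in a season and can drift apart when they do not.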
# Plot data
# Plot setup
plt.rcParams['figure.figsize'] = (16, 6)
plt.style.use('fivethirtyeight')

# Add a scatter plot layer for all SOGs.
plt.scatter(
    team_sog['yearID'],
    team_sog['SOG'],
    color = 'gray',
    alpha = 0.2
)

# Add a line plot layer for a specific team (Boston Red Sox, Orange Line)
plt.plot(
    red_sox_sog['yearID'],
    red_sox_sog['SOG'],
    color = 'orange',
    marker = 'o'
)

# Add a line plot layer for the entire league (Blue Line)
plt.plot(
    league_sog['yearID'],
    league_sog['SOG'],
    color = 'steelblue',
    marker = 'o'
)

# Change axis limits (the baseline at y = 0 is drawn once, further down)
plt.ylim(-0.1, 10)

# Add text annotation layer: xy is the point being annotated, xytext is where the label sits
plt.annotate(
    "US Enters World War I", 
    xytext=(1902, 1), 
    xy=(1917, 4),  # the US entered the war in 1917
    arrowprops=dict(
      color='black',
      arrowstyle='-'
    )
)

# Increase tick label size
plt.tick_params(axis = 'both', which = 'major', labelsize = 16)

# Draw a baseline at y = 0
plt.axhline(y = 0, color = 'black', linewidth = 1.3, alpha = .7)

# Add title and subtitle; x and y are data coordinates, so y > 10 places the text above the axes
plt.text(x=1885, y=12, s = "Strikeouts on the Rise", fontsize=24);
plt.text(x=1885, y=11.2, s = "There were more strikeouts in 2020 than at any other time in major league history.", fontsize=16);
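The annotation and title calls above position text in data coordinates. As a minimal sketch with made-up data (the event label, coordinates, and title are placeholders, not values from the article), the block below shows how `xy` and `xytext` relate in `annotate`, and one alternative way to place a left-aligned title that does not depend on the axis limits.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

fig, ax = plt.subplots(figsize=(8, 3))
ax.plot([1900, 1950, 2000], [3, 4, 6], color="steelblue")

# xy is the data point being annotated; xytext is where the label text is drawn.
# The arrowprops dict draws a plain connecting line between the two.
note = ax.annotate(
    "Example event",
    xy=(1950, 4),
    xytext=(1910, 5.5),
    arrowprops=dict(color="black", arrowstyle="-"),
)

# A title attached to the axes stays in place even if the axis limits change.
title = ax.set_title("Example title", loc="left", fontsize=14)

fig.canvas.draw()
```

Placing the title with `plt.text` in data coordinates, as the code-along does, gives finer control over spacing but has to be revisited whenever the limits change.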

🦠 COVID Cases by State

The second visualization we will try to replicate is also from the New York Times. It was published on March 21, 2020 to visualize the spread of COVID by state. Read the original article to get a better understanding.

You will need two datasets to replicate this plot. The first, provided by the New York Times, is a time series of COVID cases by state and date. The second maps each state to x-y coordinates on a grid; use it to place each state's panel in the right cell.
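To make the grid idea concrete, here is a small sketch of a tile layout built with `GridSpec`. The `tiles` table and its coordinates are hypothetical (the real positions come from `state_coords.csv`), and it assumes zero-based `x` and `y` values with `y` increasing downward.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical tile positions; the real coordinates come from state_coords.csv.
tiles = pd.DataFrame({
    "abbrev": ["WA", "ID", "OR", "CA"],
    "x": [0, 1, 0, 0],
    "y": [0, 0, 1, 2],
})

fig = plt.figure(figsize=(4, 6))
gs = fig.add_gridspec(
    nrows=int(tiles["y"].max()) + 1,
    ncols=int(tiles["x"].max()) + 1,
)

panels = {}
for tile in tiles.to_dict(orient="records"):
    # GridSpec is indexed [row, column], so y selects the row and x the column.
    ax = fig.add_subplot(gs[int(tile["y"]), int(tile["x"])])
    ax.set_xticks([])
    ax.set_yticks([])
    ax.set_title(tile["abbrev"], fontsize=10)
    panels[tile["abbrev"]] = ax
```

Cells with no tile are simply left empty, which is what gives the layout its rough map shape.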

# COVID Cases by State
covid_cases = pd.read_csv("https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-states.csv")
covid_cases.head()
covid_cases_start_dates = (
  covid_cases
    .assign(date = lambda d: pd.to_datetime(d.date))
    .groupby('state', as_index=False)
    [['date']]
    .agg(date_min = ('date', 'min'))
    .assign(date_min_all = lambda d: d.date_min.min())
    .assign(first_day = lambda d: (d.date_min - d.date_min_all) / np.timedelta64(1, 'D'))
)
covid_cases_start_dates.head()
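The `first_day` column is the number of days between each state's first reported case and the earliest report in the whole file. A toy version of the same calculation, with made-up rows in the shape of `us-states.csv` (the `toy_cases` and `toy_start` names are hypothetical), makes the result easy to check by hand. It uses `.dt.days` in place of dividing by `np.timedelta64(1, 'D')`, which yields integers rather than floats.

```python
import pandas as pd

# Toy long-format data in the same shape as us-states.csv (values are made up).
toy_cases = pd.DataFrame({
    "date": ["2020-01-21", "2020-01-22", "2020-01-24", "2020-01-25"],
    "state": ["Washington", "Washington", "Illinois", "Illinois"],
    "cases": [1, 1, 1, 2],
})

toy_start = (
    toy_cases
    .assign(date=lambda d: pd.to_datetime(d.date))
    .groupby("state", as_index=False)
    .agg(date_min=("date", "min"))
    # Days between each state's first report and the earliest report overall.
    .assign(first_day=lambda d: (d.date_min - d.date_min.min()).dt.days)
)

print(toy_start)
```

Here Washington gets `first_day` 0 and Illinois gets 3, since its first row is dated three days later.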
# Grid Coordinates for States
# Source: https://github.com/hrbrmstr/statebins/blob/master/R/aaa.R
state_coords = pd.read_csv('state_coords.csv').query('abbrev != "NYC"')
state_coords.head()
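# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6; use whichever name is available
plt.style.use('seaborn-v0_8' if 'seaborn-v0_8' in plt.style.available else 'seaborn')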
plt.rcParams['figure.figsize'] = (20, 20)
fig = plt.figure()
# Use GridSpec for customising layout
gs = fig.add_gridspec(nrows=13, ncols=13)
for state in state_coords.to_dict(orient='records'):
    ax = fig.add_subplot(gs[state['y'], state['x']])
    ax.axes.xaxis.set_visible(False)
    ax.axes.yaxis.set_visible(False)
    state_name = state["state"]
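    # Select this state's rows and parse dates so the x-axis is a true time axis
    d = (
      covid_cases
        .query('state == @state_name')
        .assign(date = lambda df: pd.to_datetime(df.date))
    )
    # Offset (in days) of this state's first report; not used below, but handy for aligning panels
    first_day = (
        covid_cases_start_dates
          .query('state == @state_name')
          .first_day
    )
    ax.plot(d['date'], d['cases'], linewidth=1.2)
    # Share x and y limits across panels so the states are directly comparable
    ax.set_xlim(pd.to_datetime(covid_cases.date.min()), pd.to_datetime(covid_cases.date.max()))
    ax.set_ylim(-1, covid_cases.cases.max())
    # Place the label in axes coordinates (near the top-left corner of the panel)
    ax.text(x=0.05, y=0.8, s=state['abbrev'], transform=ax.transAxes, fontweight='bold', fontsize='large')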

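# `cases` is a cumulative count, so difference it per state first if the panels should show daily new cases
plt.suptitle("Number of New Cases Each Day", fontsize=24, fontweight='bold')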
fig.tight_layout()
plt.show()
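The title refers to new cases each day, while the `cases` column in the NYT file is, as far as its documentation describes, a running total. One possible way to derive daily new cases is a per-state difference, sketched here on made-up rows (`toy_cases`, `toy_new`, and the `new_cases` column are hypothetical names, not part of the dataset).

```python
import pandas as pd

# Toy cumulative counts for two states (values are made up).
toy_cases = pd.DataFrame({
    "date": pd.to_datetime(
        ["2020-03-01", "2020-03-02", "2020-03-03", "2020-03-01", "2020-03-02"]
    ),
    "state": ["A", "A", "A", "B", "B"],
    "cases": [1, 4, 9, 2, 2],
})

toy_new = (
    toy_cases
    .sort_values(["state", "date"])
    .assign(
        # Difference the running total within each state; the first row of a
        # state has no predecessor, so fall back to its cumulative value.
        new_cases=lambda d: d.groupby("state")["cases"].diff().fillna(d["cases"])
    )
)

print(toy_new)
```

Passing `new_cases` instead of `cases` to `ax.plot` would then draw daily counts. Real cumulative series are sometimes revised downward, which would show up as negative values after differencing.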