Analyzing Olympics Data
  • AI Chat
  • Code
  • Report
  • Beta
    Spinner

    Analyzing Olympics Data with SQL and Python

    Welcome to your webinar workspace! Here, you can follow along as we load data from multiple sources and then answer some questions about the Olympics!

    🏃  Load in the Olympics Data

    The primary data is available in your directory in the path athlete_events.csv.

    The cell below handles imports of the required packages and data.

    # Import libraries
    import pandas as pd
    import plotly.express as px
    
    # Import the data
    olympics = pd.read_csv("athlete_events.csv")
    
    # Preview the DataFrame
    olympics

    We can inspect the data types and the number of non-null rows per column using the .info() method.

    # Inspect the DataFrame
    olympics.info()

    An easier way to inspect the number of missing values per column is to use .isna() combined with .sum().

    # Check missing values
    olympics.isna().sum()

    The missing values in the medal column are because the dataset contains all competitors (not just those who won a medal). The remaining columns with missing values are not of interest to us today.

    When exploring the data, it looked as though some of the teams had hyphens and backslashes. Let's inspect it more closely by inspecting the unique values of the column.

    By using .value_counts() combined with .to_frame(), we can inspect the unique team names by frequency inside the interactive table viewer.

    # Inspect the team column
    olympics["team"].value_counts().to_frame()

    The team column is messy and sometimes contains countries separated by forward slashes or hyphens. Let's clean this by using .str.extract() to extract the first country mentioned in the cases of slashes or hyphens (e.g., "Denmark/Sweden" becomes "Denmark").

    If you want to learn more about regular expressions in Python, check out our course on the subject!

    # Split the team column on forward slashes and hyphens
    olympics["team_clean"] = olympics["team"].str.split("[/-]").str[0]
    
    # Preview the new column
    olympics["team_clean"].unique()

    🌎  Bring in additional data

    Let's query a MariaDB database containing information on world nations to provide some additional data.

    We will store our query result as a pandas DataFrame named nations_data.

    Unknown integration
    DataFrameavailable as
    nations_data
    variable
    SELECT 
    	name AS country,
        year,
        population
    FROM countries
    INNER JOIN country_stats USING(country_id)
    This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

    We now have country data that we can join with the Olympics data! We will use the .merge() method to combine the two DataFrames using the country and year columns.

    A "left" join matches on rows in the olympics_data DataFrame, as some teams will not be present in the countries_data DataFrame.

    # Perform a left join between the two DataFrames
    olympics_full = olympics.merge(
        nations_data, left_on=["team_clean", "year"], right_on=["country", "year"], how="left"
    )
    
    # Preview our data
    olympics_full