What's In a Name?


    In this live code-along, we will explore a rich dataset of first names of babies born in the US over a period of more than 100 years! This surprisingly simple dataset can help us uncover many interesting stories, and that is exactly what we are going to do.

    %%capture
    !pip install html_table_extractor wget wquantiles 

    Import Libraries

    # Import modules and functions
    import numpy as np
    import pandas as pd
    from wquantiles import quantile
    
    # Plotting libraries
    %config InlineBackend.figure_format='retina'
    import matplotlib.pyplot as plt
    plt.rcParams["figure.figsize"] = (14, 6)
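    # Matplotlib 3.6 renamed 'seaborn-darkgrid' to 'seaborn-v0_8-darkgrid'; use whichever is available
    style = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
    plt.style.use(style)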
    import plotly.express as px

    Read Data

    Let us start by reading the babynames data in names.csv.gz.

    babynames = pd.read_csv("names.csv.gz")
    babynames.head()

    Explore Data

    Although this dataset only has 4 columns, there are so many interesting questions one could explore. While the possibilities are endless, I have chosen six interesting questions, out of which we will explore a subset based on interest levels.

    1. Popular Names: What are the most popular names?
    2. Trendy Names: What are the most trendy names?
    3. Unisex Names: What are the most unisex names?
    4. Length of Names: How has the length of names changed over the years?
    5. Letters in Names: What are the most common first and last letters?
    6. Estimate Age from Name: How can we estimate a person's age from their name?

    Popular Names

    One of the first things we want to do is to understand naming trends. Let us start by figuring out the top five most popular male and female names for this decade (born 2011 and later). Do you want to make any guesses? Go on, be a sport!!

    # Get the five most popular male and female names of the decade starting in 2011.
    popular_names = (
      babynames
        .query('year > 2010')
        .groupby(['sex', 'name'])
        [['births']]
        .sum()
        .sort_values(by=['sex', 'births'], ascending=False)
        .groupby('sex')
        .head(5)
        .reset_index()
    )
    popular_names
    
    # Plot a horizontal bar plot of number of births by name, faceted by sex
    fig = px.bar(
        popular_names.iloc[::-1],
        x="births",
        y="name",
        facet_row="sex",
        height=500
    )
    fig.update_layout(
        title = "Most Popular Names of the Decade in the US",
        yaxis_title = None,
        yaxis2_title = None,
        yaxis2_matches=None,
        xaxis_title = 'Number of Births'
    )
    
    # Plot the trend in the number of babies with a specific name
    pd.options.plotting.backend='plotly'
    # from ipywidgets import interact
    # @interact(name=popular_names.name)
    def plot_trends(name):
        fig = (
          babynames
            .query('name == @name')
            .sort_values(by='year')
            .reset_index(drop=True)
            .plot.line(x="year", y="births", color="sex")
        )
        fig.update_layout(
            title=f'Trend in the Name: {name}',
            xaxis_title=None,
            yaxis_title='Number of Births'
        )
        return fig
    
    plot_trends("Richie")

    Trendy Names

    A stable name is one whose proportion across years does not vary drastically, while a trendy name is one whose popularity peaks for a short period and then dies down. There are many ways to capture trendiness. A simple measure is the maximum proportion of births for a name, normalized by the sum of its proportions of births across years. For example, if the name Joe had the proportions 0.1, 0.2, 0.1, 0.1, then the trendiness measure would be 0.2/(0.1 + 0.2 + 0.1 + 0.1), which equals 0.4. Let us use this idea to figure out the top 10 trendy names in this dataset, with at least 5000 births.

    # Get the top trendy names with more than 5000 births.
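    One way this cell could be filled in is sketched below. It is not the code-along's official solution: it assumes babynames has the columns year, sex, name and births used earlier, and it assumes that "proportion" means a name's share of all births of the same sex in a given year. The function name trendiness, its min_births parameter and the toy data are made up for illustration; the toy data reproduces the Joe example above.

```python
import pandas as pd

def trendiness(names: pd.DataFrame, min_births: int = 5000) -> pd.DataFrame:
    """Score each (name, sex) pair as max yearly proportion / sum of yearly proportions."""
    df = names.copy()
    # Assumption: proportion = share of that year's births (within a sex) given this name.
    df["prop"] = df["births"] / df.groupby(["year", "sex"])["births"].transform("sum")
    scores = df.groupby(["name", "sex"]).agg(
        total_births=("births", "sum"),
        max_prop=("prop", "max"),
        sum_prop=("prop", "sum"),
    )
    # Keep names with at least `min_births` births overall, then compute the score.
    scores = scores[scores["total_births"] >= min_births].assign(
        trendiness=lambda d: d["max_prop"] / d["sum_prop"]
    )
    return scores.sort_values("trendiness", ascending=False).reset_index()

# Toy data (hypothetical): Joe's yearly proportions are 0.1, 0.2, 0.1, 0.1.
toy = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003] * 2,
    "sex": ["M"] * 8,
    "name": ["Joe"] * 4 + ["Other"] * 4,
    "births": [10, 20, 10, 10, 90, 80, 90, 90],
})
toy_scores = trendiness(toy, min_births=1)
print(toy_scores[["name", "trendiness"]])
```

    On the real data, something like trendiness(babynames).head(10) would list the ten highest-scoring names. A name used in only one year scores 1.0, and a name with a constant proportion over n years scores 1/n.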

    Unisex Names

    Some names are commonly used for both sexes. Let us dive into the data and figure out the most popular unisex names. Let us assume that a name is considered unisex if more than 33% of the babies with that name are male and more than 33% are female. We can tweak these thresholds later to see whether they reveal a different set of names!

    # Get names with more than 33% male and 33% females, sorted by births.
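    A minimal sketch of one possible approach follows. It is not the code-along's official solution: it assumes babynames has the columns sex, name and births, and that sex is coded as "F" and "M". The function name unisex_names, its threshold parameter and the toy data are hypothetical.

```python
import pandas as pd

def unisex_names(names: pd.DataFrame, threshold: float = 0.33) -> pd.DataFrame:
    """Return names whose female and male shares both exceed `threshold`."""
    # Total births per name, one column per sex (assumed codes: "F" and "M").
    counts = (
        names.pivot_table(index="name", columns="sex", values="births",
                          aggfunc="sum", fill_value=0)
             .reindex(columns=["F", "M"], fill_value=0)
    )
    total = counts["F"] + counts["M"]
    share_f = counts["F"] / total
    share_m = counts["M"] / total
    keep = (share_f > threshold) & (share_m > threshold)
    result = (
        counts.assign(births=total, pct_female=share_f, pct_male=share_m)
              .loc[keep]
              .sort_values("births", ascending=False)
              .reset_index()
    )
    result.columns.name = None  # drop the leftover "sex" label on the column index
    return result

# Toy data (hypothetical): Casey and Riley are split between the sexes.
toy = pd.DataFrame({
    "year": [2000, 2000, 2000, 2000, 2000, 2000, 2001, 2000],
    "sex": ["F", "M", "F", "M", "M", "F", "F", "M"],
    "name": ["Casey", "Casey", "Emma", "Emma", "Liam", "Riley", "Riley", "Riley"],
    "births": [60, 40, 100, 1, 200, 20, 20, 30],
})
result = unisex_names(toy)
print(result)
```

    On the real data, unisex_names(babynames).head(10) would show the ten most common names that pass the threshold; passing a different threshold makes it easy to try other cut-offs.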