Merge DataFrames
  • AI Chat
  • Code
  • Report
  • Beta
    Spinner

    Merge DataFrames

    When working with multiple datasets, you'll often have to join datasets to analyze data holistically. This template uses pandas' merge() function to join two DataFrames on one or more columns. These are skills covered in DataCamp's Joining Data with pandas course if you want to learn more!

    import pandas as pd
    
    df1 = pd.read_csv("data/df1.csv")
    df1.head()
    df2 = pd.read_csv("data/df2.csv")
    df2.head()

    Choose a merging strategy

    There are five types of joins available in pandas: inner, outer, cross, left, or right. You'll have to decide on the join type depending on your desired outcome. If you need a refresher on which to pick, check out chapter two of DataCamp's Joining Data with pandas course.

    Joining on column(s) with the same name

    The code below joins two DataFrames on columns sharing the same name in both DataFrames.

    join1 = pd.merge(
        left=df1,  # Specify DataFrame on the left to merge
        right=df2,  # Specify DataFrame on the right to merge
        how="left",  # Choose 'inner', 'outer', 'cross' 'left' or 'right'
        on=["uid"],  # List of column(s) to merge on (these must exist in both DataFrames)
        indicator=True,  # When true, adds “_merge” column with source of each row
    )
    
    print("Number of rows & columns:", join1.shape)
    join1.head()

    Joining on column(s) with different names

    The code below joins two DataFrames on columns that may not share the same column names in both DataFrames.

    join2 = pd.merge(
        left=df1,  # Specify DataFrame on the left to merge
        right=df2,  # Specify DataFrame on the right to merge
        how="inner",  # Choose 'inner', 'outer', 'cross' 'left' or 'right'
        left_on=["reg_date", "uid"],  # List of column(s) to merge on in the left DataFrame
        right_on=["date", "uid"],  # List of column(s) to merge on in the right DataFrame
        indicator=True,  # When true, adds “_merge” column with source of each row
    )
    
    print("Number of rows & columns:", join2.shape)
    join2.head()

    For more information on arguments and output options, visit pandas' merge() documentation.

    If you're struggling to understand the different join types, it may be helpful to experiment with different values for the how and on arguments of the merge() function. The _merge column enabled by indicator=True is also useful for interpreting the resulting DataFrame, specifically why a row is included in the output.