Live Training: Exploratory Data Analysis in Python (Solution)
  • AI Chat
  • Code
  • Report
  • Spinner

    Exploratory Data Analysis in Python

    In this live training, we'll be doing Exploratory Data Analysis, or EDA, on a dataset that consists of hotel booking data. It includes many details about the bookings, including room specifications, the length of stay, the time between the booking and the stay, whether the booking was canceled, and how the booking was made. The data was gathered between July 2015 and August 2017. You can consult the appendices at the bottom of the notebook for citations and an overview of all variables.

    import pandas as pd
    import plotly.express as px

    Import the data

    df = pd.read_csv("hotel_bookings_clean_v2.csv")
    df

    Basic exploration

    # Dimensions
    print(df.shape)
    # Missing values
    df.isnull().sum()
    # Describe
    df.describe()
    # How many bookings were canceled?
    n_canceled = df['is_canceled'].sum()
    pct_canceled = df['is_canceled'].mean()
    print(f"{n_canceled} bookings ({pct_canceled*100:.2f}% of all bookings) were cancelled")

    Are the cancellation rates different during different times of the year?

    # Calculate and plot cancellations every month
    cancellations = df\
        .filter(['arrival_date_month', 'is_canceled'])\
        .groupby(by = 'arrival_date_month', as_index=False)\
        .sum()
    px.bar(cancellations, x='arrival_date_month', y='is_canceled')
    # Calculate and plot total bookins every month
    total_bookings = df\
        .filter(['arrival_date_month', 'is_canceled'])\
        .groupby(by = 'arrival_date_month', as_index=False)\
        .count()\
        .rename(columns = {'is_canceled': 'total_bookings'})
    px.bar(total_bookings, x='arrival_date_month', y='total_bookings')
    # Calculate cancellation rates every month
    merged = pd\
        .merge(total_bookings, cancellations, on='arrival_date_month')\
        .assign(pct_canceled = lambda x: x["is_canceled"] / x["total_bookings"])
    px.bar(merged, x='arrival_date_month', y='pct_canceled')

    Does the amount of nights influence the cancellation rate?

    # Prepare the data
    df_sel = df\
        .assign(stays = lambda x: x["stays_in_week_nights"] + x["stays_in_weekend_nights"])\
        .query('stays < 15')