Predicting hotel booking cancellations in Python

    In this workspace, we will build a machine learning model to predict whether or not a customer cancelled a hotel booking.

    We will use a dataset on hotel bookings from the article "Hotel booking demand datasets", published in the Elsevier journal Data in Brief. The abstract of the article states:

    This data article describes two datasets with hotel demand data. One of the hotels (H1) is a resort hotel and the other is a city hotel (H2). Both datasets share the same structure, with 31 variables describing the 40,060 observations of H1 and 79,330 observations of H2. Each observation represents a hotel booking. Both datasets comprehend bookings due to arrive between the 1st of July of 2015 and the 31st of August 2017, including bookings that effectively arrived and bookings that were canceled.

    For convenience, the two datasets have been combined into a single CSV file, data/hotel_bookings.csv. Let us start by importing everything needed to load, visualize, and model the data.

    # Data imports
    import pandas as pd
    import numpy as np
    
    # Visualization imports
    import plotly.express as px
    
    # ML Imports and configuration
    from sklearn.model_selection import train_test_split, KFold, cross_validate, cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from sklearn.impute import SimpleImputer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
    from sklearn import set_config
    
    # Display composed pipelines as diagrams
    set_config(display="diagram")

    1. Import the data

    The first step in any machine learning workflow is to get the data and explore it.

    hotel_bookings = pd.read_csv('data/hotel_bookings.csv')
    hotel_bookings.head()
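
    Before modeling, it is also worth checking the size of the combined dataset and which columns contain missing values, since we impute them later in the preprocessing pipeline. A quick sketch of that check (an extra step for illustration):

    # Dataset size and the columns with the most missing values
    print(hotel_bookings.shape)
    print(hotel_bookings.isna().sum().sort_values(ascending=False).head(10))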

    As a quick exploration, let us look at the number of bookings by month.

    bookings_by_month = (
        hotel_bookings
        .groupby('arrival_date_month', as_index=False)[['hotel']]
        .count()
        .rename(columns={"hotel": "nb_bookings"})
    )
    months = ['January', 'February', 'March', 'April', 'May', 'June', 'July', 'August', 'September', 'October', 'November', 'December'] 
    fig = px.bar(
        bookings_by_month, 
        x='arrival_date_month', 
        y='nb_bookings', 
        title='Hotel bookings by month', 
        category_orders={"arrival_date_month": months}
    )
    fig.show(config={"displayModeBar": False})

    Our objective is to build a classification model (a classifier) that predicts whether or not a customer cancelled a hotel booking.
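
    Since we will later judge the classifier by its accuracy, it helps to know how balanced the two classes are. As a quick extra check (not part of the original write-up), we can look at the share of cancelled versus non-cancelled bookings:

    # Share of cancelled (1) vs. non-cancelled (0) bookings
    hotel_bookings['is_canceled'].value_counts(normalize=True)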

    2. Split the data into training and test sets

    Let us start by defining a split to divide the data into training and test sets. The basic idea is to train the model on one portion of the data and measure its performance on the remaining portion, which the model has never seen. This lets us detect overfitting: a model that merely memorizes the training data will perform poorly on the held-out test set.

    # List all numerical features
    features_num = [
        "lead_time", "arrival_date_week_number", "arrival_date_day_of_month", "stays_in_weekend_nights",
        "stays_in_week_nights", "adults", "children", "babies", "is_repeated_guest" ,
        "previous_cancellations", "previous_bookings_not_canceled", "agent", "company", 
        "required_car_parking_spaces", "total_of_special_requests", "adr"
    ]
    
    # List all categorical features
    features_cat = [
        "hotel", "arrival_date_month", "meal", "market_segment", "distribution_channel", 
        "reserved_room_type", "deposit_type", "customer_type"
    ]
    
    features = features_num + features_cat
    
    X = hotel_bookings[features]
    y = hotel_bookings['is_canceled']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=420)
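
    As a quick sanity check (an addition for illustration), we can confirm how the bookings are divided between the two sets:

    # Roughly 70% of the rows go to training and 30% to testing
    print(X_train.shape, X_test.shape)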

    3. Preprocess the data

    The next step is to set up a pipeline to preprocess the features. We will impute missing numerical values with a constant (0 by default), impute missing categorical values with the placeholder "Unknown", and one-hot encode all categorical features.

    transformer_num = SimpleImputer(strategy="constant")
    
    transformer_cat = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="Unknown")),
        ("onehot", OneHotEncoder(handle_unknown='ignore'))
    ])
    
    preprocessor = ColumnTransformer(transformers=[
        ("num", transformer_num, features_num),
        ("cat", transformer_cat, features_cat)
    ])
    
    preprocessor
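
    To get a feel for what the preprocessor produces, we can fit it on the training data and inspect the shape of the transformed feature matrix (an illustrative check; the exact number of columns depends on the categories present in the data):

    # One-hot encoding expands each categorical feature into one column per category
    X_train_transformed = preprocessor.fit_transform(X_train)
    print(X_train_transformed.shape)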

    4. Fit the models and evaluate performance

    Next, we combine the preprocessor and a Decision Tree classifier into a single pipeline and fit it on the training data.

    # Compose data preprocessing and model into a single pipeline
    steps = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('model', DecisionTreeClassifier(random_state=1234))
    ])
    steps.fit(X_train, y_train)

    To see how well our model performs on data it has not seen, we'll calculate and visualize a confusion matrix on the test set, and then calculate the accuracy of the model.

    ConfusionMatrixDisplay.from_estimator(steps, X_test, y_test);
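
    The test-set accuracy can then be computed with the accuracy_score function imported earlier (a short sketch of the step described above):

    # Proportion of test bookings whose cancellation status is predicted correctly
    y_pred = steps.predict(X_test)
    accuracy_score(y_test, y_pred)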