Workspace
max chen/

Competition - predict hotel cancellation

0
Beta
Spinner

Objective

Discover factors on hotel booking cancellation and provide recommendations to increase revenue.

The Data

They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

ColumnDescription
Booking_IDUnique identifier of the booking.
no_of_adultsThe number of adults.
no_of_childrenThe number of children.
no_of_weekend_nightsNumber of weekend nights (Saturday or Sunday).
no_of_week_nightsNumber of week nights (Monday to Friday).
type_of_meal_planType of meal plan included in the booking.
required_car_parking_spaceWhether a car parking space is required.
room_type_reservedThe type of room reserved.
lead_timeNumber of days before the arrival date the booking was made.
arrival_yearYear of arrival.
arrival_monthMonth of arrival.
arrival_dateDate of the month for arrival.
market_segment_typeHow the booking was made.
repeated_guestWhether the guest has previously stayed at the hotel.
no_of_previous_cancellationsNumber of previous cancellations.
no_of_previous_bookings_not_canceledNumber of previous bookings that were canceled.
avg_price_per_roomAverage price per day of the booking.
no_of_special_requestsCount of special requests made as part of the booking.
booking_statusWhether the booking was cancelled or not.

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

Visualization

Visualization of data is done by tableau after cleaning data with python. Click here to view interactive dashboards. Start from identy target to cancel weekday.

Findings

The hotel has a 32.74% cancellation rate on its entire customer population (32780), which causes 1.18M financial loss.

For customers who canceled hotel bookings, most of them are non-repeated customers. Customers with children have a minor impact since it only dominates 7.53% in this group disregards the number of children. Customer with size 2 without children has a significant impact on hotel booking cancellation, which rules 66.88% and makes 63.87% revenue loss.

As the number of children and adults increases, expecting the average room price to be more expensive. However, there is an outlier in the cancel booking group. The customer with size 2 has 9 children, the average room price is only $76.5.

The given data is from 2017 and 2018. For data in 2017, it is only accessible begins from September up to December. Evaluating average room price impacts on customer size in cancellation on each month from September to December in both years, the average room price reaches its peak in September. The average room price starts to decline in the following months. However, when the average room price is at its maximum point, the customer size reaches its peak in October. This is an interesting fact that price drops and demand decreases. This is likely due to the coming holiday season, as customers wait for better prices.

For customers with two people without children, they prefer room type 1 the most, followed by room type 4 and room type 2 respectively. The most favorite meal plans are meal plan 1, then meal plan 2. Interestingly, customers in this group cancel their bookings most likely happens on Thursday and Wednesday based on the data in 2017 and 2018 respectively.

Conclusion and Reconmendation

To conclude, the hotel should focus on the customer with size two without kids since it is the main portion in canceling hotel bookings. Customers are sensitive to the average room price due to the coming holiday season. Therefore, the adjusted monthly average room price should consider the impact of the holiday season. Changing average room price with a small margin monthly such that the change of price margin matches the discount rate during the holiday season. Additionally, based on the pattern of canceling hotel bookings, it is better to provide promotions on Thursdays and Wednesdays to encourage customers to stay with the hotel. For example, special days in a week with a bundle(room type 4 and meal plan 1) to attract more customers or low down customer churn rate.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pylab as plt
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels.head(10)
hotels.info()
# get the portion of missing values in each columns
missing_por = np.round(hotels.isnull().mean()*100,2)
missing_por
hotels.describe()

The main challenging for missing value in this dataset is to fill missing values in the timeframe(no_of_weekend_nights, no_of_week_nights ,lead_time, arrival_year, arrival_month and arrival_date) and price(avg_price_per_room). These factors on the timeframe has significantly impact on analysis to understand the behavior of consumers booking hotels.

Since the portion of missing values in no_of_weekend_nights(1.01%) ,no_of_week_nights(2.22%) ,lead_time(1.30%), arrival_year(1.04%), arrival_month(1.394%),arrival_date(2.70%) and avg_price_per_room(1.27%) are less than 3% percent on the entire given dataset, we will drop the missing values in these columns

# dropping 'lead_time','arrival_year','arrival_month', 'arrival_date', 'avg_price_per_room'
hotel_update = hotels.dropna(subset=['no_of_weekend_nights','no_of_week_nights','lead_time','arrival_year','arrival_month', 'arrival_date', 'avg_price_per_room'])

hotel_update.shape[0]/hotels.shape[0]
np.round(hotel_update.isnull().mean()*100,2)
hotel_update.describe()
#no_of_adults fill no_of_adults with most frequent number
hotel_update['no_of_adults'] =  hotel_update['no_of_adults'].fillna(hotel_update['no_of_adults'].mode().loc[0])

# check if there is invalide value in no_of_adults which is less than 1
print("Number of adult less than 1?",(hotel_update['no_of_adults'] < 1).any())
# update the value in most frequent number
mask_adult = hotel_update['no_of_adults'] < 1
hotel_update.loc[mask_adult,'no_of_adults'] = hotel_update['no_of_adults'].mode().loc[0]
# fill missing value with the most frequent number
hotel_update['no_of_children'] =  hotel_update['no_of_children'].fillna(hotel_update['no_of_children'].mode().loc[0])
# fill required_car_parking_space with most frequent number
hotel_update['required_car_parking_space'] = hotel_update['required_car_parking_space'].fillna(hotel_update['required_car_parking_space'].mode().loc[0])
# repeated_guest depends has two types for 0, it does not have no_of_previous_cancellations neither no_of_previous_bookings_not_canceled 
cancel_vars = ['no_of_previous_cancellations','no_of_previous_bookings_not_canceled']
mask_rg = hotel_update['repeated_guest'] == 0 

#has missing values?
print("Missing values: ", hotel_update[mask_rg][cancel_vars].isnull().any(), sep='\n')

#set 0 on no_of_previous_cancellations & no_of_previous_cancellations when repeated_guest is 0
hotel_update.loc[mask_rg,cancel_vars] = 0

# fill repeated_guest is null as 0 when no_of_previous_cancellations & no_of_previous_cancellations are 0
mask_0 = (hotel_update['no_of_previous_cancellations'] == 0) & (hotel_update['no_of_previous_bookings_not_canceled'] ==0) &(hotel_update['repeated_guest'].isnull())
hotel_update.loc[mask_0,'repeated_guest'] = 0

print("Missing values: ", hotel_update[mask_rg][cancel_vars].isnull().any(), sep='\n')
#following the same steps for repeat_guest = 1
msk_rg = hotel_update['repeated_guest'] == 1
msk_1 = (hotel_update['no_of_previous_cancellations'].isnull()) | (hotel_update['no_of_previous_bookings_not_canceled'].isnull())
#has missing values?
print("Missing values: ", hotel_update[msk_rg][cancel_vars].isnull().any(), sep='\n')
hotel_update.loc[(msk_rg&msk_1)]
# update missing values
hotel_update.loc[(msk_rg&msk_1), 'no_of_previous_cancellations'] = 0
hotel_update.loc[(msk_rg&msk_1), 'no_of_previous_bookings_not_canceled'] = 1

print("Missing values: ", hotel_update[msk_rg][cancel_vars].isnull().any(), sep='\n')
# dropping repeat_guest OR no_of_previous_cancellations is null WHEN repeat_guest is null
mask_null = hotel_update['repeated_guest'].isnull()
mask_nulls = (mask_null & hotel_update['no_of_previous_cancellations'].isnull()) | (mask_null & hotel_update['no_of_previous_bookings_not_canceled'].isnull())
hotel_update.drop(hotel_update[mask_nulls].index, inplace=True)
#fill the missing value in repeat_guess when repeat_guest OR no_of_previous_cancellations is not zero.
hotel_update.loc[mask_null,'repeated_guest'] = 1
# set missing values to be zero
mask_rqst = hotel_update['no_of_special_requests'].isnull()
hotel_update.loc[mask_rqst, 'no_of_special_requests']= 0
# fill the missing value of type_of_meal_plan, room_type_reserved, market_segment_type with most frequent value
hotel_update['type_of_meal_plan'] = hotel_update['type_of_meal_plan'].fillna(hotel_update['type_of_meal_plan'].mode()[0])
hotel_update['room_type_reserved'] = hotel_update['room_type_reserved'].fillna(hotel_update['room_type_reserved'].mode()[0])
hotel_update['market_segment_type'] = hotel_update['market_segment_type'].fillna(hotel_update['market_segment_type'].mode()[0])