Predicting Hotel Cancellations

🏨 Background

You are supporting a hotel with a project aimed at increasing revenue from room bookings. They believe that data science can help them reduce the number of cancellations. This is where you come in!

They have asked you to use any appropriate methodology to identify what contributes to whether a booking will be fulfilled or cancelled. They intend to use the results of your work to reduce the chance someone cancels their booking.

The Data

They have provided you with their bookings data in a file called hotel_bookings.csv, which contains the following:

| Column | Description |
|---|---|
| Booking_ID | Unique identifier of the booking. |
| no_of_adults | The number of adults. |
| no_of_children | The number of children. |
| no_of_weekend_nights | Number of weekend nights (Saturday or Sunday). |
| no_of_week_nights | Number of week nights (Monday to Friday). |
| type_of_meal_plan | Type of meal plan included in the booking. |
| required_car_parking_space | Whether a car parking space is required. |
| room_type_reserved | The type of room reserved. |
| lead_time | Number of days before the arrival date the booking was made. |
| arrival_year | Year of arrival. |
| arrival_month | Month of arrival. |
| arrival_date | Day of the month of arrival. |
| market_segment_type | How the booking was made. |
| repeated_guest | Whether the guest has previously stayed at the hotel. |
| no_of_previous_cancellations | Number of previous cancellations. |
| no_of_previous_bookings_not_canceled | Number of previous bookings that were not canceled. |
| avg_price_per_room | Average price per day of the booking. |
| no_of_special_requests | Count of special requests made as part of the booking. |

Source (data has been modified): https://www.kaggle.com/datasets/ahsan81/hotel-reservations-classification-dataset

import pandas as pd
hotels = pd.read_csv("data/hotel_bookings.csv")
hotels
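Before modelling, it helps to confirm the load worked and to check the class balance of the target. A minimal sketch of those checks, run here on a small hypothetical frame mirroring a few of the columns (the real CSV is not bundled with this notebook):

```python
import pandas as pd

# Hypothetical sample rows, only to illustrate the checks;
# in the notebook these calls would run on `hotels` instead.
sample = pd.DataFrame({
    "Booking_ID": ["INN00001", "INN00002", "INN00003"],
    "lead_time": [224, 5, 44],
    "booking_status": ["Not_Canceled", "Canceled", "Not_Canceled"],
})

print(sample.shape)                             # rows x columns
print(sample.isna().sum().sum())                # total missing values
print(sample["booking_status"].value_counts())  # class balance of the target
```

The target distribution matters later: if cancellations are a minority class, raw accuracy alone can be misleading.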

The Challenge

  • Use your skills to produce recommendations for the hotel on what factors affect whether customers cancel their booking.

Note:

To ensure the best user experience, we currently discourage using Folium and Bokeh in Workspace notebooks.

Judging Criteria

| Category | Weighting | Details |
|---|---|---|
| Recommendations | 35% | Clarity of recommendations - how clear and well presented the recommendation is. Quality of recommendations - are appropriate analytical techniques used and are the conclusions valid? Number of relevant insights found for the target audience. |
| Storytelling | 35% | How well the data and insights are connected to the recommendation. How the narrative and whole report connect together. Balancing making the report in-depth enough but also concise. |
| Visualizations | 20% | Appropriateness of the visualization used. Clarity of insight from the visualization. |
| Votes | 10% | Upvoting - the most upvoted entries get the most points. |

Checklist before publishing

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria, so the workbook is focused on your work.
  • Check that all the cells run without error.

Time is ticking. Good luck!

Encode booking_status as an integer: 1 = Cancelled, 0 = Not Cancelled.

import numpy as np

# Work on a copy so the raw data stays untouched
fred = hotels.copy(deep=True)

# Encode the target: 1 = cancelled, 0 = not cancelled
fred.loc[fred["booking_status"] == "Canceled", "booking_status"] = 1
fred.loc[fred["booking_status"] == "Not_Canceled", "booking_status"] = 0
fred["booking_status"] = fred["booking_status"].astype(np.int32)

# Sanity check the class balance after encoding
fred.booking_status.value_counts()

# Bin continuous features with pd.cut and one-hot encode categoricals with get_dummies
fred = pd.concat([fred,
                  pd.get_dummies(pd.cut(fred['lead_time'], bins=5, precision=0),
                                 prefix="lead_time", dummy_na=True),
                  pd.get_dummies(fred["no_of_children"], prefix="no_of_children", drop_first=True,
                                 dummy_na=True),
                  pd.get_dummies(fred["no_of_adults"], prefix="no_of_adults", drop_first=True,
                                 dummy_na=True),
                  pd.get_dummies(pd.cut(fred['no_of_week_nights'], bins=5, precision=0),
                                 prefix="no_of_week_nights", dummy_na=True),
                  pd.get_dummies(pd.cut(fred['no_of_weekend_nights'], bins=5, precision=0),
                                 prefix="no_of_weekend_nights", dummy_na=True),
                  pd.get_dummies(fred["type_of_meal_plan"], prefix="type_of_meal_plan",drop_first=True),
                  pd.get_dummies(fred["room_type_reserved"], prefix="room_type_reserved",drop_first=True),
                  pd.get_dummies(fred["required_car_parking_space"],
                                 prefix="required_car_parking_space", drop_first=True, dummy_na=True),
                  pd.get_dummies(fred["market_segment_type"], prefix="market_segment_type", drop_first=True,
                                 dummy_na=True),
                  pd.get_dummies(fred["repeated_guest"], prefix="repeated_guest", drop_first=True,
                                 dummy_na=True),
                  pd.get_dummies(pd.cut(fred['no_of_previous_cancellations'], bins=5, precision=0),
                                 prefix="no_of_previous_cancellations", dummy_na=True),
                  pd.get_dummies(pd.cut(fred['no_of_previous_bookings_not_canceled'], bins=5, precision=0),
                                 prefix="no_of_previous_bookings_not_canceled", dummy_na=True),
                  pd.get_dummies(pd.cut(fred['avg_price_per_room'], bins=5, precision=0),
                                 prefix="avg_price_per_room", dummy_na=True),
                  pd.get_dummies(pd.cut(fred['no_of_special_requests'], bins=3, precision=0),
                                 prefix="no_of_special_requests", dummy_na=True)],axis=1)

# Drop the raw columns now replaced by encoded versions, plus the unused ID and arrival-date fields
fred = fred.drop(['Booking_ID','arrival_year','arrival_month','arrival_date',
                  'lead_time','no_of_children','no_of_adults','no_of_week_nights','no_of_weekend_nights',
                  'type_of_meal_plan','room_type_reserved','required_car_parking_space','market_segment_type',
                  'repeated_guest','no_of_previous_cancellations','no_of_previous_bookings_not_canceled',
                  'avg_price_per_room','no_of_special_requests'],axis=1)
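To see what the `pd.cut` + `pd.get_dummies` combination actually produces, here is a minimal sketch on toy lead times (the bin edges are equal-width over the observed range, and `dummy_na=True` always adds a NaN column, even when no values are missing):

```python
import pandas as pd

# Toy lead times standing in for the real column
lead = pd.Series([3, 40, 200, 410])

# Two equal-width bins; precision=0 rounds the interval labels
bins = pd.cut(lead, bins=2, precision=0)
dummies = pd.get_dummies(bins, prefix="lead_time", dummy_na=True)

print(dummies.columns.tolist())  # two interval columns plus the NaN column
print(dummies.sum().tolist())    # how many rows fall in each bin
```

Equal-width binning is sensitive to outliers (a single very long lead time stretches every bin), so quantile-based `pd.qcut` is a common alternative worth considering here.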

fred.columns
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
SEED = 1

# Features: everything except the target; keeping y a Series avoids a shape warning in fit
X = fred.drop(columns="booking_status")
y = fred["booking_status"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    stratify=y, random_state=SEED)

# Fit a random forest classifier (default hyperparameters, seeded for reproducibility)
rf = RandomForestClassifier(random_state=SEED)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy:", accuracy)
fi = pd.DataFrame({'name':rf.feature_names_in_,'importance':rf.feature_importances_})
fi.sort_values('importance',ascending=False,inplace=True)
# Keep only features whose importance clears a small threshold
important = fi.loc[fi['importance'] > 0.001, "name"].tolist()
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    RocCurveDisplay,
)

# Standardize X data based on X_train
sc = StandardScaler().fit(X_train[important])
X_train_scaled = sc.transform(X_train[important])
X_test_scaled = sc.transform(X_test[important])

# Define parameters: these will need to be tuned to prevent overfitting and underfitting
params = {
    "penalty": "l2",  # Norm of the penalty: 'l1', 'l2', 'elasticnet', 'none'
    "C": 1,  # Inverse of regularization strength, a positive float
    "random_state": 123,
}

# Create a logistic regression classifier object with the parameters above
clf = LogisticRegression(**params)

# Train the classifier on the train set
clf = clf.fit(X_train_scaled, y_train)

# Predict the outcomes on the test set
y_pred = clf.predict(X_test_scaled)

# Calculate the accuracy, precision, and recall scores
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))

# Plot ROC curve
RocCurveDisplay.from_estimator(clf, X_test_scaled, y_test)
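`confusion_matrix` is imported above but never used; it breaks the errors down into false positives (bookings flagged as cancellations that were honoured) and false negatives (missed cancellations), which matter differently to the hotel. A minimal sketch on toy labels:

```python
from sklearn.metrics import confusion_matrix

# Toy labels: 1 = cancelled, 0 = not cancelled
y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 0, 1, 0]

# Rows are true classes, columns predicted: [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true, y_pred)
print(cm)
```

In the notebook the same call on `y_test` and `y_pred` would show whether the model's errors lean toward missed cancellations.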
# Logistic regression coefficients on the standardized features,
# sorted so the strongest cancellation-increasing features come first
fi = pd.DataFrame({'name': important, 'coef': clf.coef_.tolist()[0]}).sort_values('coef', ascending=False)

fi