Decision Tree Classification
A decision tree is a supervised learning method used for both classification and regression. It is known for its interpretability, which comes from its flowchart-like tree structure. This template trains, tunes, and visualizes a decision tree for a classification problem. If you would like to learn more about decision trees, take a look at DataCamp's Machine Learning with Tree-Based Models in Python course.
To use your own dataset with this template, it must meet the following requirements:
- There are at least two feature columns and a column with a categorical target variable you would like to predict.
- The features have been cleaned and preprocessed, including categorical encoding. DataCamp has a course on preprocessing if you need more guidance.
- There are no NaN/NA values. You can use this template to impute missing values if needed.
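As a rough illustration of the last two requirements, the sketch below imputes missing values and one-hot encodes a categorical column with pandas. The toy DataFrame and its column names (`lead_time`, `market_segment`) are hypothetical stand-ins, not the template's actual data, and median/mode imputation is only one of several reasonable choices.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for your own dataset
raw = pd.DataFrame(
    {
        "lead_time": [10.0, np.nan, 45.0, 3.0],
        "market_segment": ["Online", "Direct", None, "Online"],
        "is_canceled": [0, 1, 1, 0],
    }
)

# Impute: median for the numeric column, most frequent value for the categorical one
raw["lead_time"] = raw["lead_time"].fillna(raw["lead_time"].median())
raw["market_segment"] = raw["market_segment"].fillna(raw["market_segment"].mode()[0])

# One-hot encode the categorical feature so that every feature column is numeric
prepared = pd.get_dummies(raw, columns=["market_segment"])

# Total number of missing values remaining after preprocessing
print(prepared.isnull().sum().sum())
```

For a real project, it is usually safer to fit imputation statistics on the training split only, so that no information from the test set leaks into training.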
The placeholder dataset in this template consists of hotel booking data with details such as length of stay and how the booking was made. Each row represents a booking and whether the booking was canceled (the target variable). You can find more information on this dataset's source and dictionary here.
1. Loading packages and data
# Load packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn import tree
# Load the data and replace with your CSV file path
df = pd.read_csv("data/hotel_bookings_clean.csv")
df.head()
# Check if there are any null values
print(df.isnull().sum())
# Check columns to make sure you have features and a target variable
df.info()
2. Splitting the data
To split the data, we'll use the train_test_split() function.
# Split the data into two DataFrames: X (features) and y (target variable)
X = df.drop(columns=["is_canceled"]) # Use every column except the target as features (at least two)
y = df["is_canceled"] # Specify one column as the target variable
# Split the data into train and test subsets
# You can adjust the test size and random state
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.30, random_state=123
)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
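If your target classes are imbalanced, one option is to pass `stratify` to `train_test_split()` so that both subsets keep roughly the same class proportions. The sketch below uses synthetic labels (an assumption for illustration, not the hotel data) to show the effect.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced labels: 80 zeros and 20 ones (illustrative only)
y_demo = np.array([0] * 80 + [1] * 20)
X_demo = np.arange(200).reshape(100, 2)

# stratify=y_demo preserves the 80/20 class ratio in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123, stratify=y_demo
)

print("Train positive rate:", y_tr.mean())
print("Test positive rate:", y_te.mean())
```

In the template, this would amount to adding `stratify=y` to the existing call.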
3. Building a decision tree classifier
The following code builds a scikit-learn DecisionTreeClassifier using the most fundamental parameters. As a reminder, you can learn more about these parameters in DataCamp's Machine Learning with Tree-Based Models in Python course and scikit-learn's documentation.
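For intuition about the `criterion` parameter, the sketch below computes Gini impurity and entropy by hand for the labels in a single node. The helper functions `gini` and `entropy` are written for this illustration and are not part of scikit-learn, which computes these quantities internally.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits: sum of -p * log2(p) over the classes present."""
    n = len(labels)
    return sum(-(count / n) * math.log2(count / n) for count in Counter(labels).values())


mixed_node = [0, 0, 1, 1]  # evenly split between two classes
pure_node = [1, 1, 1, 1]   # only one class present

print("Mixed node:", gini(mixed_node), entropy(mixed_node))
print("Pure node:", gini(pure_node), entropy(pure_node))
```

Both measures are zero for a pure node and largest for an evenly mixed one (0.5 for Gini and 1 bit for entropy with two classes), so the tree prefers splits that make child nodes purer.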
# Define parameters: these will need to be tuned to prevent overfitting and underfitting
params = {
"criterion": "gini", # Function to measure the quality of a split: "gini" or "entropy"
"max_depth": 6, # Max depth of the tree
"min_samples_split": 2, # Min number of samples required to split a node
"min_samples_leaf": 1, # Min number of samples required at a leaf node
"ccp_alpha": 0.01, # Cost complexity parameter for pruning
"random_state": 123,
}
# Create a DecisionTreeClassifier object with the parameters above
clf = DecisionTreeClassifier(**params)
# Train the decision tree classifier on the train set
clf = clf.fit(X_train, y_train)
# Predict the outcomes on the test set
y_pred = clf.predict(X_test)
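To get a feel for how `ccp_alpha` affects tree size, one possible approach is scikit-learn's `cost_complexity_pruning_path()`, which returns the candidate alpha values for a given training set. The sketch below runs on synthetic data from `make_classification` (an assumption for illustration) and refits a tree at a few of those alphas.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic data for illustration only
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=123)

# Candidate alphas at which the fully grown tree would be pruned further
path = DecisionTreeClassifier(random_state=123).cost_complexity_pruning_path(
    X_demo, y_demo
)
alphas = [max(float(a), 0.0) for a in path.ccp_alphas]  # guard against tiny negative values

# Refit at roughly five evenly spaced candidates and record the number of leaves
step = max(1, len(alphas) // 5)
leaf_counts = []
for alpha in alphas[::step]:
    pruned = DecisionTreeClassifier(random_state=123, ccp_alpha=alpha)
    pruned.fit(X_demo, y_demo)
    leaf_counts.append(pruned.get_n_leaves())
    print(f"ccp_alpha={alpha:.4f} -> {pruned.get_n_leaves()} leaves")
```

Larger values of `ccp_alpha` prune more aggressively and yield smaller trees, so the value is typically chosen by comparing validation scores across candidates.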
To evaluate this classifier, we will use accuracy and implement it with sklearn's metrics.accuracy_score() function. Accuracy may not be the best evaluation metric for your problem, especially if your dataset has class imbalance (in which case recall or precision may be more suitable).
# Evaluate accuracy
print("Accuracy:", accuracy_score(y_test, y_pred))
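If accuracy alone is not informative enough, precision, recall, and a confusion matrix can be computed in the same way. The sketch below uses small hand-written label lists (hypothetical values, not predictions from the hotel data) so the numbers are easy to check by hand.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 1 = canceled, 0 = not canceled
y_true_demo = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred_demo = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes and columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)
print(cm)

# Precision: share of predicted cancellations that were real cancellations
precision = precision_score(y_true_demo, y_pred_demo)
# Recall: share of real cancellations that the model caught
recall = recall_score(y_true_demo, y_pred_demo)
print("Precision:", precision)
print("Recall:", recall)
```

In the template, you would pass `y_test` and `y_pred` instead of the demo lists.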
4. Visualizing a decision tree
You can visualize your trained DecisionTreeClassifier using sklearn's plot_tree().
plt.figure(figsize=(12, 8))
# Pass feature names as a list; some scikit-learn versions reject a pandas Index
tree.plot_tree(clf, feature_names=list(X.columns))
plt.show()
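When a plot is hard to read, a text rendering of the same rules can be a useful complement. The sketch below uses scikit-learn's `export_text()` on a small tree trained on the built-in iris dataset (chosen here only so the example is self-contained).

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

# Train a shallow tree on a built-in dataset for illustration
iris = load_iris()
demo_clf = DecisionTreeClassifier(max_depth=2, random_state=123)
demo_clf.fit(iris.data, iris.target)

# Render the learned splits as indented if/else style rules
rules = export_text(demo_clf, feature_names=list(iris.feature_names))
print(rules)
```

With the template's classifier, the equivalent call would be `export_text(clf, feature_names=list(X.columns))`.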