Decision Tree Classification

    A decision tree is a supervised learning method used for both classification and regression. It is known for its interpretability, which comes from its flowchart-like tree structure. This template trains, tunes, and visualizes a decision tree for a classification problem. If you would like to learn more about decision trees, take a look at DataCamp's Machine Learning with Tree-Based Models in Python course.

    To use your own dataset with this template, it must meet the following requirements:

    • There are at least two feature columns and a column with a categorical target variable you would like to predict.
    • The features have been cleaned and preprocessed, including categorical encoding. DataCamp has a course on preprocessing if you need more guidance.
    • There are no NaN/NA values. You can use this template to impute missing values if needed.
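
    The requirements above can be checked programmatically before training. The following is a minimal sketch, not part of the original template: the helper name check_dataset and the toy column names are hypothetical, and it assumes that "preprocessed" means every feature column is numeric or boolean.

```python
import numpy as np
import pandas as pd


def check_dataset(df, target):
    """Return a list of problems found; an empty list means all checks passed."""
    if target not in df.columns:
        return [f"target column '{target}' not found"]

    problems = []
    features = df.drop(columns=[target])

    # Requirement 1: at least two feature columns
    if features.shape[1] < 2:
        problems.append("fewer than two feature columns")

    # Requirement 3: no NaN/NA values anywhere in the data
    if df.isna().any().any():
        problems.append("missing values present")

    # Requirement 2 (partial check): categorical features should already be encoded
    non_numeric = features.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric feature columns: {non_numeric}")

    return problems


# Toy data standing in for a real dataset (hypothetical values)
toy = pd.DataFrame(
    {
        "is_canceled": [0, 1, 0, 1],
        "lead_time": [10, 200, 35, np.nan],
        "channel": ["online", "agent", "online", "agent"],
    }
)
print(check_dataset(toy, "is_canceled"))
```

    An empty list suggests the data is ready for the steps below. Any reported problem points to the cleaning, encoding, or imputation work that is still needed.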

    The placeholder dataset in this template consists of hotel booking data with details such as length of stay and how the booking was made. Each row represents a booking and whether the booking was canceled (the target variable). You can find more information on this dataset's source and dictionary here.

    1. Loading packages and data

    # Load packages
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import RandomizedSearchCV
    from sklearn import tree
    
    # Load the data and replace with your CSV file path
    df = pd.read_csv("data/hotel_bookings_clean.csv")
    print(df.head())  # Preview the first rows
    # Check if there are any null values
    print(df.isnull().sum())
    # Check columns to make sure you have features and a target variable
    df.info()

    2. Splitting the data

    To split the data, we'll use the train_test_split() function.

    # Split the data into two DataFrames: X (features) and y (target variable)
    X = df.drop(columns=["is_canceled"])  # Use every column except the target as features (at least two)
    y = df["is_canceled"]  # Specify one column as the target variable
    
    # Split the data into train and test subsets
    # You can adjust the test size and random state
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.30, random_state=123
    )
    
    X_train.shape, X_test.shape, y_train.shape, y_test.shape
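
    If the target classes are imbalanced, a purely random split can leave the train and test subsets with different class proportions. The sketch below shows the stratify argument of train_test_split(), which preserves the class ratio in both subsets. It uses synthetic data from make_classification() as a stand-in for a real dataset, and the variable names are illustrative only.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 80% / 20% class split)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.8, 0.2], random_state=123
)

# stratify=y_demo keeps the class proportions the same in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123, stratify=y_demo
)

print("Positive share, full data:", y_demo.mean())
print("Positive share, train    :", y_tr.mean())
print("Positive share, test     :", y_te.mean())
```

    In the template itself, the equivalent change would be adding stratify=y to the existing train_test_split() call.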

    3. Building a decision tree classifier

    The following code builds a scikit-learn DecisionTreeClassifier using the most fundamental parameters. As a reminder, you can learn more about these parameters in DataCamp's Machine Learning with Tree-Based Models in Python course and scikit-learn's documentation.

    # Define parameters: these will need to be tuned to prevent overfitting and underfitting
    params = {
        "criterion": "gini",  # Function to measure the quality of a split: "gini" or "entropy"
        "max_depth": 6,  # Max depth of the tree
        "min_samples_split": 2,  # Min number of samples required to split a node
        "min_samples_leaf": 1,  # Min number of samples required at a leaf node
        "ccp_alpha": 0.01,  # Cost complexity parameter for pruning
        "random_state": 123,
    }
    
    # Create a DecisionTreeClassifier object with the parameters above
    clf = DecisionTreeClassifier(**params)
    
    # Train the decision tree classifier on the train set
    clf = clf.fit(X_train, y_train)
    
    # Predict the outcomes on the test set
    y_pred = clf.predict(X_test)
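
    The ccp_alpha value above is a starting point rather than a tuned choice. One way to find candidate values is scikit-learn's cost_complexity_pruning_path() method, which returns the alphas at which subtrees of the fully grown tree would be pruned. The sketch below runs on synthetic data, and the held-out validation split is an assumption made for illustration. In practice, cross-validation is usually a safer way to choose among the candidates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data; replace with your own training data
X_demo, y_demo = make_classification(n_samples=500, n_features=5, random_state=123)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123
)

# Compute the effective alphas at which subtrees of the full tree get pruned
path = DecisionTreeClassifier(random_state=123).cost_complexity_pruning_path(X_tr, y_tr)
# Guard against tiny negative values caused by floating-point error
alphas = np.clip(path.ccp_alphas, 0, None)

# Refit on a handful of candidates and compare tree size and accuracy
step = max(1, len(alphas) // 5)
leaf_counts = []
for alpha in alphas[::step]:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=123).fit(X_tr, y_tr)
    leaf_counts.append(pruned.get_n_leaves())
    print(
        f"ccp_alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
        f"validation accuracy={pruned.score(X_val, y_val):.3f}"
    )
```

    Larger alphas prune more aggressively, so the number of leaves shrinks as ccp_alpha grows. A value that keeps validation accuracy high with fewer leaves is usually a reasonable candidate.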

    To evaluate this classifier, we will use accuracy and implement it with sklearn's metrics.accuracy_score() function. Accuracy may not be the best evaluation metric for your problem, especially if your dataset has class imbalance (in which case recall or precision may be more suitable).

    # Evaluate accuracy
    print("Accuracy:", accuracy_score(y_test, y_pred))
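
    As noted above, accuracy can be misleading when classes are imbalanced. The sketch below shows how precision, recall, and a confusion matrix could be computed with sklearn.metrics. It trains a small tree on synthetic imbalanced data so that it runs on its own. The variable names are illustrative and not part of the original template.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (roughly 80% / 20% class split)
X_demo, y_demo = make_classification(
    n_samples=1000, n_features=5, weights=[0.8, 0.2], random_state=123
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.30, random_state=123, stratify=y_demo
)

demo_clf = DecisionTreeClassifier(max_depth=6, random_state=123).fit(X_tr, y_tr)
demo_pred = demo_clf.predict(X_te)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_te, demo_pred)
# Precision: share of predicted positives that are truly positive
precision = precision_score(y_te, demo_pred, zero_division=0)
# Recall: share of true positives that the model found
recall = recall_score(y_te, demo_pred, zero_division=0)

print("Confusion matrix:\n", cm)
print("Precision:", precision)
print("Recall:", recall)
```

    With the template's own variables, the same calls would take y_test and y_pred as arguments.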

    4. Visualizing a decision tree

    You can visualize your trained DecisionTreeClassifier using sklearn's plot_tree().

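    plt.figure(figsize=(12, 8))
    # Pass the feature names as a plain list, which plot_tree() accepts across scikit-learn versions
    tree.plot_tree(clf, feature_names=list(X.columns))
    plt.show()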
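
    When a plot is hard to read, or a text version is more convenient for a report, scikit-learn's export_text() prints the same tree as indented rules. The following is a minimal sketch on synthetic data. The feature names are placeholders, and the shallow max_depth is chosen only to keep the output short.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-in data with placeholder feature names
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=123)
feature_names = [f"feature_{i}" for i in range(X_demo.shape[1])]

# A shallow tree keeps the printed rules short
demo_clf = DecisionTreeClassifier(max_depth=2, random_state=123).fit(X_demo, y_demo)

# Each line is a split condition or a predicted class at a leaf
rules = export_text(demo_clf, feature_names=feature_names)
print(rules)
```

    For the classifier trained in this template, the call would be export_text(clf, feature_names=list(X.columns)).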