Vincenzo Catanzariti

Clipboard Health - Pricing Case Study



You’re launching a ride-hailing service that matches riders with drivers for trips between the Toledo Airport and Downtown Toledo. It’ll be active for only 12 months. You’ve been forced to charge riders $30 for each ride. You can pay drivers what you choose for each individual ride.

The supply pool (“drivers”) is very deep. When a ride is requested, a very large pool of drivers see a notification informing them of the request. They can choose whether or not to accept it. Based on a similar ride-hailing service in the same market, you have some data on which ride requests were accepted and which were not. (The PAY column is what drivers were offered and the ACCEPTED column reflects whether any driver accepted the ride request.)

The demand pool (“riders”) can be acquired at a cost of $30 per rider at any time during the 12 months. There are 10,000 riders in Toledo, but you can’t acquire more than 1,000 in a given month. You start with 0 riders. “Acquisition” means that the rider has downloaded the app and may request rides. Requested rides may or may not be accepted by a driver. In the first month that riders are active, they request rides based on a Poisson distribution where lambda = 1. For each subsequent month, riders request rides based on a Poisson distribution where lambda is the number of rides that they found a match for in the previous month. (As an example, a rider that requests 3 rides in month 1 and finds 2 matches has a lambda of 2 going into month 2.) If a rider finds no matches in a month (which may happen either because they request no rides in the first place based on the Poisson distribution or because they request rides and find no matches), they leave the service and never return.
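The rider dynamics described above can be sketched as a small simulation. This is a minimal illustration, not part of the original brief: the function name `simulate_rider` and the fixed match probability `p_match` are assumptions made for the example (in the case study, the chance of a match depends on the pay offered to drivers).

```python
import numpy as np

def simulate_rider(p_match: float, months: int, rng: np.random.Generator) -> list:
    """Return the number of matched rides per active month for one rider.

    Assumption (hypothetical): every request is matched independently
    with the same probability `p_match`.
    """
    lam = 1.0  # first active month: lambda = 1
    history = []
    for _ in range(months):
        requested = rng.poisson(lam)                 # rides requested this month
        matched = rng.binomial(requested, p_match)   # rides that found a driver
        history.append(int(matched))
        if matched == 0:
            break                                    # no match: the rider leaves and never returns
        lam = float(matched)                         # next month's lambda = this month's matches
    return history

rng = np.random.default_rng(42)
histories = [simulate_rider(p_match=0.8, months=12, rng=rng) for _ in range(10_000)]

# Share of riders who found at least one match in their first month
survived_first_month = np.mean([h[0] > 0 for h in histories])
print(f"Share of riders still active after month 1: {survived_first_month:.3f}")
```

Because matched rides in the first month are a thinned Poisson variable with mean `p_match`, the printed share should sit close to 1 - exp(-p_match).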

Submit a written document that proposes a pricing strategy to maximize the profit of the business over the 12 months. You should expect that this singular document will serve as a proposal for

  1. A quantitative executive team that wants to know how you’re thinking about the problem and what assumptions you’re making but that does not know probability theory
  2. Your data science peers so they can push on your thinking

Please submit any work you do, code or math, with your solution.

1. Strategy introduction

Since rider acquisition and rider behavior cannot be controlled, and the fare is fixed at $30 per ride, the proposed strategy focuses on customer retention.

The pricing algorithm will operate under the assumption that a driver should be incentivized if and when, by rejecting a ride, they can cause a customer to drop out straight away, or cause a negative ripple effect over the following months.

An example of such an effect is a customer's lambda (the number of rides that will be requested the following month) being lowered beyond a useful threshold, which in turn damages the relationship of that customer with the company for the remaining months of the simulation.

On the other hand, excessive incentivization will erode much of the profit, even if applied aggressively only over the first months.

If a balance can be found, then this strategy becomes viable.

The analysis finds such a balance, yielding a 111.00% increase in net profit over the baseline.
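The trade-off described in this section can be written down in a simplified form. The notation is an assumption introduced for illustration only: w is the pay offered to drivers for a ride, and p(w) is the probability that a request offered w is accepted. Retention effects on future months are deliberately ignored here.

```latex
\[
\mathbb{E}[\text{margin per request} \mid w] = p(w)\,(30 - w),
\qquad
n_{\text{break-even}}(w) = \frac{30}{30 - w}
\]
```

The second expression is the number of matched rides needed to recover the $30 acquisition cost of one rider. For example, at w = 25 the margin is $5 per matched ride, so a rider must complete 6 matched rides before becoming profitable.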

2. Imports

# Base modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Extra modules
import scipy.stats as stats
from collections import Counter
import time

# Type hinting
from typing import Optional, Callable, Union, List

# ML
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import roc_auc_score, roc_curve, confusion_matrix, classification_report

3. Setup

# Reproducibility
seed = 42

# Plots
plt.rcParams['figure.figsize'] = (10, 5)
plt.rcParams['figure.dpi'] = 150
plt.rcParams['figure.autolayout'] = True
plot_suptitile_font_size = 14
plot_title_fontsize = 12
plot_labels_fontsize = 12
plot_legend_ticklabels_font_size = 9

4. Custom functions

def subset_no_yes(df: pd.DataFrame) -> tuple:
    """Subsets source DataFrame according to outcome (no / yes).

    Args:
        df: Pandas DataFrame containing 'PAY' column and 'ACCEPTED' column

    Returns:
        Tuple: [pay for negative outcomes, pay for positive outcomes]
    """

    pay_for_negative_outcomes = df.loc[df['ACCEPTED'] == 0, 'PAY']
    pay_for_positive_outcomes = df.loc[df['ACCEPTED'] == 1, 'PAY']

    return pay_for_negative_outcomes, pay_for_positive_outcomes

def plot_histograms(pay_no: pd.Series, pay_yes: pd.Series) -> None:
    """Calculates bin size and plots histograms for two Pandas Series.

    Args:
        pay_no: Pay data for negative outcomes
        pay_yes: Pay data for positive outcomes
    """

    # Binning
    bins_pay_no = int(np.sqrt(len(pay_no)))
    bins_pay_yes = int(np.sqrt(len(pay_yes)))

    # Histograms
    fig, ax = plt.subplots(nrows=1, ncols=1, figsize=(6, 4))

    ax.hist(pay_no, bins=bins_pay_no, label='No', color='C0', histtype='stepfilled', alpha=0.5, density=False)
    ax.hist(pay_yes, bins=bins_pay_yes, label='Yes', color='C1', histtype='stepfilled', alpha=0.5, density=False)

    plt.suptitle('Ride Acceptance', fontsize=14, y=0.95)

    ax.set_title('Histograms', fontsize=plot_title_fontsize)
    ax.set_xlabel('Pay $', fontsize=plot_labels_fontsize)
    ax.set_ylabel('Number of rides', fontsize=plot_labels_fontsize)

    return None

def find_current_intersection(pay_no: pd.Series, pay_yes: pd.Series, report: bool = False) -> float:
    """Finds the intersection between two Gaussian functions.

    The results are restricted to the interval between the lowest minimum and the highest maximum of the two Series.

    Args:
        pay_no: Pay data for negative outcomes
        pay_yes: Pay data for positive outcomes
        report: Whether to print the intersection value

    This function has been adapted from the code found at the following link:

    Returns:
        Float: intersection value in dollars
    """

    pay_no_mean = pay_no.mean()
    pay_yes_mean = pay_yes.mean()
    pay_no_std = pay_no.std()
    pay_yes_std = pay_yes.std()

    minimum = min(pay_no.min(), pay_yes.min())
    maximum = max(pay_no.max(), pay_yes.max())

    coef_1 = 1 / (2 * pay_no_std ** 2) - 1 / (2 * pay_yes_std ** 2)
    coef_2 = pay_yes_mean / (pay_yes_std ** 2) - pay_no_mean / (pay_no_std ** 2)
    coef_3 = pay_no_mean ** 2 / (2 * pay_no_std ** 2) - pay_yes_mean ** 2 / \
             (2 * pay_yes_std ** 2) - np.log(pay_yes_std / pay_no_std)

    intersections = [np.around(root, 2) for root in np.roots([coef_1, coef_2, coef_3]) if minimum <= root <= maximum]

    if len(intersections) != 1:
        raise ValueError(f"Expected exactly one intersection between $ {minimum} and $ {maximum}, "
                         f"found {len(intersections)}")

    output = intersections[0]

    if report:
        print(f"Current intersection: $ {output}\n")

    return output
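The quadratic used in `find_current_intersection` comes from setting the logarithms of the two fitted normal densities equal and collecting terms in x. A minimal check of that derivation is sketched below. The means and standard deviations are hypothetical values chosen for illustration, not figures from the dataset, and `gaussian_pdf` is a helper defined only for this example.

```python
import numpy as np

def gaussian_pdf(x, mean, std):
    """Normal density, written out so the example needs only NumPy."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))

# Hypothetical parameters for the 'no' and 'yes' pay distributions
mean_no, std_no = 20.0, 5.0
mean_yes, std_yes = 30.0, 6.0

# Same coefficients as in find_current_intersection: a*x^2 + b*x + c = 0
a = 1 / (2 * std_no ** 2) - 1 / (2 * std_yes ** 2)
b = mean_yes / std_yes ** 2 - mean_no / std_no ** 2
c = (mean_no ** 2 / (2 * std_no ** 2) - mean_yes ** 2 / (2 * std_yes ** 2)
     - np.log(std_yes / std_no))

roots = np.roots([a, b, c])
real_roots = roots[np.isreal(roots)].real  # discard complex roots, if any

# Keep only the crossing that lies between the two means
between_means = [float(r) for r in real_roots if mean_no <= r <= mean_yes]
x_star = between_means[0]

print(f"Intersection: $ {x_star:.2f}")
print(f"Density 'no':  {gaussian_pdf(x_star, mean_no, std_no):.6f}")
print(f"Density 'yes': {gaussian_pdf(x_star, mean_yes, std_yes):.6f}")
```

The two printed densities agree at the root, which is what makes the intersection a natural candidate for the pay level separating rejected from accepted rides.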

def plot_pdf_cdf(pay_no: pd.Series, pay_yes: pd.Series, intersection: float) -> None:
    """Calculates normal distribution from data, then plots KDE & Normal PDF, and ECDF & Normal CDF.

    Args:
        pay_no: Pay data for negative outcomes
        pay_yes: Pay data for positive outcomes
        intersection: Current point of intersection between the two distributions
    """

    # Generate normal distributions based on data
    normal_dist_from_pay_no = stats.norm(loc=pay_no.mean(), scale=pay_no.std())
    normal_dist_from_pay_yes = stats.norm(loc=pay_yes.mean(), scale=pay_yes.std())

    # Calculate x & y for Empirical CDF
    pay_no_ecdf_x, pay_no_ecdf_y = ecdf(pay_no)
    pay_yes_ecdf_x, pay_yes_ecdf_y = ecdf(pay_yes)

    # KDE Plots & Normal PDF, ECDF & Normal CDF
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(12.8, 4.8))

    sns.lineplot(ax=axes[0], x=pay_no, y=normal_dist_from_pay_no.pdf(pay_no), linestyle='--', label='No (normal)',
                 color='C0', linewidth=1)
    sns.lineplot(ax=axes[0], x=pay_yes, y=normal_dist_from_pay_yes.pdf(pay_yes), linestyle='--', label='Yes (normal)',
                 color='C1', linewidth=1)
    sns.kdeplot(pay_no, ax=axes[0], fill=True, label='No', color='C0')
    sns.kdeplot(pay_yes, ax=axes[0], fill=True, label='Yes', color='C1')
    axes[0].vlines(x=intersection, ymin=0, ymax=normal_dist_from_pay_no.pdf(intersection),
                   linestyles=':', colors='black', label=f"Int.: $ {intersection}", linewidth=1)

    sns.lineplot(ax=axes[1], x=pay_no, y=normal_dist_from_pay_no.cdf(pay_no), label='No (normal)', color='black',
                 linewidth=1, linestyle='--')
    sns.lineplot(ax=axes[1], x=pay_yes, y=normal_dist_from_pay_yes.cdf(pay_yes), label='Yes (normal)', color='black',
                 linewidth=1, linestyle='-')
    sns.lineplot(ax=axes[1], x=pay_no_ecdf_x, y=pay_no_ecdf_y, linestyle='', marker='.', label='No', color='C0',
                 markeredgewidth=0, markersize=8, alpha=0.5)
    sns.lineplot(ax=axes[1], x=pay_yes_ecdf_x, y=pay_yes_ecdf_y, linestyle='', marker='.', label='Yes', color='C1',
                 markeredgewidth=0, markersize=8, alpha=0.5)
    axes[1].vlines(x=intersection, ymin=0, ymax=1, linestyles=':', colors='black', label=f"Int.: $ {intersection}", linewidth=1)

    plt.suptitle('Ride Acceptance', fontsize=14, y=0.95)

    axes[0].set_title('KDE Plots & Normal PDF', fontsize=plot_title_fontsize)
    axes[0].set_xlabel('Pay $', fontsize=plot_labels_fontsize)
    axes[0].set_ylabel('Density', fontsize=plot_labels_fontsize)

    axes[1].set_title('ECDF & Normal CDF', fontsize=plot_title_fontsize)
    axes[1].set_xlabel('Pay $', fontsize=plot_labels_fontsize)
    axes[1].set_ylabel('Fraction of data', fontsize=plot_labels_fontsize)

    return None

def statistical_overview(series: pd.Series, label: str = 'Series', of: float = 1.5, evf: float = 3.0,
                         summary: bool = False, full_report: bool = False) -> tuple:
    """Calculates outlier fences and extreme value fences for a numeric Pandas Series.

    Optionally displays a comprehensive statistical description of the data.

    Args:
        series: A Pandas Series object
        label: A label for the Series
        of: Outlier Factor (for inner fences)
        evf: Extreme Value Factor (for outer fences)
        summary: Whether to display only outliers count and extreme values count
        full_report: Whether to display the full statistical description

    Returns:
        Tuple: outliers count, extreme values count,
               [lower outer fence, lower inner fence, upper inner fence, upper outer fence]
    """

    total_values_inc_nan = series.size
    total_values_exc_nan = series.count()

    q1 = np.around(series.quantile(0.25), 2)
    q3 = np.around(series.quantile(0.75), 2)
    iqr = np.around(q3 - q1)
    lower_outer_fence = np.around(q1 - evf * iqr, 2)
    lower_inner_fence = np.around(q1 - of * iqr, 2)
    upper_inner_fence = np.around(q3 + of * iqr, 2)
    upper_outer_fence = np.around(q3 + evf * iqr, 2)

    outliers_count = series[((lower_outer_fence < series) & (series <= lower_inner_fence)) |
                            ((upper_inner_fence < series) & (series <= upper_outer_fence))].count()
    non_outliers_count = total_values_inc_nan - outliers_count

    extreme_values_count = series[(series < lower_outer_fence) | (series > upper_outer_fence)].count()
    non_extreme_values_count = total_values_inc_nan - extreme_values_count

    if full_report:
        print(f"SERIES: {label}\n")

        print(f"Size:   {total_values_inc_nan}")
        print(f"Count:  {total_values_exc_nan}")
        print(f"NaN:    {total_values_inc_nan - total_values_exc_nan}")
        print(f"Min:    {np.around(np.min(series), 2)}")
        print(f"Max:    {np.around(np.max(series), 2)}")
        print(f"Mean:   {np.around(np.nanmean(series), 2)}")
        print(f"Std:    {np.around(np.nanstd(series), 2)}")
        print(f"Median: {np.around(np.nanmedian(series), 2)}")
        print(f"Q1:     {q1}")
        print(f"Q3:     {q3}")
        print(f"IQR:    {iqr}\n")

        print(f"Outlier Factor:       {of}")
        print(f"Extreme Value Factor: {evf}\n")

        print(f"Lower outer fence:    {lower_outer_fence}")
        print(f"Lower inner fence:    {lower_inner_fence}")
        print(f"Upper inner fence:    {upper_inner_fence}")
        print(f"Upper outer fence:    {upper_outer_fence}\n")

        print(f"Outliers:             {outliers_count}")
        print(f"Non-outliers:         {non_outliers_count}")
        print(f"Extreme values:       {extreme_values_count}")
        print(f"Non-extreme values:   {non_extreme_values_count}\n")

        print(f"Unbiased skew:        {np.around(series.skew())}\n")

    if summary:
        print(f"SERIES:         {label}")
        print(f"Outliers:       {outliers_count}")
        print(f"Extreme values: {extreme_values_count}\n")

    return outliers_count, extreme_values_count, [lower_outer_fence, lower_inner_fence, upper_inner_fence, upper_outer_fence]
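The fence logic in `statistical_overview` follows the usual interquartile-range convention: inner fences at 1.5 x IQR flag outliers, outer fences at 3.0 x IQR flag extreme values. A minimal sketch on made-up pay values (hypothetical, not taken from the dataset) shows how a value lands in each bucket. Unlike the function above, this sketch does not round the quartiles or the IQR.

```python
import numpy as np

# Hypothetical pay values: one mild outlier (35) and one extreme value (75)
values = np.array([18.0, 20.0, 21.0, 22.0, 23.0, 24.0, 25.0, 26.0, 35.0, 75.0])

q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Inner fences (factor 1.5) and outer fences (factor 3.0)
inner_lo, inner_hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outer_lo, outer_hi = q1 - 3.0 * iqr, q3 + 3.0 * iqr

# Outliers sit between the inner and outer fences; extreme values lie beyond the outer fences
mild = values[((values < inner_lo) & (values >= outer_lo)) |
              ((values > inner_hi) & (values <= outer_hi))]
extreme = values[(values < outer_lo) | (values > outer_hi)]

print(f"Inner fences: {inner_lo} to {inner_hi}")
print(f"Outer fences: {outer_lo} to {outer_hi}")
print(f"Outliers: {mild.tolist()}, extreme values: {extreme.tolist()}")
```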

def remove_outliers_and_extreme_values(df: pd.DataFrame, fences_no: list, fences_yes: list, report: bool = False) -> pd.DataFrame:
    """Removes outliers and extreme values from a Pandas DataFrame.

    Args:
        df: Source DataFrame
        fences_no: Fence values for the negative-outcome data
        fences_yes: Fence values for the positive-outcome data
        report: Whether to print confirmation that DataFrame has been cleaned

    Returns:
        Pandas DataFrame: clean DataFrame, without outliers or extreme values
    """

    mask = ((df['ACCEPTED'] == 0) & (df['PAY'] > fences_no[1]) & (df['PAY'] < fences_no[2])) | \
           ((df['ACCEPTED'] == 1) & (df['PAY'] > fences_yes[1]) & (df['PAY'] < fences_yes[2]))

    df = df.loc[mask]

    if report:
        print(f"DataFrame has been cleaned: {np.invert(mask).sum()} values removed\n")

    return df

def ecdf(data: np.ndarray):
    """Compute ECDF for a one-dimensional array of values."""

    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n

    return x, y
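A short usage sketch for `ecdf` follows. The function is restated so the block runs on its own, and the four pay values are hypothetical. Each y value is the fraction of observations at or below the matching x value, which is what the ECDF panel in `plot_pdf_cdf` draws.

```python
import numpy as np

def ecdf(data):
    """Compute ECDF for a one-dimensional array of values (restated from above)."""
    n = len(data)
    x = np.sort(data)
    y = np.arange(1, n + 1) / n
    return x, y

# Hypothetical pay values, in dollars
pay = np.array([31.5, 18.0, 25.0, 22.5])
x, y = ecdf(pay)

# Fraction of observations at or below $25
share_at_or_below_25 = y[np.searchsorted(x, 25.0, side="right") - 1]

print(x.tolist())
print(y.tolist())
print(share_at_or_below_25)
```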

def generate_train_val_test(df: pd.DataFrame, test_fraction: float = 0.3, seed: int = 42) -> tuple:
    """Generates train & validation subset, and test subset.

    Args:
        df: DataFrame to split
        test_fraction: Fraction of the DataFrame to randomly sample, for the test set
        seed: Controls reproducibility

    Returns:
        Tuple of NumPy arrays: X_train_val, X_test, y_train_val, y_test
    """

    seed = seed

    df_test = df.sample(frac=test_fraction, random_state=seed, axis=0)
    X_test = df_test['PAY'].to_numpy().reshape(-1, 1)
    y_test = df_test['ACCEPTED'].to_numpy()

    df_train_val = df[~df.index.isin(df_test.index)]
    X_train_val = df_train_val['PAY'].to_numpy().reshape(-1, 1)
    y_train_val = df_train_val['ACCEPTED'].to_numpy()

    return X_train_val, X_test, y_train_val, y_test

def tune_train_test_logreg_svm(x_train_val: np.ndarray, x_test: np.ndarray, y_train_val: np.ndarray, y_test: np.ndarray,
                               cv: int = 10, seed: int = 42, full_reports: bool = False, plots: bool = False) -> list:
    """Tunes and evaluates a LogisticRegression model and a SupportVectorClassifier model.

    Args:
        x_train_val: feature - combined train and validation
        x_test: feature - test
        y_train_val: target - combined train and validation
        y_test: target - test
        cv: Number of folds for `cross_val_score` and `GridSearchCV`
        seed: Controls reproducibility
        full_reports: Whether to print accuracy scores, confusion matrix and classification report for each model
        plots: Whether to plot confusion matrix and ROC curve

    Returns:
        List: [best Log Reg model, best SVM model]
    """

    seed = seed

    models = {'Log Reg': LogisticRegression(random_state=seed, max_iter=2000),
              'SVC': SVC(probability=True, random_state=seed)}

    param_grids = {'Log Reg': {'C': [0.01, 0.1, 1, 10, 100],
                               'solver': ['liblinear', 'saga'],
                               'penalty': ['l1', 'l2']},
                   'SVC': {'C': [0.01, 0.1, 1, 10],
                           'gamma': [0.001, 0.01, 0.1, 1],
                           'kernel': ['sigmoid', 'rbf']}}

    best_models = []

    for name, mod in models.items():
        # initialize model, cross validate on train & validation subset
        model = mod
        cv_model = cross_val_score(estimator=model, X=x_train_val, y=y_train_val, cv=cv, n_jobs=-1)
        accuracy_train_val_untuned = np.around(np.mean(cv_model) * 100, 2)

        # tune hyperparams and cross validate on train & validation subset
        grid_cv_model = GridSearchCV(estimator=model, param_grid=param_grids[name], cv=cv, n_jobs=-1)
        grid_cv_model.fit(x_train_val, y_train_val)
        accuracy_train_val_tuned = np.around(grid_cv_model.best_score_ * 100, 2)

        # get best tuned model and best params, and keep the model for the returned list
        best_model = grid_cv_model.best_estimator_
        best_params = grid_cv_model.best_params_
        best_models.append(best_model)

        # cross validate on test subset
        cv_model_test = cross_val_score(estimator=best_model, X=x_test, y=y_test, cv=cv, n_jobs=-1)
        accuracy_test_tuned = np.around(np.mean(cv_model_test) * 100, 2)

        # confusion matrix & classification report
        y_pred = best_model.predict(x_test)
        conf_matrix = confusion_matrix(y_test, y_pred)
        class_report = classification_report(y_test, y_pred)

        if full_reports:
            print(f"MODEL: {name}\n")

            print(f"• Mean accuracy - train & val set (untuned): {accuracy_train_val_untuned}%")
            print(f"• Best accuracy - train & val set (tuned):   {accuracy_train_val_tuned}%")
            print(f"• Mean accuracy - test set (tuned):          {accuracy_test_tuned}%\n")

            print(f"• Best parameters: {best_params}\n")

            print(f"• Confusion Matrix:\n{conf_matrix}\n")

            print(f"• Classification Report:\n{class_report}\n")

        if plots:

            # Get probabilities for positive outcome and calculate AUC Score
            y_pred_proba = best_model.predict_proba(x_test)[:, 1]
            auc_score = np.around(roc_auc_score(y_test, y_pred_proba) * 100, 2)

            # Get False Positive Rate and True Positive Rate (no need for Thresholds)
            fpr, tpr, _ = roc_curve(y_test, y_pred_proba, drop_intermediate=True)

            # Confusion matrix, ROC Curve
            fig, axes = plt.subplots(nrows=1, ncols=2)

            sns.heatmap(ax=axes[0], data=conf_matrix, fmt='g', cmap='cividis', square=True, linewidths=2,
                        annot=True, annot_kws={'size': plot_legend_ticklabels_font_size}, cbar_kws={'shrink':0.8})
            cbar = axes[0].collections[0].colorbar

            plt.suptitle(f"{name}", fontsize=plot_suptitile_font_size, y=0.95)

            axes[0].set_title('Confusion Matrix', fontsize=plot_title_fontsize)
            axes[0].set_xlabel('Predicted outcome', fontsize=plot_labels_fontsize)
            axes[0].set_ylabel('True outcome', fontsize=plot_labels_fontsize)
            axes[0].xaxis.set_ticklabels(['No', 'Yes'], fontsize=plot_legend_ticklabels_font_size)
            axes[0].yaxis.set_ticklabels(['No', 'Yes'], fontsize=plot_legend_ticklabels_font_size, rotation=0)

            axes[1].plot([0, 1], [0, 1], linestyle='--', color='black', alpha=0.75, label='Baseline')
            axes[1].plot(fpr, tpr, color='green', linewidth=5, solid_capstyle='round',
                         marker='.', markersize=7, markerfacecolor='white', label=f"{name}")

            axes[1].set_title('ROC Curve', fontsize=plot_title_fontsize)
            axes[1].set_xlabel('False Positive Rate', fontsize=plot_labels_fontsize)
            axes[1].set_ylabel('True Positive Rate', fontsize=plot_labels_fontsize)
            axes[1].annotate(text=f"• Accuracy: {accuracy_test_tuned}%\n• AUC Score: {auc_score}",
                             xy=(0.25, 0.675), alpha=0.5, fontsize=plot_legend_ticklabels_font_size)
            axes[1].legend(fontsize=plot_legend_ticklabels_font_size, loc='lower right')


    return best_models

def get_best_model(logreg_model: object, svm_model: object, intersection: float, segment: int = 1, seed: int = 42,
                   report: bool = False, show_performance_df: bool = False, show_trimmed: bool = True) -> object:
    """Finds the best model between LogisticRegression and SupportVectorClassifier by using a custom metric.

    • Custom metric: the distance in dollars between each model's first positive outcome and the point of
                     intersection between the curves, using synthetic data. The shortest distance wins.

    Args:
        logreg_model: Tuned and trained instance of the LogisticRegression() class.
        svm_model: Tuned and trained instance of the SVC() class.
        intersection: The point of intersection between the Gaussian functions
        segment: Interval in dollars that establishes lower/upper bound around the point of intersection
        seed: Controls reproducibility
        report: Whether to display custom metric for each model with indication of best model
        show_performance_df: Whether to display the performance DataFrame
        show_trimmed: Whether to display a trimmed version of the performance DataFrame, centered around the first outcomes

    Returns:
        Model (LogisticRegression or SupportVectorClassifier)
    """

    seed = seed

    start = np.around(intersection - segment / 2, 2)
    stop = np.around(intersection + segment / 2, 2)
    step = 0.01
    num_values = np.ceil((stop - start) / step)

    # Evaluate model performance around point of intersection on synthetic data, at one-cent ($ 0.01) level
    synthetic_data = np.arange(start, stop, step).reshape(-1, 1)

    pred_logreg = logreg_model.predict(synthetic_data)
    pred_svm = svm_model.predict(synthetic_data)

    performance_df = pd.DataFrame({'Synthetic Data': synthetic_data.ravel(),
                                   'Log Reg': pred_logreg.ravel(),
                                   'SVM': pred_svm.ravel()})

    if performance_df['Log Reg'].sum() == 0 and performance_df['SVM'].sum() == 0:
        raise ValueError(f"WARNING: no positive values found between $ {start} and $ {stop}. "
                         f"Check 'segment' value (currently: {segment})")

    if performance_df['Log Reg'].sum() == num_values and performance_df['SVM'].sum() == num_values:
        raise ValueError(f"WARNING: only positive values found between $ {start} and $ {stop}. "
                         f"Raise 'segment' value (currently: {segment})")

    pay_first_positive_outcome_logreg = np.around(performance_df.loc[performance_df['Log Reg'] == 1, 'Synthetic Data'].iloc[0], 2)
    pay_first_positive_outcome_svm = np.around(performance_df.loc[performance_df['SVM'] == 1, 'Synthetic Data'].iloc[0], 2)

    # Calculate distance of first positive outcome from point of intersection
    delta_logreg = np.around(abs(intersection - pay_first_positive_outcome_logreg), 2)
    delta_svm = np.around(abs(intersection - pay_first_positive_outcome_svm), 2)

    # Select best model to use
    model = svm_model if delta_logreg >= delta_svm else logreg_model

    if report:
        print(f"Distance of first positive outcome from point of intersection ({intersection}):")
        print(f"    • Log Reg: {delta_logreg} ({pay_first_positive_outcome_logreg})  {'<- best model' if model is logreg_model else ''}")
        print(f"    • SVM:     {delta_svm} ({pay_first_positive_outcome_svm})  {'<- best model' if model is svm_model else ''}\n")

    if show_performance_df:
        if show_trimmed:
            index_first_positive_outcome_logreg = performance_df.loc[performance_df['Log Reg'] == 1, 'Log Reg'].idxmin()
            index_first_positive_outcome_svm = performance_df.loc[performance_df['SVM'] == 1, 'SVM'].idxmin()
            display(performance_df.iloc[index_first_positive_outcome_logreg: index_first_positive_outcome_svm + 1])

    return model

def generate_rides_dictionary(lam: int = 1, size=None) -> dict:
    """Generates a dictionary of rides (as keys) and number of customers (as values) with the Poisson distribution.

    Example for lam=1 and size=100: {0: 39, 1: 37, 2: 17, 3: 4, 4: 3}

    Args:
        lam: Lambda value, expectation of interval
        size: Samples to be drawn from the distribution

    Returns:
        A dictionary: the keys are the number of events (rides) and the values are the frequency (customers).
    """

    requested_rides = np.random.poisson(lam=lam, size=size)
    num_of_rides, num_of_customers = np.unique(requested_rides, return_counts=True)

    return dict(zip(num_of_rides, num_of_customers))
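The counting approach in `generate_rides_dictionary` can be compared against the theoretical Poisson probabilities. The sketch below is an illustration under stated assumptions: it uses a seeded `np.random.default_rng` generator instead of the global `np.random.poisson` call above, purely so the run is reproducible, and `poisson_pmf` is a helper defined for this example.

```python
import numpy as np
from math import exp, factorial

def poisson_pmf(k: int, lam: float = 1.0) -> float:
    """Probability of exactly k events for a Poisson distribution with mean lam."""
    return exp(-lam) * lam ** k / factorial(k)

rng = np.random.default_rng(42)
requested = rng.poisson(lam=1, size=100_000)   # first-month requests, lambda = 1

# Same counting step as generate_rides_dictionary, converted to relative frequencies
rides, customers = np.unique(requested, return_counts=True)
empirical = dict(zip(rides.tolist(), (customers / requested.size).tolist()))

for k in range(4):
    print(f"{k} rides: empirical {empirical.get(k, 0.0):.3f}, theoretical {poisson_pmf(k):.3f}")
```

With lambda = 1, the probability of requesting zero rides is exp(-1), roughly 0.368. Under the rules of the case study, that share of newly acquired riders leaves after the first month regardless of what drivers are paid, which bounds how much retention incentives can achieve.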

5. Ingest Data

# Ingest data, create master DataFrame (`master_df`) and working DataFrame `df`
csv_path = 'driverAcceptanceData.csv'
master_df = pd.read_csv(csv_path, index_col=0)
df = master_df.copy()

# Display a few rows of the DataFrame
display(df.head())

# Display stats
display(df.describe())

# Check for missing values
print(f"Missing values:\n{df.isna().sum()}")

# Subset 'PAY' according to 'ACCEPTED': 0 -> `pay_no`, 1 -> `pay_yes`
pay_no, pay_yes = subset_no_yes(df=df)

6. EDA & Outliers

Exploratory Data Analysis (EDA) is a fundamental step in every Machine Learning pipeline.

It is invaluable for understanding the dataset and for discovering its underlying patterns and statistical properties.

6.1. Histograms
