Hospital Readmissions - A Short Bayesian Approach
  • AI Chat
  • Code
  • Report
  • Beta
    Spinner

    Hospital Readmissions - A Short Bayesian Approach

    📖 Summary

    Based on the given dataset comprising patient readmission records, I set to address three key questions:

    1. What is the most common primary diagnosis by age group? By simple tabulating occurrences, I found that for all age groups circulatory problems where the most common primary diagnosis, except for the 40-50 group where the most common one is of type "Other".

    2. Some doctors believe diabetes might play a central role in readmission. Explore the effect of a diabetes diagnosis on readmission rates Using a multivariate Bayesian generalized linear model (GLM), I found that with >99% probability, diabetes as a primary diagnosis is associated with an increased likelihood of patient readmission.

    3. On what groups of patients should the hospital focus their follow-up efforts to better monitor patients with a high probability of readmission? Based on the aforementioned GLM, by taking into account all positive coefficients I would suggest focusing on patients that present numerous emergency, inpatient or outpatient visits from the previous year and that were prescribed a diabetes medication.

    🚀 Install dependecies, load dataset

    Let us begin by installing the Python Bayesian framework pymc, then importing the dataset along with all relevant libraries.

    # Install pymc
    !pip3 install pymc -q --no-warn-script-location
    
    # Imports
    import numpy as np
    import pandas as pd
    import pymc as pm
    import seaborn as sns
    import matplotlib.pyplot as plt
    
    # Load dataset
    hosp_df = pd.read_csv('data/hospital_readmissions.csv')

    We are now ready to take on the analysis to address each of the three questions.

    ⚙️ Analysis

    1. What is the most common primary diagnosis by age group?

    This is a simple question that only requires reshaping a long frequency table, then determining the highest frequency per age group.

    # Extract relative frequency of primary diagnoses by age group
    diag1_age = hosp_df.groupby("age")["diag_1"].value_counts(normalize=True).unstack()
    # Display single most common primary diagnosis per group
    print(diag1_age.idxmax(axis=1))
    
    # Create stacked barplot
    diag1_age.plot.bar(stacked=True, colormap="Set2", ylabel="Relative frequency")
    # Add legend
    plt.legend(loc='center left', bbox_to_anchor=(1, 0.5), title="Primary diagnosis")

    As we can see, in all age groups Circulatory is the most common primary diagnosis, except for the 40-50 group, where Other is the most common one. The stacked barplot provides additional insights, for example showing how the proportions of diabetes and circulatory diagnoses decrease and increase with age, respectively.

    2. Some doctors believe diabetes might play a central role in readmission. Explore the effect of a diabetes diagnosis on readmission rates

    After determining which patients had Diabetes as a primary diagnsosis, we could next conduct a (Chi-square) test of indepencence; however, we would need to resolve differences with respect to the other variables, particularly age. Therefore, I propose instead fitting a Bayesian GLM (logistic regression) based on seven variables from the dataset to predict readmitted and ultimately ascertain the impact of diabetes primary diagnosis.

    Before formalizing the model structure, let us first handle feature encoding, i.e. encode binary and ordinal categorial variables, and capture diabetes as primary diagnosis.

    # Flag diabetes diagnosis
    hosp_df["diabetes_diag"] = (hosp_df["diag_1"] == "Diabetes").astype(int)
    # Encode age as ordinal categorical
    map_age = {
        "[40-50)": 0,
        "[50-60)": 1,
        "[60-70)": 2,
        "[70-80)": 3,
        "[80-90)": 4,
        "[90-100)": 5
    }
    hosp_df["age_ord"] = hosp_df["age"].map(map_age)
    # Binarize `readmitted`, `change` and `diabetes_med_bin`
    hosp_df["readmitted_bin"] = hosp_df["readmitted"].map({"yes": 1, "no": 0})
    hosp_df["change_bin"] = hosp_df["change"].map({"yes": 1, "no": 0})
    hosp_df["diabetes_med_bin"] = hosp_df["diabetes_med"].map({"yes": 1, "no": 0})
    # Add intercept to df
    hosp_df["intercept"] = 1
    
    # Select predictors
    predictors =  [
        "age_ord",
        "time_in_hospital",
        "n_procedures",
        "n_lab_procedures",
        "n_medications",
        "n_outpatient",
        "n_inpatient",
        "n_emergency",
        "diabetes_diag",
        "change_bin",
        "diabetes_med_bin" 
    ]

    Let us now formalize the model structure. Our GLM will be essentially built on two prior distributions, one for the intercept and - multiple instances of - one for the predictor coefficients

    as you can see these priors are normal distributions centered around zero, thus reflecting our ignorance over i) the value of the intercept relative to a 50% readmission probability and ii) the impact of the individual predictors on readmission probability. Then we can define the logit of the response variable as

    Finally, we use the logistic function to convert the logit to the estimated readmission probability , which we assume to parameterize a Bernoulli distribution that explains the observed values of readmitted

    We can now fit the model using pymc and sample the posterior samples of the parameters - this may take about 10min ⏳

    # Initialize model
    mod = pm.Model()
    
    # Fit model and sample posterior
    with mod:
        # Intercept prior
        alpha = pm.Normal("alpha", mu=0, sigma=10)
        # Beta coefficient priors
        beta = pm.Normal("beta", mu=0, sigma=5, shape=len(predictors))
        # Compute logit
        logit = alpha + sum([beta[i]*hosp_df[pred] for i, pred in enumerate(predictors)])
        # Apply logistic function to extract probability p
        p = 1 / (1 + np.exp(-logit))
        # Condition likelihood of p based on Bernoulli distribution with observed values `readmitted_bin`
        Y_obs = pm.Bernoulli("Y_obs", p=p, observed=hosp_df["readmitted_bin"])
        # Draw 1000 posterior samples
        trace = pm.sample()
        
    # Examine chains
    pm.plot_trace(trace)

    We see that the chains converge nicely and are rather stable. Next we explore the distributions of those same samples after restructuring the NumPy arrays appropriately, and building a dataframe identifying the individual coefficients. In addition, we will determine the proportion of sampled values from the coefficient parameterizing diabetes_diag that are greater than zero, which will tell us the probability - based on our model structure and assumptions - that a diabetes primary diagnosis increases the chances of patient readmission.

    # Extract posterior samples of beta
    post_alpha = trace.posterior['alpha'].values.reshape(-1, 1)
    post_beta = trace.posterior['beta'].values.reshape(-1, len(predictors))
    post_coeffs = pd.DataFrame(np.concatenate([post_alpha, post_beta], axis=1), columns=["alpha"] + predictors)
    
    # Violin plot with seaborn
    ax = sns.violinplot(data=post_coeffs, orient='h')
    ax.set(xlabel=r'$\beta$')
    plt.axvline(0., linestyle='--', color='black', alpha=.25)
    
    # Print probability that beta from diabetes_diag > 0
    prob_pos = (post_coeffs["diabetes_diag"] > 0).mean()
    print("Probability diabetes diagnosis increases readmission chances: {:.3f}".format(prob_pos))

    Lots of interesting insights: firstly, we observe with >99% confidence that diabetes diagnosis has a positive effect on readmission. This should come as no surprise since this dataset was constructed around individuals somehow diagnosed with diabetes (cf. source). Then, as one would expect, we see that the eldest patients have greater chances of readmission. We also see how the parameter adjusted itself to conform to the proportion of readmitted patients in the dataset, which is smaller than 50%.

    3. On what groups of patients should the hospital focus their follow-up efforts to better monitor patients with a high probability of readmission?

    To answer this question we can again use the GLM posterior samples; by simply taking into account the posterior distributions of the individual coefficients we learn that following up on patients that present numerous emergency, inpatient or outpatient visits the year before and that were prescribed a diabetes medication would lead to a better monitoring of those at higher risk of readmission.

    💉 Final Notes

    Concluding, we answered the first question by simply computing the relative frequencies of the different primary diagnoses per age group. Then, to address the last two questions we capitalized on the relatively small number of features in the given dataset to devise a Bayesian GLM approach, simultaneously determining the impact of a diabetes diagnosis on readmission and the subset of patients that are at a higher risk of readmission and thus deserve close follow-up.

    Further work could provide additional clues and insights. I would like to point out that the proposed GLM model could be based on a different structure and a set of different covariates.

    If you enjoyed this short Bayesian excursion, please give it a thumbs up 👍 All feedback is welcome!