Tomas Luna/

Understanding Hospital Readmission

Table of contents

  1. Introduction
  2. Executive summary
  3. Data and Methods
  4. Exploratory Data Analysis (EDA)
  5. Further consideration
  6. Conclusions and Recommendations
  7. Annex

1. Introduction

Hospital readmission is a problem in healthcare where patients are discharged from the hospital and then readmitted within a certain period of time, often within 30 days of their initial discharge. This is a costly and preventable problem that can negatively impact patients' health outcomes and quality of life. Causes of readmissions include inadequate care during initial hospitalization and poor discharge planning. Patients with chronic conditions, such as heart failure, diabetes, and respiratory disease, are at a particularly high risk of readmission. To reduce readmissions, interventions such as improved care coordination, enhanced patient education, and medication management are implemented. Machine learning and artificial intelligence (AI) algorithms are also used to predict which patients are at the highest risk of readmission and enable healthcare providers to intervene proactively to prevent readmissions.

2. Executive summary

Our consulting company has been tasked with helping a hospital group improve their understanding of patient readmissions. We have been given access to ten years' worth of data on patients who were readmitted to the hospital after being discharged. Our goal is to assess whether initial diagnoses, the number of procedures, or other variables could provide insight into the probability of readmission, and to identify those patients who are at a higher risk of readmission so that the hospital can focus their follow-up calls and attention accordingly.

To achieve these objectives, we have prepared a report covering the following:

  1. Analysis of the most common primary diagnosis by age group.
  2. Exploration of the impact of a diabetes diagnosis on readmission rates.
  3. Identification of patient groups that the hospital should focus their follow-up efforts on to better monitor patients with a high probability of readmission.

The report begins with a brief overview and cleaning of the data, followed by an explanation of the methodologies employed to extract the most valuable insights. The exploratory data analysis has indicated that:

  1. 'Circulatory' is the most frequent primary diagnosis in every age group except 40-50, where 'Other' is the most common diagnosis.

  2. Investigation of the readmission rates concerning primary, secondary, and tertiary diagnoses reveals that:

    • Patients diagnosed with diabetes have a higher readmission rate than those diagnosed with other conditions, 48% for patients with diabetes against 44% for patients diagnosed with 'Other' diseases.
    • A Chi-square test of independence between primary diagnosis and readmission showed a statistically significant association between a primary diagnosis of diabetes and hospital readmission.
  3. The hospital should concentrate its follow-up efforts on patient groups with a high likelihood of readmission, including:

    • patients in the age range of 50 to 90 years old
    • patients diagnosed with diabetic, circulatory and respiratory diseases
    • According to the machine learning models developed during the analysis, the features that have the most significant impact on the readmission rate include:
      • the number of outpatient visits in the year before a hospital stay
      • the number of inpatient visits in the year before a hospital stay
      • the number of medications administered during the hospital stay
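
One way such a feature ranking can be obtained is sketched below. This is a minimal illustration on synthetic data, not the models built for this report: the column values, the "noise" column, the target construction, and the random forest settings are all assumptions made for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical synthetic data that only borrows column names from the dataset.
rng = np.random.default_rng(7)
n = 600
X = pd.DataFrame({
    "n_outpatient": rng.poisson(0.4, size=n),
    "n_inpatient": rng.poisson(0.6, size=n),
    "n_medications": rng.poisson(16, size=n),
    "noise": rng.normal(size=n),  # hypothetical column with no link to the target
})
# Synthetic target driven mainly by n_inpatient (an assumption for this demo only).
logit = 1.2 * X["n_inpatient"] + 0.3 * X["n_outpatient"] - 1.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Estimate out-of-sample accuracy with 5-fold cross-validation, then fit on all rows.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
cv_accuracy = cross_val_score(forest, X, y, cv=5, scoring="accuracy")
forest.fit(X, y)

# Impurity-based importances sum to 1; larger values mean more influence on the splits.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(f"mean CV accuracy: {cv_accuracy.mean():.3f}")
print(importances)
```

Impurity-based importances tend to favour continuous or high-cardinality columns, so a ranking like this is best cross-checked with another measure, such as permutation importance, before acting on it.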

Key Recommendations

  1. Further analysis should be conducted on patient groups identified as having a high probability of readmission to determine the specific factors contributing to their readmission rates.
  2. The hospital should implement targeted intervention programs for patients in the identified age group and for those diagnosed with diabetic, circulatory, or respiratory diseases to reduce their readmission rates.
  3. Evaluate the performance of different machine learning models and identify opportunities for model improvement.
  4. Analyze the impact of different hospital policies and practices, such as discharge planning and post-discharge follow-up, on readmission rates.

3. Data and Methods

As a company, we have access to a dataset containing ten years of patient information (source):

Information in the Dataset

  • "age" - age bracket of the patient
  • "time_in_hospital" - days (from 1 to 14)
  • "n_procedures" - number of procedures performed during the hospital stay
  • "n_lab_procedures" - number of laboratory procedures performed during the hospital stay
  • "n_medications" - number of medications administered during the hospital stay
  • "n_outpatient" - number of outpatient visits in the year before a hospital stay
  • "n_inpatient" - number of inpatient visits in the year before the hospital stay
  • "n_emergency" - number of visits to the emergency room in the year before the hospital stay
  • "medical_specialty" - the specialty of the admitting physician
  • "diag_1" - primary diagnosis (Circulatory, Respiratory, Digestive, etc.)
  • "diag_2" - secondary diagnosis
  • "diag_3" - additional secondary diagnosis
  • "glucose_test" - whether the glucose serum came out as high (> 200), normal, or not performed
  • "A1Ctest" - whether the A1C level of the patient came out as high (> 7%), normal, or not performed
  • "change" - whether there was a change in the diabetes medication ('yes' or 'no')
  • "diabetes_med" - whether a diabetes medication was prescribed ('yes' or 'no')
  • "readmitted" - if the patient was readmitted at the hospital ('yes' or 'no')

Remarks on the data:
The dataframe contains 25000 rows and 17 columns, with no missing values or duplicate rows. Most of the numeric columns exhibit positive skewness, likely due to a significant number of outliers, which totalled 11181. To prevent the loss of important information during analysis, we retained these outliers.
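
The checks summarised above (shape, missing values, duplicates, skewness, outlier count) can be sketched as follows. This is a minimal illustration on synthetic data: the `demo` frame and the `count_iqr_outliers` helper are hypothetical, and the 1.5 × IQR rule is an assumption, since the report does not state how its outliers were counted.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the report's dataframe: two right-skewed count columns.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "n_medications": rng.poisson(16, size=1000),
    "n_outpatient": rng.exponential(0.4, size=1000).round().astype(int),
})

def count_iqr_outliers(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count, per numeric column, the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()

print("shape:", demo.shape)
print("missing values:", int(demo.isna().sum().sum()))
print("duplicate rows:", int(demo.duplicated().sum()))
print("skewness per column:")
print(demo.skew())
print("IQR outliers per column:")
print(count_iqr_outliers(demo))
```

Summing the per-column counts gives a single total comparable to the figure quoted above, provided the same outlier rule is used.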

Our exploratory data analysis involved data cleaning, data visualization, statistical analysis, and machine learning. To clean the data, we used Pandas to handle missing values and outliers and to transform variables as necessary. We also used Scikit-learn tools, such as OneHotEncoder, to prepare the data for machine learning algorithms. For visualization, we employed Matplotlib and Seaborn to create bar plots, line plots, and heat maps to identify patterns and relationships. Additionally, we used the Pingouin library for statistical analysis, including the Chi-square test to understand relationships between variables. For machine learning, we implemented k-Nearest Neighbors, Logistic Regression, and Random Forests, and evaluated the models using accuracy, precision, recall, F1 score, and cross-validation.
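
The Chi-square test of independence mentioned above can be sketched as follows. This is a minimal illustration, not the analysis code: it uses `scipy.stats.chi2_contingency` instead of Pingouin, and the synthetic `demo` frame and its readmission probabilities are hypothetical values chosen only for the example.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical synthetic data: a two-level diagnosis and a yes/no readmission flag.
rng = np.random.default_rng(1)
n = 2000
is_diabetes = rng.random(n) < 0.3
# Demo-only readmission probabilities (assumed, not taken from the report).
p_readmit = np.where(is_diabetes, 0.55, 0.45)
demo = pd.DataFrame({
    "diag_1": np.where(is_diabetes, "Diabetes", "Other"),
    "readmitted": np.where(rng.random(n) < p_readmit, "yes", "no"),
})

# Contingency table of observed counts: diagnosis (rows) by readmission (columns).
observed = pd.crosstab(demo["diag_1"], demo["readmitted"])

# Test the null hypothesis that diagnosis and readmission are independent.
chi2, p_value, dof, expected = chi2_contingency(observed)

print(observed)
print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
# Row-wise readmission rate for each diagnosis group.
print(observed.div(observed.sum(axis=1), axis=0)["yes"])
```

A small p-value indicates that readmission rates differ between the diagnosis groups by more than chance would explain. It does not measure how large the difference is, so the group rates are printed alongside the test.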

Acknowledgments: Beata Strack, Jonathan P. DeShazo, Chris Gennings, Juan L. Olmo, Sebastian Ventura, Krzysztof J. Cios, and John N. Clore, "Impact of HbA1c Measurement on Hospital Readmission Rates: Analysis of 70,000 Clinical Database Patient Records," BioMed Research International, vol. 2014, Article ID 781670, 11 pages, 2014.

4. Exploratory Data Analysis (EDA)

4.1 What is the most common primary diagnosis by age group?

By identifying the most common diseases in each age group, prevention programs and treatments tailored to each age group can be developed. In addition, this can provide useful information for planning healthcare resources and forecasting future healthcare needs.

One effective method to identify the primary diagnosis by age group involves grouping the data frame by 'age' and 'diag_1', followed by creating a pivot table that displays the ranking of diseases diagnosed in each age group. This approach provides a clear and concise overview of the most prevalent diseases in each age group, enabling us to focus on the diseases that are most likely to affect specific age groups.

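import pandas as pd  # `df` is the readmissions dataframe loaded earlier in the notebook

print('Table 1: Ranking disease table by age group.')
# Count admissions for every (age group, primary diagnosis) pair
age_group = df.groupby(['age', 'diag_1']).size().reset_index(name='counts')

# Diagnoses as rows, age groups as columns; rank 1 = most frequent diagnosis in that age group.
# NOTE: `index` and `columns` are missing from the source and are assumed here.
pivot_table = pd.pivot_table(age_group,
                             index='diag_1',
                             columns='age',
                             values='counts').drop('Missing', axis=0).rank(ascending=False, axis=0)

def color_rank_one(val):
    '''Apply a mediumturquoise background to cells holding rank 1.'''
    if val == 1:
        return 'background-color: mediumturquoise'
    return ''

# Styling chain reconstructed from a garbled source line; on pandas >= 2.1, Styler.map replaces applymap
pivot_table.style.format('{:,.0f}').background_gradient(cmap='Blues_r', axis=0).applymap(color_rank_one)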

In the table, the turquoise highlight marks the most frequent diagnosis in each age group. From the data, we can conclude the following:

  • 'Circulatory' is the most common diagnosis across all age groups except for the 40-50 age group, where 'Other' is the most common.
  • 'Other' is the second most common diagnosis across all age groups except for the 40-50 age group, where 'Circulatory' is the second most common.
  • 'Respiratory' is the third most common diagnosis across all age groups.
  • 'Digestive' is the fourth most common diagnosis across all age groups except for the 40-50 age group, where 'Diabetes' is the fourth most common.
  • 'Injury' is the fifth most common diagnosis for patients aged 60 to 100; in the 40-50 and 50-60 age groups, the fifth most common diagnoses are 'Digestive' and 'Diabetes', respectively.
  • 'Diabetes' is the sixth most common diagnosis for patients aged 60 to 100; in the 40-50 and 50-60 age groups, the sixth most common diagnosis is 'Injury'.
  • 'Musculoskeletal' is the least common diagnosis across all age groups.

We can also visualise the count of primary diagnoses, highlighting in turquoise the diagnosis identified as the most frequent ('Circulatory').

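import matplotlib.pyplot as plt
import seaborn as sns

print('Figure 1: Count primary diagnosis by age group.')
fig, ax = plt.subplots(figsize=(10,10))

# Exclude rows whose primary diagnosis is recorded as 'Missing'
diagnosis = df[df['diag_1'] != 'Missing']
unique_diags = diagnosis.diag_1.unique()

# Shades of blue for every diagnosis, with 'Circulatory' highlighted in mediumturquoise
blues = sns.color_palette('Blues', n_colors=len(unique_diags))
custom_palette = ["mediumturquoise" if diag == "Circulatory"
                  else blues[len(unique_diags) - 1 - i]
                  for i, diag in enumerate(unique_diags)]

# NOTE: the plotting call is missing from the source. A horizontal count plot is assumed here,
# because the loop below reads bar widths and the y-axis is labelled 'Age Group'.
sns.countplot(data=diagnosis, y='age', hue='diag_1',
              hue_order=list(unique_diags), palette=custom_palette, ax=ax)

# Write the count at the end of each horizontal bar
for bar in ax.patches:
    width = bar.get_width()
    x = width
    y = bar.get_y() + bar.get_height() / 2
    label = f"{width:.0f}"
    ax.annotate(label, (x, y),
                ha='left', va='center',
                xytext=(3, 0), fontsize=10,
                textcoords='offset points')

plt.ylabel('Age Group', fontsize=12)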
