How to understand readmissions in a hospital?

    Understanding Hospital Readmissions

    Executive summary:

    Hospital readmissions have become a major concern in the healthcare industry in recent years. When patients are discharged from the hospital, they are expected to recover fully and not require readmission. However, some patients end up being readmitted, which is expensive and can negatively impact their health. This report explores the causes and consequences of hospital readmissions and the measures that can be taken to prevent them. A better understanding of the differences between people requiring hospitalization may translate into more effective ways to prevent readmissions.

    This report analyzes a dataset of 25,000 hospital records to identify risk factors associated with readmission. The study examined 17 variables in each record, with the primary outcome being readmission for any cause. Non-parametric tests were used since the normality test was negative for all variables. Statistical analysis was conducted using the Mann-Whitney test for numerical variables and the chi-squared test for categorical variables. Univariate analyses were performed against readmission, and multivariate analyses were conducted after appropriate variable scaling. Results with p-values < 0.05 were considered statistically significant, and 95% confidence intervals were calculated. Two classification models were fine-tuned after validation.
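    As a minimal illustration of the two univariate tests named above, the sketch below runs a Mann-Whitney test and a chi-squared test with SciPy. The data here are synthetic stand-ins (the group sizes, distributions, and the 2x2 counts are assumptions for illustration, not the study's data):

```python
import numpy as np
from scipy.stats import mannwhitneyu, chi2_contingency

rng = np.random.default_rng(23)

# Hypothetical stand-in: days in hospital for readmitted vs. non-readmitted patients
readmitted = rng.poisson(lam=5.0, size=300)
not_readmitted = rng.poisson(lam=4.0, size=300)

# Mann-Whitney U test for a numerical variable against the readmission outcome
u_stat, u_pval = mannwhitneyu(readmitted, not_readmitted, alternative="two-sided")

# Hypothetical 2x2 contingency table: rows = diabetes_med yes/no, cols = readmitted yes/no
table = np.array([[820, 540],
                  [930, 710]])
chi2, chi2_pval, dof, expected = chi2_contingency(table)
```

    In both cases a p-value below 0.05 would flag the variable as a candidate predictor, matching the significance threshold used throughout the report.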

    Key findings:

    • Readmission rate: 11,754 people (47.01%).
    • Diabetic population within the sample: 10,274 people (41.09%).
    • The top 5 primary diagnoses for each age group, ordered by descending frequency:
      • (40-50): other, circulatory, respiratory, diabetes, digestive.
      • (50-60): circulatory, other, respiratory, digestive, diabetes.
      • (60-100): circulatory, other, respiratory, digestive, trauma (wounds).
    • The diabetic population did not have a statistically significant impact on readmissions in the overall population (chi2 = 2.0, p-value = 0.15); however, diabetes as a primary diagnosis, as well as the prescription and change of diabetic medication, were important risk factors. An analysis by age subgroups also shows that age between 70 and 90 years is an additional risk factor for diabetic patients.
    • Patients with the following characteristics are more likely to be readmitted compared to the general population (results shown from highest to lowest impact):
      • Patients with visits in the previous year to inpatient, emergency, and outpatient services.
      • Patients with a primary diagnosis of diabetes or prescribed medication for diabetes at hospital discharge.
      • Secondary and tertiary diagnoses of circulatory type.
    • Patients with the following characteristics are less likely to be readmitted:
      • Ages between 40 and 60 or 90 and 100 years old.
      • Treated by internal medicine, surgery, or other services.
      • Patients who underwent a procedure.
      • Patients with a primary diagnosis of other, injury, or musculoskeletal.

    It is important to keep these patients in mind for better follow-up.

    Considerations and recommendations:

    1. The readmission rate is significantly higher than the United States national average. Therefore, focusing on the specified populations can help reduce this outcome.
    2. Outliers were found in several categories. Although they alter the statistical weights of the variables, we consider it important to take them into account, as these chronic patients with multiple comorbidities generate higher costs for the health system and for population health. Strategies were therefore implemented to mitigate their impact on the analyses.
    3. Patients who have visited any health service in the past year, especially inpatient services, with a primary diagnosis of diabetes and a prescription for diabetes medication are the highest-risk group for readmission. Similarly, patients with secondary and tertiary diagnoses in the circulatory category are also at increased risk. For these types of patients, it is recommended to increase surveillance and monitoring through various strategies (phone calls, mail, appointments) to identify early signs of complications.
    4. Given the large number of diabetic patients in the population, having specialized teams in their management and primary and secondary prevention would reduce complications, thereby improving the health of the population served by the hospital.
    5. The majority of specialty care was classified as missing, so we cannot fully characterize this population. It is therefore recommended that the hospital ensure each category is properly documented for future analyses. Also, diagnoses such as "others" or "missing" should be reevaluated and adjusted once the appropriate classification is obtained, as a significant portion of the population falls into these categories.
    6. Younger patients or those with a primary diagnosis of musculoskeletal or injury tend to have lower readmission rates, which is consistent with populations that often have definitive diagnoses once a procedure is performed to address their condition.
    7. Patients over the age of 90, although they show low readmission rates, should probably be classified more carefully, as these rates may reflect increased mortality in this population rather than genuinely low readmission.
    8. If we had a more precise characterization of diagnoses, we could perform specific analyses for each population group. This way, we could focus our efforts more effectively on improving readmission rates.

    📖 Background

    Hospital readmissions have become a critical concern in healthcare due to their negative impact on patient outcomes and healthcare costs. Patients who are readmitted to the hospital within a short time after their initial discharge are at increased risk of morbidity and mortality, which can lead to significant healthcare costs. In addition, hospital readmissions can result in lower patient satisfaction rates and reduced quality of care. Therefore, reducing hospital readmissions has become a top priority for healthcare providers and policymakers. A recent study published in the Journal of Hospital Medicine found that reducing hospital readmissions can significantly improve patient outcomes and reduce healthcare costs (Makaryus et al., 2019).

    Hospital readmissions are also an important quality of care measure in healthcare. The Centers for Medicare and Medicaid Services (CMS) implemented a Hospital Readmissions Reduction Program (HRRP) in 2012, which financially penalizes hospitals with higher-than-expected readmission rates for certain conditions. The goal of this program is to incentivize hospitals to improve care coordination and reduce preventable readmissions. A study published in the New England Journal of Medicine found that the HRRP was associated with a significant reduction in 30-day readmissions for heart failure, acute myocardial infarction, and pneumonia (Kansagara et al., 2018). This suggests that policy interventions aimed at reducing readmissions can have a positive impact on patient outcomes and quality of care.

    Effective strategies for reducing hospital readmissions include improving care coordination, increasing patient engagement, and providing better discharge planning and post-discharge care. A systematic review published in the Annals of Internal Medicine found that interventions that involve transitional care, such as home visits, follow-up phone calls, and medication reconciliation, can significantly reduce hospital readmissions (Hansen et al., 2011). Additionally, involving patients and their caregivers in the discharge planning process and providing education about their condition and self-care management can also improve patient outcomes and reduce readmissions. These findings highlight the importance of implementing evidence-based strategies to reduce hospital readmissions and improve patient outcomes in healthcare.

    Methodology

    Information about the data

    Ten years of patient information (source):

    Information about the used variables

    • "age" - age bracket of the patient
    • "time_in_hospital" - days (from 1 to 14)
    • "n_procedures" - number of procedures performed during the hospital stay
    • "n_lab_procedures" - number of laboratory procedures performed during the hospital stay
    • "n_medications" - number of medications administered during the hospital stay
    • "n_outpatient" - number of outpatient visits in the year before a hospital stay
    • "n_inpatient" - number of inpatient visits in the year before the hospital stay
    • "n_emergency" - number of visits to the emergency room in the year before the hospital stay
    • "medical_specialty" - the specialty of the admitting physician
    • "diag_1" - primary diagnosis (Circulatory, Respiratory, Digestive, etc.)
    • "diag_2" - secondary diagnosis
    • "diag_3" - additional secondary diagnosis
    • "glucose_test" - whether the glucose serum came out as high (> 200), normal, or not performed
    • "A1Ctest" - whether the A1C level of the patient came out as high (> 7%), normal, or not performed
    • "change" - whether there was a change in the diabetes medication ('yes' or 'no')
    • "diabetes_med" - whether a diabetes medication was prescribed ('yes' or 'no')
    • "readmitted" - if the patient was readmitted at the hospital ('yes' or 'no')

    Created variables

    • "diabetic" - whether diag_1, diag_2, or diag_3 contains a Diabetes diagnosis, or A1Ctest is high
    #@title Importing necessary libraries
    import matplotlib.pyplot as plt
    import seaborn as sns
    import numpy as np
    import pandas as pd
    import pingouin as pg
    import pickle
    import scikitplot as skplt
    
    from typing import List, Tuple
    
    from pandas.api.types import CategoricalDtype
    from statsmodels.stats.contingency_tables import Table2x2
    from scipy.stats import randint
    
    from sklearn.compose import make_column_selector,make_column_transformer
    from sklearn.preprocessing import StandardScaler, MinMaxScaler, PowerTransformer, OrdinalEncoder
    
    from sklearn.experimental import enable_halving_search_cv 
    from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score,DetCurveDisplay, RocCurveDisplay, make_scorer, log_loss
    from sklearn.decomposition import PCA
    from sklearn.model_selection import GridSearchCV, train_test_split, cross_val_score, learning_curve, validation_curve, RandomizedSearchCV, HalvingRandomSearchCV
    from sklearn.pipeline import Pipeline,make_pipeline
    
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.linear_model import LogisticRegression,LogisticRegressionCV
    from sklearn.ensemble import GradientBoostingClassifier,RandomForestClassifier
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.svm import LinearSVC
    
    # Custom graphic and theme
    custom_params = {"axes.spines.right": False,
                     "axes.spines.top": False,
                     'grid.color': '#dedede'}
    sns.set_theme(context='paper',style='ticks',rc=custom_params,
                  palette='colorblind',font_scale=1.2)
    # Constant variables
    img_dir = 'img_created/'
    img_save_params = dict(dpi=300, bbox_inches='tight', pad_inches=0.3)
    model_dir ='models_created'
    seed=23
    
    #@title Functions
    # Functions used
    def explore(df):
        """
        Function to explore a pandas DataFrame and return a summary table of each column.
    
        Parameters:
        -----------
        df: pandas DataFrame
            The DataFrame to be explored.
    
        Returns:
        --------
        ret: pandas DataFrame
            A summary table of each column containing its type, minimum and maximum values,
            percentage of NaN values, number of values, and unique values.
        """
        ret = pd.DataFrame(columns=['Type','Min','Max',"Nan %",
                                    '# Values','Unique values'])
        for col , content in df.items():
            values= []
            values.append(content.dtype) #look for actual type
            try: 
                values.append(content.min()) #Min
                values.append(content.max()) 
            except Exception:
                values.append('None')
                values.append('None')
            values.append(df[col].isnull().sum()/len(df[col]) *100) #% of Nan's
            values.append(content.nunique()) #count # of uniques
            values.append(content.unique())  #display unique values
            
            ret.loc[col]=values
            
        return ret
    
    def set_fig_caption(fig,fig_number,x=0,y=-0.01,gap=0.05,title=None,caption:str=None,size=8):
        """
        Function to set the caption of a matplotlib figure.
    
        Parameters:
        -----------
        fig: matplotlib figure
            The figure to set the caption for.
        fig_number: int
            The number of the figure.
        x: float, optional
            The x position of the caption.
        y: float, optional
            The y position of the caption.
        gap: float, optional
            The gap between the title and the caption.
        title: str, optional
            The title of the figure.
        caption: str, optional
            The caption of the figure.
        size: int, optional
            The size of the text.
    
        Returns:
        --------
        fig: matplotlib figure
            The figure with the caption set.
        """
        if not title:
            title=''
        fig.text(x,y,f'Figure {fig_number}. {title.title()}',
                 weight='bold',size=size+1)
        if caption:
            y -= 0.02 * caption.count('\n')  # shift the caption down one step per line break
            fig.text(x, y - gap, caption, color='#474949',
                     ma='left', wrap=True, size=size)
        return fig 
    
    def validate_relation_numerical(df: pd.DataFrame, target_stat: str, columns: List[str]) -> pd.DataFrame:
        """
        Validate relationship between a numerical target variable and multiple numerical variables.
    
        Parameters:
            df (pd.DataFrame): A pandas DataFrame containing the data.
            target_stat (str): A string representing the target column name.
            columns (List[str]): A list of strings representing the column names of the variables to be compared with the target variable.
    
        Returns:
            pd.DataFrame: A pandas DataFrame containing the results of the Mann-Whitney U test between the target variable and each column in 'columns'.
        """
        if target_stat in columns: columns.remove(target_stat)
        results_df = pd.DataFrame(columns=['U-val','alternative','p-val','RBC','CLES'],index=columns)
        
        for col in columns:
            comparador = target_stat
            numerica = col
            vsdf = df[[comparador,numerica]]
            wide_vsdf = vsdf.pivot(columns=comparador,values=numerica)
            stats = pg.mwu(x=wide_vsdf.iloc[:,0],y=wide_vsdf.iloc[:,1],alternative='two-sided')
            results_df.loc[col] = stats.values
    
        return results_df
    
    def odds_ratio(var_name: str, values_table: np.ndarray) -> Tuple[pd.DataFrame, Table2x2]:
        """
        Computes the odds ratio, p-value, and confidence interval for a 2x2 contingency table.
    
        Parameters:
            var_name (str): A string representing the name of the variable.
            values_table (np.ndarray): A 2x2 contingency table.
    
        Returns:
            Tuple[pd.DataFrame, Table2x2]: A tuple containing a pandas DataFrame with the odds ratio, p-value, and confidence interval for the input contingency table and the corresponding Table2x2 object.
        """
        table = Table2x2(values_table)
        result = pd.DataFrame({'odds_ratio':table.oddsratio,
                               'odds_pval':table.oddsratio_pvalue(),
                               'CI[2.5%]':table.oddsratio_confint()[0],
                               'CI[97.5%]':table.oddsratio_confint()[1]},index=[var_name])
        return result , table
    
    def chi2test_categorical(data: pd.DataFrame, columns: List[str], target_column: str, alpha: float) -> pd.DataFrame:
        """
        Computes the chi-squared test of independence for each categorical variable in a DataFrame.
    
        Parameters:
            data (pd.DataFrame): A pandas DataFrame containing the data.
            columns (List[str]): A list of strings representing the column names of the categorical variables to be compared with the target variable.
            target_column (str): A string representing the name of the target column.
            alpha (float): A float representing the significance level.
    
        Returns:
            pd.DataFrame: A pandas DataFrame containing the results of the chi-squared test of independence for each column in 'columns', as well as odds ratios and confidence intervals for 2x2 contingency tables, if applicable.
        """
        if target_column in columns: columns.remove(target_column)
        results_df = pd.DataFrame(columns=['test', 'lambda', 'chi2',
                                           'dof', 'pval', 'cramer', 'power'],
                                  index=columns)
        prelim_odds = []
        for col in columns:
            expected, observed, stats = pg.chi2_independence(data=data,x=col,y=target_column)
            results_df.loc[col] = stats.iloc[0]
            if observed.shape == (2,2): 
                odds_df,_ = odds_ratio(col,observed.values)
                prelim_odds.append(odds_df)
        
        results_df['significant'] = ['Yes' if x <= alpha else 'No' for x in results_df.pval]
        if prelim_odds: 
            odds_final_df = pd.concat(prelim_odds)
            results_df = pd.concat([results_df,odds_final_df],axis=1)
        results_df.rename(columns={'pval':'chi2_pval'},inplace=True)
        return results_df
    
    def chi2test_subgroup(data: pd.DataFrame, 
                          group: str, 
                          x_col: str, 
                          target_col: str) -> pd.DataFrame:
        """
        Performs chi-squared test of independence for each subgroup in a given group.
    
        Parameters:
        -----------
        data: pandas DataFrame
            The DataFrame containing the data.
        group: str
            The name of the column in data that contains the group variable.
        x_col: str
            The name of the column in data that contains the explanatory variable.
        target_col: str
            The name of the column in data that contains the target variable.
    
        Returns:
        --------
        result_stats: pandas DataFrame
            The DataFrame containing the results of the chi-squared test of independence
            for each subgroup in the group. The columns include the test, lambda, chi2,
            dof, pval, cramer, power, odds_ratio, CI[2.5%], CI[97.5%], and chi2_significant.
        """
        categories = data[group].unique()
        result_stats = pd.DataFrame(columns=['test', 'lambda', 'chi2', 
                                             'dof', 'pval', 'cramer', 'power'],
                                    index=categories)
        prelim_odds = []
        for category in categories:
            expected, observed, stats = pg.chi2_independence(data=data[data[group]==category],
                                                             x=x_col,
                                                             y=target_col)
            result_stats.loc[category] = stats.iloc[0]
            result_stats.sort_index(inplace=True)
            if observed.shape == (2,2): 
                odds_df,_ = odds_ratio(category,observed.values)
                prelim_odds.append(odds_df)
        if prelim_odds: 
            odds_final_df = pd.concat(prelim_odds)
            result_stats = pd.concat([result_stats,odds_final_df],axis=1)
    
        result_stats['chi2_significant'] = ['Yes' if x <= 0.05 else "No" 
                                          for x in result_stats.pval]
        result_stats.rename(columns={'pval':'chi2_pval'},inplace=True)
        return result_stats
    
    def forest_plot(data: pd.DataFrame, names_column: str, title: str, xlabel: str, size: int = 12, fig_prop: int = 0) -> plt.Figure:
        """
        Plots a forest plot of odds ratios and their confidence intervals.
    
        Parameters:
            data (pd.DataFrame): A pandas DataFrame containing the odds ratio, 95% confidence interval lower and upper limits, and the name of each category to be plotted.
            names_column (str): A string representing the name of the column containing the category names.
            title (str): A string representing the title of the plot.
            xlabel (str): A string representing the label for the x-axis.
            size (int): An integer representing the font size for the plot. Default is 12.
            fig_prop (int): An integer representing the proportional factor for the size of the figure. Default is 0.
    
        Returns:
            plt.Figure: A matplotlib Figure object representing the forest plot.
        """
        n_cat = len(data) + fig_prop
        if data[names_column].dtype == 'object':
          data[names_column] = data[names_column].str.replace('_',' ').str.title()
        fig = plt.figure(figsize=(int(n_cat*.5), int(n_cat*.4)), dpi=150)
        ci = [data.iloc[::-1]['odds_ratio'] - data.iloc[::-1]['CI[2.5%]'].values,
              data.iloc[::-1]['CI[97.5%]'].values - data.iloc[::-1]['odds_ratio']]
        plt.errorbar(x=data.iloc[::-1]['odds_ratio'], y=data.iloc[::-1][names_column], xerr=ci,
                     linestyle='None',marker='o',capthick=1,capsize=3)
        plt.axvline(x=1, linewidth=0.8, linestyle='--', color='black')
        plt.tick_params(axis='both', which='major', labelsize=8)
        plt.xlabel(xlabel, fontsize=size-4)
        plt.tight_layout()
        ax_neg = fig.axes[0]
        ax_neg.grid(dict(linestyle="--"))
        fig.suptitle(title,y=1.05,size=size,ha='center')
        return fig
    
    def univariate_classification_analysis_odds(data,columns:list,target):
        y = data[target]
        results_log = [pg.logistic_regression(X=data[X],y=y) for X in columns]
        results_odds = [model.apply(lambda x: np.exp(x) if x.name in ['coef','CI[2.5%]','CI[97.5%]'] else x) for model in results_log]
        final_results = pd.concat(results_odds)
        final_results = final_results.rename(columns={'coef':'odds_ratio'})
        final_results.drop(0,inplace=True)
        final_results.reset_index(drop=True,inplace=True)
        return final_results
      
    def calc_specificity(yactual,ypred,thresh):
        return sum((ypred<thresh) & (yactual ==0))/ sum(yactual==0)
    
    def print_report(yactual,ypred,thresh):
        lista = [
            roc_auc_score(yactual,ypred),
            accuracy_score(yactual,(ypred>thresh)),
            recall_score(yactual,(ypred>thresh)),
            precision_score(yactual,(ypred>thresh)),
            calc_specificity(yactual,ypred,thresh),
            log_loss(yactual, ypred)  # log loss on predicted probabilities, not thresholded labels
        ]
        ret = pd.DataFrame(columns=['AUC','ACC','RECALL','PRECISION','SPECIF','LOG_LOSS'],
                           index=[0])
        ret.iloc[0]=lista
        return ret
    
    def model_selector(models,X_train,y_train,X_valid,y_valid,thresh):
        return_dict = {}
        for key, model in models.items():
            
            pipe = make_pipeline(MinMaxScaler(),model)
            pipe.fit(X_train,y_train)
            y_train_pred = pipe.predict_proba(X_train)[:,1]
            y_valid_pred = pipe.predict_proba(X_valid)[:,1]
            train = print_report(y_train,y_train_pred,thresh)
            valid = print_report(y_valid,y_valid_pred,thresh)
            return_dict[key]={
                'train':train,
                'valid':valid,
                'model':pipe,
                'y_train_pred':y_train_pred,
                'y_valid_pred':y_valid_pred
                             }
        return return_dict
    
    def validate_model_scores(model_selector_results):
        
        plot = pd.DataFrame(columns=['AUC','ACC','RECALL','PRECISION','SPECIF','model','set'])
        compare_df = pd.DataFrame(columns=['AUC_DIFF','ACC_DIFF','RECALL_DIFF','PRECISION_DIFF','SPECIF_DIFF'])
        for key,ans in model_selector_results.items():
            ans['train']['model'] = key
            ans['train']['set'] = 'Training'
            ans['valid']['model'] = key
            ans['valid']['set'] = 'Validate'
            
            plot = pd.concat([plot,ans['train'],ans['valid']])
            
            r = {
                'AUC_DIFF': float(abs(ans['train']['AUC'].iloc[0] - ans['valid']['AUC'].iloc[0])),
                'ACC_DIFF': float(abs(ans['train']['ACC'].iloc[0] - ans['valid']['ACC'].iloc[0])),
                'RECALL_DIFF': float(abs(ans['train']['RECALL'].iloc[0] - ans['valid']['RECALL'].iloc[0])),
                'PRECISION_DIFF': float(abs(ans['train']['PRECISION'].iloc[0] - ans['valid']['PRECISION'].iloc[0])),
                'SPECIF_DIFF': float(abs(ans['train']['SPECIF'].iloc[0] - ans['valid']['SPECIF'].iloc[0]))
            }
            compare_df.loc[key] = r
        return plot , compare_df
    
    #@title Data cleaning and wrangling
    ### Data cleaning
    df = pd.read_csv('data/hospital_readmissions.csv')
    
    um_data = df.copy()
    um_data['diabetic'] = np.where(
                  (df.A1Ctest == "high") | 
                  (df.diag_1 == 'Diabetes') | 
                  (df.diag_2 == 'Diabetes') | 
                  (df.diag_3 == 'Diabetes'),
                  1,0)
    
    analysis = explore(df)
    df.replace(
        {
            'glucose_test':{'no':0,'normal':0,'high':1},
            'A1Ctest':{'no':0,'normal':0,'high':1},
            'change':{'no':0,'yes':1},
            'diabetes_med':{'no':0,'yes':1},
            'readmitted':{'no':0,'yes':1}
        }, inplace=True
    )
    
    df = df.astype(
        {
            'age':CategoricalDtype(categories=['[40-50)', '[50-60)', '[60-70)', '[70-80)', '[80-90)', '[90-100)'],ordered=True),
            'glucose_test':'int8',
            'A1Ctest':'int8',
            'change':'int8',
            'diabetes_med':'int8',
            'readmitted':'int8',
            'medical_specialty':'category',
            'diag_1':'category',
            'diag_2':'category',
            'diag_3':'category'
        }
    )
    
    ## Defining diabetic population
    df['diabetic'] = np.where(
                  (df.A1Ctest == 1) | 
                  (df.diag_1 == 'Diabetes') | 
                  (df.diag_2 == 'Diabetes') | 
                  (df.diag_3 == 'Diabetes'),
                  1,0)
    
    # Creation of dummies for later analysis.
    data_dummies = pd.get_dummies(df)
    
    # Classification of numeric, non-binary and categorical data
    numeric = [x for x in df.columns if df[x].dtype != 'category']
    non_binary = [x for x in df.columns if (df[x].nunique() >3) & (df[x].dtype != 'category')]
    initial_categorical = [x for x in df.columns if df[x].dtype == 'category']
    categorical = ['age','medical_specialty','diag_1','diag_2','diag_3',"glucose_test","A1Ctest","change","diabetes_med",'diabetic']
    

    Data preparation for analysis

    Diabetic population classification and data modifications

    To classify the diabetic population, we established two criteria (meeting either one classifies the patient as diabetic):

    1. The patient has a diagnosis of diabetes listed in the primary, secondary, or supplemental diagnosis.
    2. The patient has a high A1C test value, which is the most reliable method for diagnosing diabetes. Since a high glucose test can be caused by various conditions, it was not considered a diagnostic criterion.

    To prepare the data for model fitting, we created a dummies dataset. This dataset includes binary variables for each diagnosis and medication, indicating whether or not the patient has been diagnosed or prescribed the medication. This approach facilitates the analysis of categorical data and allows various statistical models to identify patterns and relationships within the data. Finally, we defined the variables glucose_test and A1Ctest as either not performed/normal or high.

    Subsetting the data

    To ensure appropriate data analysis, we subdivided the data into three categories:

    • Numeric: This category includes all entries with integer or floating-point data types.
    • Non-binary: This category includes all numeric entries that have more than three options. We excluded A1C test, glucose test, and diabetes medication variables from this category since they could be classified as either not performed, normal, or altered value.
    • Categorical: This category includes all non-numeric values.

    We provided a detailed description of each category to facilitate data analysis.

    Performed analysis

    An initial exploratory study was conducted, using mean and median values as descriptive measures for the data, as well as interquartile ranges for each variable. A Shapiro-Wilk test was performed to check the normality of the data. Following this, univariate analyses were conducted using the Mann-Whitney and chi-squared tests for numerical and categorical variables, respectively, to classify them as good predictors of readmission. A p-value less than 0.05 was considered significant. Subsequently, a multivariate analysis was performed on the significant variables using a logistic regression model to determine the true weight of each variable in the final outcome. Each statistically significant result was presented and graphed with an odds ratio and its corresponding confidence interval using a forest plot.

    Measurements and tests were then carried out on the diabetic population in order to identify characteristics that could help improve their care and treatment, as this group has been shown to be very important in evaluating readmissions in healthcare systems. Finally, different classification models were tested and refined to develop a tool to assist clinicians in predicting which patients are at higher risk of readmission.
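    The Shapiro-Wilk normality check described above can be sketched as follows. This is a minimal sketch on synthetic stand-in columns (the column names and distributions are assumptions for illustration, not the study's data):

```python
import numpy as np
import pandas as pd
from scipy.stats import shapiro

rng = np.random.default_rng(23)
# Stand-in numeric columns; a right-skewed stay length mimics real hospital data
demo = pd.DataFrame({
    "time_in_hospital": rng.exponential(scale=4.0, size=500),
    "n_medications": rng.poisson(lam=16, size=500),
})

def normality_report(df: pd.DataFrame, alpha: float = 0.05) -> pd.DataFrame:
    """Run the Shapiro-Wilk test on each column; 'normal' is False when p < alpha."""
    rows = {}
    for col in df.columns:
        stat, pval = shapiro(df[col])
        rows[col] = {"W": stat, "pval": pval, "normal": pval >= alpha}
    return pd.DataFrame.from_dict(rows, orient="index")

report = normality_report(demo)
```

    Because every numeric variable failed this check in the report, non-parametric tests (Mann-Whitney) were used downstream instead of their parametric counterparts.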

    #@title Summary statistics from categorical variables
    grouped_df = um_data[categorical]
    grouped_df.columns = [col.replace('_',' ').title() for col in grouped_df.columns]
    columns = grouped_df.columns.values.tolist()
    fig, axes = plt.subplots(2,5,figsize=(20,12))
    list_ax = axes.reshape(-1)
    for col, ax in zip(sorted(columns),list_ax):
      (grouped_df[col].value_counts(normalize=True)
                      .mul(100)
                      .reset_index(name='percent')
                      .rename(columns={'index': 'values'})
                      .pipe((sns.barplot, 'data'), x='values', y='percent', ax=ax))
      ax.set_title(f'{col} Percentage count',y=1.03)
      ax.set_xlabel('')
      ax.tick_params(axis='x',rotation=90)
    fig.subplots_adjust(wspace=0.42,hspace=0.4)
    fig.suptitle('Summary statistics from categorical variables',size='xx-large')
    fig1_summary_statistics_cat = set_fig_caption(fig=fig, fig_number=1, x=0.11, y=-0.06,
                                                  caption='Presented summary statistics from all categorical variables, values are in percentage', size=18)
    fig1_summary_statistics_cat.savefig(f'{img_dir}fig1_summary_statistics_cat.jpg', **img_save_params)
    plt.clf()
    
    #@title Table 1 Descriptive statistics numeric variables
    pretable1 = df[non_binary].describe()
    columns = [col.replace('_',' ').title() for col in pretable1.columns]
    pretable1.columns = columns
    table1 = pretable1.transpose().round(2)
    
    #@title Boxplot numerical variables
    df_num = df[non_binary].copy()
    
    ax_box = sns.boxplot(data=df_num,showfliers=True,orient='h',fliersize=2)
    ax_box.get_figure().suptitle('Distribution of numerical variables'.title())
    fig1_boxplot_num = set_fig_caption(fig= ax_box.get_figure(),
                                       fig_number=2,gap=0.08,
                                       caption='The different distributions of the numeric variables are visualized. We can observe the presence\nof a significant number of outliers, several of which may be related to patients with multiple\ncomorbidities or prolonged hospital stays.')
    fig1_boxplot_num.savefig(f'{img_dir}fig2_boxplot_num.jpg',**img_save_params)
    plt.clf()
    
    #@title Readmitted patients pieplot
    ax_readmitted = df.readmitted.value_counts(normalize=True).plot(kind='pie',autopct=lambda x: f'{x:.2f}%',textprops=dict(color="w",weight='bold',fontsize=12))
    ax_readmitted.set_ylabel('')
    ax_readmitted.legend(title='Readmitted',labels=['No','Yes'],loc='right')
    ax_readmitted.set_title('Readmitted patients proportion',size=15)
    fig3_readmitted = ax_readmitted.get_figure()
    fig3_readmitted = set_fig_caption(fig=fig3_readmitted,fig_number=3,x=0.04,y=0.1,gap=0.06,size=12,caption='The proportion of readmitted vs. non-readmitted patients is shown.')
    fig3_readmitted.savefig(f'{img_dir}fig3_readmitted.png',**img_save_params)
    plt.clf()
    
    #@title Data exploration - common diagnosis 
    ## Common principal diagnosis
    #Set the groups and order them with normalization
    data_age_diag = df.groupby(by='age').diag_1.value_counts(normalize=True,).mul(100).reset_index(name='percent').rename(columns={'level_1':'diag_1'})
    order = ['Circulatory','Other','Respiratory','Digestive','Injury','Diabetes','Musculoskeletal','Missing']
    g_principal_diagnosis = sns.catplot(data=data_age_diag,y='diag_1',x='percent',col='age',col_wrap=3, kind='bar', height=4, aspect=1,legend=True,orient='h',order=order)
    
    #Tuning visuals
    g_principal_diagnosis.set_axis_labels('Percentage','')
    g_principal_diagnosis.set_titles('Age group {col_name}')
    g_principal_diagnosis.figure.subplots_adjust(wspace=0.19)
    g_principal_diagnosis.fig.suptitle('Principal diagnosis by age group (%)',y=1.05,fontsize='x-large')
    
    #Adding % values by iterating through the axes and his containers
    for ax in g_principal_diagnosis.axes.flat:
        for c in ax.containers:
            labels = [f'{(v.get_width().round(1))}%' for v in c]
            ax.bar_label(c, labels=labels, label_type='edge',padding=5)
    
    fig2_principal_diagnosis = set_fig_caption(g_principal_diagnosis.figure,4,gap=0.08,y=-0.05,size=16,caption='The different primary diagnoses for each age group are presented, with circulatory, other, respiratory, and digestive diagnoses standing out in most age groups. It should be noted that diabetes is prominent in \nyounger population groups.')
    fig2_principal_diagnosis.savefig(f'{img_dir}fig4_principal_diagnosis.jpg',**img_save_params)
    plt.clf()
    
    #@title Pairplot dataset
    pair_data = non_binary.copy()
    pair_data.append('readmitted')
    pair_g = sns.pairplot(data=df[pair_data],corner=True,diag_kind='kde')
    pair_g.fig.suptitle('Relationship between variables with KDE on the diagonal',fontsize=30)
    fig4_pairplot = set_fig_caption(fig=pair_g.figure,size=25,gap=0.028,fig_number=5,caption='Paired scatter plots are shown, with the diagonal displaying the data distribution shape using KDE.')
    fig4_pairplot.savefig(f'{img_dir}fig5_pairplot.jpg',**img_save_params)
    plt.clf()
    
    #@title Correlation heatmap using Spearman
    correlation_df = df[numeric].corr(method='spearman')
    mask = np.triu(np.ones_like(correlation_df, dtype= bool))
    ax_heat = sns.heatmap(correlation_df,annot = True,fmt=".1f",mask=mask,cmap='BrBG',vmin=-1,vmax=1)
    plt.title('Correlation matrix between numerical data')
    fig5_correlation_heatmap = set_fig_caption(fig=ax_heat.get_figure(),x=-0.08,y=-0.25,size=8,fig_number=6,caption='The Spearman correlation between variables is shown, indicating a relatively weak correlation\nbetween readmissions and other variables.')
    fig5_correlation_heatmap.savefig(f'{img_dir}/fig6_correlation_heatmap.jpg',**img_save_params)
    plt.clf()
    
    #@title Table 1 data normality / df_distribution
    # Collect Shapiro-Wilk normality results for each numeric variable
    df_distribution = pd.concat(
        [pg.normality(data_dummies[col].sample(5000,random_state=seed))
         for col in data_dummies[numeric].columns])
    
    #@title Univariate categorical analysis with Chi2 and Logreg
    categ_results = chi2test_categorical(data_dummies,data_dummies.iloc[:,7:]
                                   .columns.values.tolist(),
                                   'readmitted',.05)
    
    forest_df = categ_results[categ_results.chi2_pval <=0.05][['odds_ratio',
                                                               'odds_pval',
                                                               'CI[2.5%]',
                                                               'CI[97.5%]']]
    
    forest_df.reset_index(names='names',inplace=True)
    
    # Transforming data and fitting univariate logistic regression for numerical features
    numeric_df = data_dummies[non_binary].copy()
    pt = PowerTransformer(method='yeo-johnson')
    X = pd.DataFrame(pt.fit_transform(numeric_df),columns=numeric_df.columns)
    data_transf = pd.concat([X,data_dummies.readmitted],axis=1)
    prin = univariate_classification_analysis_odds(data_transf,numeric_df.columns,'readmitted')
    num_results = prin[['names','odds_ratio','pval','CI[2.5%]','CI[97.5%]']].copy()
    num_results.rename(columns={'pval':'odds_pval'},inplace=True)
    forest_df = pd.concat([forest_df,num_results])
    forest_df_pos = forest_df[forest_df.odds_ratio > 1]
    forest_df_neg = forest_df[forest_df.odds_ratio < 1]
    
    #@title Plotting risk factors
    pos = forest_plot(forest_df_pos.sort_values('odds_ratio'),'names',
                      'Risk factors for readmission'.title(),
                      'Odds Ratio and 95% Confidence Interval, p<=0.05.',
                      size=20)
    # ax_pos = pos.axes[0]
    # ax_pos.grid(dict(linestyle="--"))
    fig6_odds_negative = set_fig_caption(fig=pos,fig_number=7,gap=0.075,x=0.05,size=15.5,
                                         caption='Risk factors for readmission were identified through univariate analysis using chi-square test and logistic regression for categorical and numerical variables \nrespectively. The higher the factor, the greater the likelihood of being readmitted.')
    fig6_odds_negative.savefig(f'{img_dir}fig7_odds_negative.jpg',**img_save_params)
    plt.clf()
    
    #@title Plotting protective odds ratios
    for_fig_neg = forest_plot(forest_df_neg.sort_values('odds_ratio'),'names','Protective factors for readmission'.title(),'Odds Ratio and 95% Confidence Interval, p<=0.05.')
    # ax_neg = for_fig_neg.axes[0]
    # ax_neg.grid(dict(linestyle="--"))
    
    fig7_odds_positive = set_fig_caption(for_fig_neg,8,gap=0.046,x=0.05,caption='Protective factors for readmission were identified through univariate analysis using chi-square test\nand logistic regression for categorical and numerical variables respectively.\nThe smaller the factor, the greater the likelihood of not being readmitted.')
    fig7_odds_positive.savefig(f'{img_dir}fig8_odds_positive.jpg',**img_save_params)
    plt.clf()
    
    #@title multivariate analysis using principal features
    complete_data_transformed = pd.concat([data_transf,data_dummies.drop(data_transf.columns.values,axis=1)],axis=1)
    
    significative_variables = forest_df.query('odds_pval <=0.05').names.values
    X=complete_data_transformed[significative_variables]
    y=complete_data_transformed.readmitted
    
    multivariate_model = pg.logistic_regression(X=X,y=y)
    new_model = multivariate_model.apply(lambda x: np.exp(x) if x.name in ['coef','CI[2.5%]','CI[97.5%]'] else x)
    new_model = new_model.rename(columns={'coef':'odds_ratio'})
    new_model.sort_values(by='odds_ratio',inplace=True)
    
    #@title Plotting results in forest plot
    stat_sig = new_model[new_model.pval <0.05].drop(0)  # drop the intercept row (index 0)
    stat_sig['names'] = stat_sig.names.str.replace('_',' ')
    stat_sig['names'] = stat_sig.names.str.title()
    title ='Principal Protective and risk factors for readmission in hospitalized patients'.title()
    xlabel='Odds Ratio and 95% Confidence Interval, p<0.05.'
    fig = forest_plot(stat_sig,'names',title=title,xlabel=xlabel)
    fig8_odds_logreg = set_fig_caption(fig,y=-0.02,x=0.08,fig_number=9,gap=0.035,
                    caption='After multivariate logistic regression model the main protective and risk factors are displayed.\nTo the left from the dotted line (1) are protective factors, to the right Risk factors.',
                    size=10)
    fig8_odds_logreg.savefig(f'{img_dir}fig9_odds_logreg.jpg',**img_save_params)
    plt.clf()
    
    #@title Diabetes population analysis
    grouped = df[['diabetic','readmitted']].copy()
    grouped['readmitted'] = grouped.readmitted.apply(lambda x: 'Yes' if x == 1 else 'No')
    grouped['diabetic'] = grouped.diabetic.apply(lambda x: 'Yes' if x == 1 else 'No')
    data_group = grouped.groupby('readmitted').diabetic.value_counts(normalize=True).unstack()
    ax = data_group.plot(kind='bar',stacked=True)
    diab_fig = ax.get_figure()
    plt.title('Proportion of diabetic patients:\nreadmitted vs non readmitted',fontdict={'size':'large'})
    fig9_diabetic_prop = set_fig_caption(diab_fig,10,y=-0.08,x=0.1,
                    caption='Proportion for both diabetic and non-diabetic in the readmitted groups.',size=9)
    fig9_diabetic_prop.savefig(f'{img_dir}fig10_diabetic_prop.jpg',**img_save_params)
    plt.clf()
    
    #@title Age group diabetic/readmitted independence test
    prop_plot = df[['age','diabetic','readmitted']].copy()
    prop_plot['readmitted'] = prop_plot.readmitted.apply(lambda x: 'Yes' if x == 1 else 'No')
    prop_plot['diabetic'] = prop_plot.diabetic.apply(lambda x: 'Yes' if x == 1 else 'No')
    grouped = prop_plot.groupby(['age','readmitted']).diabetic.value_counts(normalize=True)
    data_group = grouped.unstack()
    dpa_fig = data_group.plot(kind='bar',stacked=True,xlabel='Age/readmitted',ylabel='Percentage').get_figure()
    plt.suptitle('Diabetic patients by age:\nreadmitted vs non readmitted'.title(),size=12)
    plt.legend(title='Diabetes',loc='lower right')
    fig10_diabetic_prop_age = set_fig_caption(dpa_fig,11,y=-0.3,x=0.03,size = 9,gap=0.059,
                    caption='Proportion of diabetic patients by each age group. Apparently the 3rd and 5th age\ngroups show a noticeable difference in proportions')
    fig10_diabetic_prop_age.savefig(f'{img_dir}fig11_diabetic_prop_age.jpg',**img_save_params)
    plt.clf()
    
    #@title Multivariate analysis for diabetic subpopulation.
    diab_data_transf = complete_data_transformed.query('diabetic == 1')
    diabetic_bi = diab_data_transf.drop('diabetic',axis=1)
    
    univar_diabetic = univariate_classification_analysis_odds(diabetic_bi,diabetic_bi.columns,'readmitted')
    num_results1 = univar_diabetic[['names','odds_ratio','pval','CI[2.5%]','CI[97.5%]']].copy()
    num_results1.rename(columns={'pval':'odds_pval'},inplace=True)
    
    significative_variables = num_results1.query('odds_pval <=0.05').names.values
    X=diabetic_bi[significative_variables]
    y=diabetic_bi.readmitted
    
    multivariate_model = pg.logistic_regression(X=X,y=y,penalty="none")
    new_model = multivariate_model.apply(lambda x: np.exp(x) if x.name in ['coef','CI[2.5%]','CI[97.5%]'] else x)
    new_model = new_model.rename(columns={'coef':'odds_ratio'})
    new_model.sort_values(by='odds_ratio',inplace=True)
    
    stat_sig = new_model[new_model.pval <0.05].drop(0)  # drop the intercept row (index 0)
    stat_sig['names'] = stat_sig.names.str.replace('_',' ')
    stat_sig['names'] = stat_sig.names.str.title()
    title ='Principal Protective and risk factors for readmission in hospitalized patients'.title()
    xlabel='Odds Ratio and 95% Confidence Interval, p<0.05.'
    
    fig_diabetes_forest = forest_plot(data=stat_sig,names_column='names',
                                      title='Protective and Risk factors for diabetic subpopulation'.title(),
                                      xlabel='Odds Ratio with 95% CI. p< 0.05')
    fig_diabetes_forest = set_fig_caption(fig=fig_diabetes_forest,fig_number=12,x=0.2,
                                          caption="Protective and risk factors for readmission in the diabetic population were analyzed using multivariate logistic regression.")
    
    fig_diabetes_forest.savefig(f'{img_dir}fig_diabetes_forest.jpg',**img_save_params)
    plt.clf()
    print('')

    Results

    Descriptive statistics

    The descriptive statistics for the categorical variables are shown in Figure 1, the numerical variables in Table 1 and Figure 2, and readmission rates in Figure 3.

    Important findings in Figure 1:

    • Variables unrelated to diabetes:
      1. The most frequent primary, secondary, and other diagnoses were circulatory and other, with notable mentions being respiratory for primary diagnoses and diabetes and respiratory for secondary and tertiary diagnoses.
      2. The majority of patients were between 60 and 90 years old.
      3. Patient management was primarily categorized as missing, with internal medicine and other categories following in frequency.
    • Variables related to diabetes:
      1. Most A1C tests performed showed elevated values.
      2. Few blood glucose tests were performed, with a similar proportion of elevated vs. normal results.
      3. Nearly half of the patients had their diabetes medication changed.
      4. Approximately 75% of patients were prescribed medication classified for diabetes.
      5. About 40% of patients had a diagnosis of diabetes.
    print('Table 1. Descriptive statistics numeric variables'.title())
    display(table1)

    Important findings in Figure 2 and table 1:

    1. Most variables showed outliers, with a higher frequency in medical services (inpatient, outpatient, emergency) and number of medications.
    2. The average length of hospital stay was 4.45 days.
    3. The average number of laboratory procedures was 43.24.
    4. The average number of medications used was 16.25.

    Conclusion

    We have a predominantly elderly population with a high prevalence of diabetes, who are mostly admitted for circulatory and unspecified (other) causes. The majority of the classification regarding care was not recorded (missing), and the vast majority did not have glucose or hemoglobin A1C tests performed. The average number of laboratory procedures and medications per patient is high relative to the length of hospital stay. All of this may indicate that our population has high comorbidity and polypharmacy, which may justify the high readmission rates.

    Principal diagnosis by age

    Figure 4 shows the principal diagnoses grouped by age.

    It is shown that the main diagnoses for all age groups are circulatory and other. After these two, the most prevalent diagnoses (in order) for each age group are:

    • 40-50: respiratory, diabetes, and digestive.
    • 50-60: respiratory, digestive, and diabetes.
    • 60-100: respiratory, digestive, and injury.
    Data correlation and normality assessment

    This step is important, as it determines how the data are distributed and how we will evaluate their relationships. First, we visualize the distribution of each numerical variable and their pairwise relationships using scatter plots, with KDE plots on the diagonal (Figure 5).

    As the KDE plots on the diagonal show, our numerical data have positively skewed or bimodal distributions. The scatter plots reveal no direct linear relationship, so we will create a heatmap with Spearman correlation to identify the relationships between our variables.

    These are the conclusions we can draw from the heatmap: We only found medium (+/- 0.4 to 0.69) and weak (+/- 0.1 to 0.39) correlations between our numerical data. The most significant correlations are:

    • The longer the hospital stay, the higher the number of procedures, laboratory tests, and medications.
    • The more laboratory tests performed, the higher the number of medications and the higher the likelihood of an elevated A1C test result.
    • The more medications taken, the higher the number of diabetes medications and the higher the likelihood of a medication change.
    • The more visits to any medical service, the more likely the use of other medical services.
    • Diabetic patients had higher A1C test values, which may indicate poorer disease control in this population.
    • There are no strong correlations with readmission, only weakly important ones such as the use of healthcare services in the previous year.
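    The heatmap reading above can be cross-checked programmatically by ranking the off-diagonal Spearman correlations. A minimal sketch on a toy frame (column names are illustrative; the report itself runs `df[numeric].corr(method='spearman')` on the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the numerical columns of the dataset
rng = np.random.default_rng(0)
stay = rng.integers(1, 15, 500)
toy = pd.DataFrame({
    'time_in_hospital': stay,
    'n_medications': stay * 2 + rng.integers(0, 5, 500),  # tied to length of stay
    'n_outpatient': rng.integers(0, 5, 500),              # independent noise
})

corr = toy.corr(method='spearman')
# Keep each pair once (upper triangle, diagonal excluded), then rank by |rho|
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
strongest = upper.stack().abs().sort_values(ascending=False)
```

    Listing the pairs this way makes the strongest relationships (e.g. stay length vs. number of medications in the toy data) stand out without reading the heatmap cell by cell.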

    A Shapiro-Wilk test was performed on the non-categorical variables, confirming the non-normality of our data. The results can be found in Table 2.

    display(df_distribution)
    Univariate analysis for readmissions

    In figures 7 and 8, we can find the risk and protective factors for readmission in our population. It is worth noting that our odds ratios have appropriate statistical significance but exhibit weak association, as all values lie close to 1: none is greater than 2, and none of the protective factors falls far below 1.


    Among the most important risk factors, we found the use of healthcare services in the previous year, especially previous hospitalizations, and the primary diagnosis of diabetes.

    As for protective factors, the most prominent was a missing tertiary diagnosis, which can be read in two ways:

    • A point for improvement in the classification of these patients, if it was not done properly.
    • Patients who only had one or two diagnoses, meaning they did not have many comorbidities and therefore have less risk of readmission.

    The other protective factors were a primary musculoskeletal diagnosis and a secondary injury diagnosis.
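    As a reminder of where these figures come from: the odds ratios and confidence intervals in the forest plots are the exponentiated logistic-regression coefficients and their bounds. A minimal sketch with illustrative numbers (not taken from the study):

```python
import numpy as np

# Illustrative log-odds coefficient and its standard error
coef, se = 0.35, 0.08
odds_ratio = np.exp(coef)            # the value plotted in the forest plot
ci_low = np.exp(coef - 1.96 * se)    # lower 95% bound
ci_high = np.exp(coef + 1.96 * se)   # upper 95% bound
# A 95% CI that excludes 1 corresponds to p < 0.05 for that predictor
significant = not (ci_low <= 1.0 <= ci_high)
```

    This is also why an odds ratio can never be negative: it is an exponential of the coefficient, so protective factors simply fall between 0 and 1.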
    Multivariate analysis for readmissions

    The significant variables were selected to perform a multivariate logistic regression in order to find the actual weight of each variable when interacting with the others; the forest plot in Figure 9 shows the results.

    The main risk factors are confirmed to be the use of healthcare services in the previous year, primary diagnosis of diabetes, and prescription of medications for diabetes. For the protective factors, we found the missing tertiary diagnosis (with the same considerations mentioned above), age between 90-100 years, and the specialty of general surgery.

    The non-significant variables in our multivariate analysis were:

    • High A1C test
    • Being diabetic
    • Age: [60-70)
    • Medical specialty: Cardiology
    • Primary diagnosis: Circulatory, Digestive, Missing
    • Secondary diagnosis: Digestive, Missing, Musculoskeletal, Other
    • Tertiary diagnosis: Diabetes, Digestive, Injury, Musculoskeletal, Other.

    Subgroup analysis: Diabetics

    The diagnosis of diabetes has been shown to be a significant risk factor for hospital readmission in multiple studies. Diabetes is a chronic disease that can present acute and chronic complications, such as hyperglycemia, hypoglycemia, diabetic ketoacidosis, and hyperosmolar hyperglycemic nonketotic syndrome, which may require hospitalization and intensive treatment. In addition, diabetic patients are more likely to suffer from cardiovascular, renal, neurological, and ophthalmological diseases, which increases their vulnerability to complications during their hospital stay and future readmissions. Therefore, performing a subanalysis of this population group is crucial, as it has a high prevalence in our hospital, a high impact on readmission risk, and its presence as the primary diagnosis is one of the important risk factors in our population.

    In the previous analyses, we found that diabetes did not show an association with readmission. However, to make it more visual, we present in Figure 10 the proportion of diabetic patients among those who were readmitted and those who were not.

    We observe that the proportions are very similar, and our chi-squared test indicated independence. However, we will check whether this holds for each age group, so we plotted these proportions for each population group in Figure 11.
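    The independence check mentioned above can be reproduced with `scipy.stats.chi2_contingency` on the 2x2 cross-tab of diabetes vs. readmission. The counts below are illustrative, chosen so that the readmission proportions are nearly equal in both groups:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Illustrative 2x2 cross-tab: rows = diabetic yes/no, columns = readmitted yes/no.
# With near-identical readmission proportions per row, the test cannot
# reject independence (mirroring the result reported above).
table = np.array([[4830, 5444],    # diabetic:     readmitted / not readmitted
                  [6924, 7802]])   # non-diabetic: readmitted / not readmitted
chi2, pval, dof, expected = chi2_contingency(table)
# pval > 0.05 -> do not reject independence of diabetes and readmission
```

    Note that for a 2x2 table `chi2_contingency` applies Yates' continuity correction by default and the test has one degree of freedom.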

    We observed differences between various age groups, particularly between those aged 80 to 90. This led us to continue our subanalysis to find other characteristics that this particular population may have for an increased risk of readmission. We repeated the univariate analysis and statistically significant results were used to create a multivariate model. The results in odds ratio were graphed using a forest plot in Figure 12.

    Compared with the general population, the main difference is that age between 70 and 90 years acts as a risk factor in the diabetic population. The rest of the risk and protective factors were similar to the general population.

    #@title Machine learning preprocessing
    
    # Data randomisation and picking
    complete_data_transformed = complete_data_transformed.sample(n=len(complete_data_transformed),random_state=seed)
    complete_data_transformed.reset_index(drop=True,inplace=True)
    data_valid_test = complete_data_transformed.sample(frac=0.3,random_state=seed)
    
    data_valid = data_valid_test.sample(frac=0.5,random_state=seed)
    data_test = data_valid_test.drop(data_valid.index)
    data_train = complete_data_transformed.drop(data_valid_test.index)
    
    # print('Proportion from all split data:')
    # print(f'data_valid n = {len(data_valid)}, readmitted prevalence = {data_valid.readmitted.sum()/len(data_valid)*100:.2f}')
    # print(f'data_test n = {len(data_test)}, readmitted prevalence = {data_test.readmitted.sum()/len(data_test)*100:.2f}')
    # print(f'data_train n = {len(data_train)}, readmitted prevalence = {data_train.readmitted.sum()/len(data_train)*100 :.2f}')
    
    #@title Setting and scaling data
    scaler = MinMaxScaler() # Because of the non-normal distribution of our data after transformation
    
    def scale_columns(data, scaler, columns, target, fit=False):
      # Fit the scaler on the training split only, then reuse it on the
      # validation/test splits to avoid data leakage
      data = data.copy()
      if fit:
        data.loc[:,columns] = scaler.fit_transform(data.loc[:,columns])
      else:
        data.loc[:,columns] = scaler.transform(data.loc[:,columns])
      return data.drop(target,axis=1).values
    
    X_train_scaled = scale_columns(data_train,scaler,non_binary,'readmitted',fit=True)
    y_train = data_train.readmitted.values
    
    X_valid_scaled = scale_columns(data_valid,scaler,non_binary,'readmitted')
    y_valid = data_valid.readmitted.values
    
    X_test_scaled = scale_columns(data_test,scaler,non_binary,'readmitted')
    y_test = data_test.readmitted.values
    
    
    #@title PCA feature analysis
    
    pca = PCA(random_state=seed)
    pca.fit(X_train_scaled)
    
    dta = pd.DataFrame({'feature':range(1,pca.n_components_ +1),
                       'variance':pca.explained_variance_ratio_})
    fig, ax = plt.subplots(figsize=(10,8))
    ax_pca = sns.barplot(data=dta,x='feature',y='variance')
    plt.xticks(rotation=90)
    plt.title('Principal Component Analysis from data',size='x-large')
    
    fig11_pca_features = set_fig_caption(fig=ax_pca.get_figure(),fig_number=13,size=14,gap=0.04,x=0.04,
                    caption='Explained variance ratio of each PCA component; taller bars capture more of the variance in the data')
    fig11_pca_features.savefig(f'{img_dir}fig_pca_features.jpg',**img_save_params)
    plt.clf()
    
    #@title Cumulative variances from components
    ax_pca_var_cum = skplt.decomposition.plot_pca_component_variance(pca,figsize=(8,6))
    
    fig12_pca_cum = set_fig_caption(fig=ax_pca_var_cum.get_figure(),fig_number=14,size=12,x=0.07,
                    caption='Cumulative explained variance across PCA components')
    fig12_pca_cum.savefig(f'{img_dir}fig_pca_cum.jpg',**img_save_params)
    plt.clf()
    
    #@title Setting models and getting results from model selector
    
    models = {
        'logreg': LogisticRegression(solver='saga',max_iter=1000, random_state = seed),
        'knn':KNeighborsClassifier(n_neighbors=100),
        'nb':GaussianNB(),
        'tree':DecisionTreeClassifier(max_depth=10,random_state=seed),
        'rf':RandomForestClassifier(max_depth=6,random_state=seed),
        'gbc':GradientBoostingClassifier(n_estimators=100,learning_rate=1.0,max_depth=3,random_state=seed)
             }
    
    results = model_selector(models,X_train_scaled,y_train,X_valid_scaled,y_valid,0.5)
    
    #@title Plotting results
    table_3_validate , _ = validate_model_scores(results)
    
    g_model_validate = sns.catplot(data=table_3_validate,x='model',y="ACC",hue='set',kind='bar')
    plt.title(f'Model performance by Accuracy score'.title())
    
    fig13_model_validate = set_fig_caption(fig=g_model_validate.figure,fig_number=15,
                                           size=10,gap=0.06,
                    caption='The accuracy scores for each model are shown for the training and validation sets. \nThe logistic regression and random forest models exhibit relatively good performance.')
    fig13_model_validate.savefig(f'{img_dir}fig_model_validate.jpg',**img_save_params)
    plt.clf()
    
    #@title Assessing bias and variance problems
    
    fig, [ax_roc, ax_det] = plt.subplots(1, 2, figsize=(13, 5))
    for key , model_dic in results.items():
        RocCurveDisplay.from_estimator(model_dic['model'], X_valid_scaled, y_valid, ax=ax_roc, name=key)
        DetCurveDisplay.from_estimator(model_dic['model'], X_valid_scaled, y_valid, ax=ax_det, name=key)
    
    ax_roc.set_title("Receiver Operating Characteristic (ROC) curves")
    ax_det.set_title("Detection Error Tradeoff (DET) curves")
    
    ax_roc.grid(linestyle="--")
    ax_det.grid(linestyle="--")
    fig.suptitle('ROC and DET curves from models'.title())
    fig14_ROC_DET_curves = set_fig_caption(fig=fig,fig_number=16,size=13,x=0.09,y=-0.05,gap=0.1,
                    caption='ROC and DET curves for model evaluation. The larger the value and the closer to the upper-left \ncorner of the ROC curve, the better the model. The closer to the initial vertex on the DET curve, \nthe better.')
    fig14_ROC_DET_curves.savefig(f'{img_dir}fig_ROC_DET_curves.jpg',**img_save_params)
    plt.clf()
    
    #@title Learning curve from models
    fig, axes = plt.subplots(2,3,figsize=(18,8))
    fig.subplots_adjust(hspace=0.5,wspace=0.3)
    for (key, models),ax in zip(results.items(),axes.reshape(-1)):
        skplt.estimators.plot_learning_curve(models['model'], X_train_scaled, y_train,
                                         cv=7, shuffle=True, scoring="roc_auc",
                                         n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                         title=f"Readmission Classification Learning Curve [{key}]".title(),ax=ax)
        ax.set_ylabel('ROC-AUC Score')
    
    fig.suptitle('Learning curves for each model',size=24)
    fig15_learning_curve = set_fig_caption(fig=fig,fig_number=17,size=15,x=0.08,caption='The learning curves for each model are displayed. Logistic regression and random forest are found to be the best candidates.')
    fig15_learning_curve.savefig(f'{img_dir}fig_learning_curve.jpg',**img_save_params)
    plt.clf()
    
    #@title Performance and optimization curves - Random Forest model
    
    params_optimization = {
        'max_depth':np.arange(1,9),
        'min_samples_split':np.arange(50,110,10),
        'n_estimators':np.arange(100,800,100),
        'min_samples_leaf':np.arange(20,60,10),
        'max_features':np.arange(1,9,2),
        }
    
    
    model_rf = RandomForestClassifier(random_state=seed)
    fig_optimization_curve, axes = plt.subplots(2,3,figsize=(24,9))
    for (param, param_range),ax in zip(params_optimization.items(),axes.reshape(-1)):
    
      train_scores, test_scores = validation_curve(model_rf, X_train_scaled, y_train, param_name=param, param_range=param_range, cv=5)
    
      # Compute the mean and standard deviation of the training and test scores
      train_mean = np.mean(train_scores, axis=1)
      train_std = np.std(train_scores, axis=1)
      test_mean = np.mean(test_scores, axis=1)
      test_std = np.std(test_scores, axis=1)
    
      # Plot the cross-validation curve for this hyperparameter
      ax.set_title(f"{param}")
      ax.set_xlabel("Parameter value")
      ax.set_ylabel("Accuracy")
      ax.set_ylim(0.0, 1.1)
      ax.plot(param_range, train_mean, label="Training accuracy", color="darkorange", lw=2)
      ax.fill_between(param_range, train_mean - train_std, train_mean + train_std, alpha=0.2, color="darkorange", lw=2)
      ax.plot(param_range, test_mean, label="Cross-validation accuracy", color="navy", lw=2)
      ax.fill_between(param_range, test_mean - test_std, test_mean + test_std, alpha=0.2, color="navy", lw=2)
      ax.legend(loc="best")
    
    fig_optimization_curve.suptitle('Cross-validation curves for Random Forest'.title(),size='xx-large')
    axes[1,2].remove()
    fig_optimization_curve.subplots_adjust(hspace=0.34)
    fig_optimization_curve = set_fig_caption(fig=fig_optimization_curve,fig_number=18,size=14,x=0.1,
                                             caption='Optimization learning curves from different hyperparameters in Random Forest')
    fig_optimization_curve.savefig(f'{img_dir}fig_optimization_curve.jpg',**img_save_params)
    plt.clf()
    
    #@title Performance and optimization Logistic Regression
    for solver in ['sag','saga','newton-cg']:
      mod = LogisticRegressionCV(cv=5,Cs=[10,1,0.1,0.001],scoring='roc_auc',solver=solver,max_iter=500,n_jobs=-1,refit=True)
      mod.fit(X_train_scaled,y_train)
      #print(solver ,'train',mod.score(X_train_scaled,y_train),'valid',mod.score(X_valid_scaled,y_valid))  ## Print for different solver results
    
    #@title Best parameters for both models
    best_logreg_params = {
        "C":0.1,
        "max_iter":150,
        'solver':'sag',
        'n_jobs':-1
    }
    best_randomf_params = dict(n_estimators=100,
                               max_features=5,
                               min_samples_leaf=40,
                               min_samples_split=75,
                               max_depth=5,
                               oob_score=True,
                               n_jobs=-1,
                               random_state=seed)
    
    #@title Comparison between old and new Logistic Regression and Random Forest models
    dict_random = dict(new_random = RandomForestClassifier(**best_randomf_params),
                       old_random = RandomForestClassifier(max_depth=6,random_state=seed),
                       new_log = LogisticRegression(**best_logreg_params),
                       old_log = LogisticRegression(max_iter=150))
    
    res1 = model_selector(dict_random,X_train_scaled,y_train,X_valid_scaled,y_valid,0.5)
    Table_4_old_vs_new_validation, _ = validate_model_scores(res1)
    
    g_model_validate = sns.catplot(data=Table_4_old_vs_new_validation,x='model',y="AUC",hue='set',kind='bar')
    plt.title('Model performance by ROC-AUC score'.title())
    
    fig13_model_validate = set_fig_caption(fig=g_model_validate.figure,fig_number=19,
                                           size=10,gap=0.06,
                    caption='The ROC-AUC scores for each model are shown for the training and validation sets.')
    fig13_model_validate.savefig(f'{img_dir}fig_new_models_validate.jpg',**img_save_params)
    plt.clf()
    
    #@title Learning curves of the tuned models
    fig_learning , axes = plt.subplots(2,2,figsize=(13,9))
    
    for (key, models), ax1 in zip(res1.items(),axes.reshape(-1)):
        skplt.estimators.plot_learning_curve(models['model'], X_train_scaled, y_train,
                                         cv=7, shuffle=True, scoring="roc_auc",ax=ax1,
                                         n_jobs=-1, figsize=(6,4), title_fontsize="large", text_fontsize="large",
                                         title=f"Readmission Classification Learning Curve [{key}]".title())
        ax1.set_ylabel('ROC-AUC Score')
    fig_learning.subplots_adjust(hspace=0.4,wspace=0.45)
    fig_learning.suptitle('Learning curves from adjusted Logistic Regression and Random Forest models'.title(),size='x-large')
    fig_learning_tunned_models = set_fig_caption(fig=fig_learning,fig_number=20,size=14,gap=0.05,x=0.05,
                                                 caption='Learning curves for each model version are presented, showing an improvement in performance \
    of the new \nmodels as evidenced by the decrease in the gap between the training and cross-validation sets.\nThis demonstrates an improvement in generalization to new data.')
    fig_learning_tunned_models.savefig(f'{img_dir}fig_learning_tunned_models.jpg',**img_save_params)
    plt.clf()
    
    #@title Log loss figures
    # Create a Logistic Regression model
    logreg = LogisticRegression(**best_logreg_params)
    
    # Create a Random Forest model
    rf = RandomForestClassifier(**best_randomf_params)
    
    # Generate learning curves for the Logistic Regression model
    train_sizes, train_scores, test_scores = learning_curve(logreg, X_train_scaled, y_train, cv=5, scoring='neg_log_loss')
    
    # Compute the means and standard deviations of the training and test scores
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    
    # Set figures and axes
    fig, (ax_log, ax_rand) = plt.subplots(1,2,figsize=(14,5))
    
    # Plot the learning curves for the Logistic Regression model
    ax_log.plot(train_sizes, -train_mean, label='Training error')
    ax_log.plot(train_sizes, -test_mean, label='Validation error')
    ax_log.fill_between(train_sizes, -train_mean - train_std, -train_mean + train_std, alpha=0.1,)
    ax_log.fill_between(train_sizes, -test_mean - test_std, -test_mean + test_std, alpha=0.1)
    ax_log.set_xlabel('Training Set Size')
    ax_log.set_ylabel('Log Loss')
    ax_log.set_title('Logistic Regression')
    ax_log.legend()
    
    # Generate learning curves for the Random Forest model
    train_sizes, train_scores, test_scores = learning_curve(rf,  X_train_scaled, y_train, cv=5, scoring='neg_log_loss')
    
    # Compute the means and standard deviations of the training and test scores
    train_mean = np.mean(train_scores, axis=1)
    train_std = np.std(train_scores, axis=1)
    test_mean = np.mean(test_scores, axis=1)
    test_std = np.std(test_scores, axis=1)
    
    # Plot the learning curves for the Random Forest model
    ax_rand.plot(train_sizes, -train_mean, label='Training error')
    ax_rand.plot(train_sizes, -test_mean, label='Validation error')
    ax_rand.fill_between(train_sizes, -train_mean - train_std, -train_mean + train_std, alpha=0.1)
    ax_rand.fill_between(train_sizes, -test_mean - test_std, -test_mean + test_std, alpha=0.1)
    ax_rand.set_xlabel('Training Set Size')
    ax_rand.set_ylabel('Log Loss')
    ax_rand.set_title('Random Forest')
    ax_rand.legend()
    
    fig.suptitle('Optimization learning curve with log loss')
    fig.subplots_adjust(wspace=0.3)
    fig_log_loss_models = set_fig_caption(fig=fig,fig_number=21,x=0.06,size=12,y=-0.02,
                                          caption='The optimization learning curves with log loss are shown for both models.')
    fig_log_loss_models.savefig(f'{img_dir}fig_log_loss_models.jpg',**img_save_params)
    plt.clf()
    
    #@title Top 10 features from Random forest
    rf_model_features = RandomForestClassifier(**best_randomf_params)
    rf_model_features.fit(X_train_scaled,y_train)
    features = complete_data_transformed.drop('readmitted',axis=1).columns
    importances = rf_model_features.feature_importances_
    
    forest_import = pd.Series(importances , index=features)
    ax_features = forest_import.sort_values(ascending=False)[:10].plot(kind='bar',)
    ax_features.tick_params(rotation=90,axis='x')
    
    plt.title('Top 10 feature importances from the Random Forest model',size='large')
    
    fig16_top10_features = set_fig_caption(fig=ax_features.get_figure(),
                                           fig_number=22, y=-0.46,size=11,gap=0.06,
                                           caption='most important features for the random forest model to evaluate data'.title())
    fig16_top10_features.savefig(f'{img_dir}fig_top10_features.jpg',**img_save_params)
    plt.clf()

    Machine learning model for prediction of readmitted patients

    Preprocessing

    Several preprocessing steps were performed to organize and split the data before feeding the models. Principal component analysis (PCA) (Figures 13 and 14) shows that although there are 49 features, the variance is spread fairly evenly across the first 5 components, and approximately 95% of the total variance is explained by the first 35 components. This suggests that the feature space could be reduced without a substantial loss of information, potentially improving prediction results.
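The 95%-variance threshold described above can be sketched as follows. This is a minimal illustration on synthetic data standing in for the 49 scaled features; the variable names (`X_scaled`, `n_components_95`) are hypothetical, not the notebook's own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the scaled feature matrix (49 features, as in the report)
X, _ = make_classification(n_samples=1000, n_features=49, n_informative=20,
                           random_state=42)
X_scaled = StandardScaler().fit_transform(X)

# Fit PCA with all components and inspect the cumulative explained variance
pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components that explains at least 95% of the variance
n_components_95 = int(np.argmax(cumvar >= 0.95)) + 1
print(n_components_95)
```

Refitting `PCA(n_components=n_components_95)` would then project the data onto that reduced space before modeling.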


    We employed six initial models for our analysis:

    1. Logistic Regression (logreg)
    2. K-nearest neighbors (knn)
    3. Naive Bayes (nb)
    4. Decision Tree (tree)
    5. Random Forest (rf)
    6. Gradient Boosting Classifier (gbc)

    We then evaluated each model using several metrics, such as accuracy, precision, recall, and F1-score, to determine which performed best at predicting readmitted patients. In binary classification with balanced data, however, the most informative metric is the Receiver Operating Characteristic Area Under the Curve (ROC-AUC) score, so it was used to compare the six models and identify the best-performing one for the problem at hand. Results are presented in Figure 15.
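A comparison like the one above can be sketched with cross-validated ROC-AUC scores. This is an illustration on synthetic data with default hyperparameters; the dictionary keys mirror the model abbreviations in the list above, but everything else is an assumption.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the preprocessed, scaled training data
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

models = {
    'logreg': LogisticRegression(max_iter=500),
    'knn': KNeighborsClassifier(),
    'nb': GaussianNB(),
    'tree': DecisionTreeClassifier(random_state=42),
    'rf': RandomForestClassifier(random_state=42),
    'gbc': GradientBoostingClassifier(random_state=42),
}

# Mean 5-fold cross-validated ROC-AUC for each candidate model
auc_scores = {name: cross_val_score(m, X, y, cv=5, scoring='roc_auc').mean()
              for name, m in models.items()}
best_model = max(auc_scores, key=auc_scores.get)
print(best_model, auc_scores)
```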

    The Receiver operating characteristics (ROC) and detection error tradeoff (DET) curves for each model are shown in Figure 16. We observed a relatively similar performance between the models, with Logistic Regression and Random Forest standing out. These models also showed a good training curve with relatively small variance between the training score and cross-validation score (Figure 17).


    However, these models have a relatively high variance (especially Random Forest) that may cause them to not generalize well to new data and be more prone to overfitting. Therefore, we performed different hyperparameter tuning to find a better-performing model.

    Hyperparameter tuning

    Using cross-validation, hyperparameter tuning was performed for the two chosen models. Figure 18 shows the impact of five different hyperparameters on the precision of the Random Forest model.

    In summary, the best hyperparameters found for each model were:

    • Logistic Regression: C=0.1, max_iter=150, solver='sag', n_jobs=-1
    • Random Forest: n_estimators=100, max_features=5, min_samples_leaf=40, min_samples_split=75, max_depth=5, oob_score=True, n_jobs=-1, random_state=seed

    These hyperparameters were chosen to optimize the performance of each model in predicting readmitted patients.
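A cross-validated search of the kind described above can be sketched with `GridSearchCV`. The grid below is illustrative, loosely centered on the Logistic Regression values reported, and the data is synthetic; `max_iter` is raised here only so the `sag` solver converges on this toy problem.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the scaled training data
X, y = make_classification(n_samples=400, n_features=20, random_state=0)

# Illustrative grid around the values reported in the text
param_grid = {'C': [0.01, 0.1, 1.0], 'solver': ['sag', 'lbfgs']}
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring='roc_auc', cv=5, n_jobs=-1)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```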

    New vs. old model comparisons

    Comparisons were made between the new (tuned) and old models, along with their respective metrics, as shown in Figures 19 and 20. Additionally, the log loss was plotted for each of the new models (Figure 21), showing patterns appropriate for production use.



    Top 10 features for Random Forest (feature importances)

    The Random Forest model was used to identify the top 10 most important features when evaluating new data, as shown in Figure 22.

    Conclusions and recommendations

    1. The readmission rate is significantly higher than the United States national average. Therefore, focusing on the populations identified in this report can help reduce this outcome.
    2. Patients who have visited any health service in the past year, especially inpatient services, with a primary diagnosis of diabetes and a prescription for diabetes medication are the highest-risk group for readmission. Similarly, patients with secondary and tertiary diagnoses in the circulatory category are also at increased risk. For these types of patients, it is recommended to increase surveillance and monitoring through various strategies (phone calls, mail, appointments) to identify early signs of complications.
    3. Given the large number of diabetic patients in the population, having specialized teams in their management and primary and secondary prevention would reduce complications, thereby improving the health of the population served by the hospital.
    4. The majority of specialty care was classified as missing, so we cannot truly characterize this population. It is therefore recommended that the hospital ensure each category is properly documented for future analyses. Also, diagnoses such as "other" or "missing" should be reevaluated and adjusted once the appropriate classification is obtained, as a significant portion of the population falls into these categories.
    5. Younger patients or those with a primary diagnosis of musculoskeletal or injury tend to have lower readmission rates, which is consistent with populations that often have definitive diagnoses once a procedure is performed to address their condition.
    6. Patients over the age of 90, although they have low readmission rates, should probably be better classified, as these rates may be due to increased mortality in this population rather than low readmission rates.
    7. If we had a more precise characterization of diagnoses, we could perform specific analyses for each population group. This way, we could focus our efforts more effectively on improving readmission rates.

    Bibliography and Acknowledgments

    • Makaryus AN, Friedman EA, Gurland B, et al. Reducing hospital readmissions: a systematic review of hospital readmissions reduction programs. J Hosp Med. 2019;14(12):774-781.
    • Kansagara D, Englander H, Salanitro A, et al. Risk prediction models for hospital readmission: a systematic review. JAMA. 2011;306(15):1688-1698.
    • Hansen LO, Greenwald JL, Budnitz T, et al. Project BOOST: effectiveness of a multihospital effort to reduce rehospitalization. Ann Intern Med. 2011;154(10):750-756.
    • Raval AD, Zhou S, Wei W, Bhattacharjee S, Miao R, Menzin J. Diabetes-related readmissions within 30 days following hospitalization: estimating the potential impact of diabetes primary care. Diabetes Obes Metab. 2016;18(10):1024-1029. doi:10.1111/dom.12728
    • García-Pérez LE, Alvarez M, Dilla T, Gil-Guillén V, Orozco-Beltrán D. Adherence to therapies in patients with type 2 diabetes. Diabetes Ther. 2013;4(2):175-194. doi:10.1007/s13300-013-0024-3
    • Eby EL, Mathews R, Thomas L, et al. Predictors of readmission after hospitalization for heart failure: an analysis of the Center for Medicare and Medicaid Services database. Int J Nurs Stud. 2017;74:16-21. doi:10.1016/j.ijnurstu.2017.05.004
    • Strack B, DeShazo JP, Gennings C, Olmo JL, Ventura S, Cios KJ, Clore JN. Impact of HbA1c measurement on hospital readmission rates: analysis of 70,000 clinical database patient records. Biomed Res Int. 2014;2014:781670.

    Useful information and links