Hospital readmissions // To be sick or not to be sick is the question?

    P.S. I used a picture with actors from a cool TV series of my youth as a screensaver for my work.

    Analysis: Aleksey Schukin

    https://www.linkedin.com/in/aleksey-schukin/

    March 2023

    Result

    1. The most common primary diagnosis is "circulatory"; in patients younger than 50 it is the second most common. Depending on the age group, it accounts for about 20-30% of all visits on average. The second most common primary diagnosis falls into the "other" group, and the third is respiratory diseases.

    50% of all patients fall into two age groups:

    • 70 to 80 years: 6837 patients;
    • 60 to 70 years: 5913 patients;
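The diagnosis shares per age group can be recomputed with a normalized frequency table. This is a minimal sketch, assuming the age bracket and the primary diagnosis are stored in columns named `age` and `diag_1` (hypothetical names, not confirmed by the text above); the toy rows are illustrative only and are not hospital data.

```python
import pandas as pd

# Toy rows for illustration only; `age` and `diag_1` are assumed column names.
toy = pd.DataFrame({
    "age": ["[40-50)", "[40-50)", "[40-50)",
            "[70-80)", "[70-80)", "[70-80)", "[70-80)"],
    "diag_1": ["Other", "Other", "Circulatory",
               "Circulatory", "Circulatory", "Respiratory", "Other"],
})

# Share of each primary diagnosis within every age group, in percent.
diag_share = (
    pd.crosstab(toy["age"], toy["diag_1"], normalize="index")
    .mul(100)
    .round(1)
)

# Most common primary diagnosis per age group.
top_diag = diag_share.idxmax(axis=1)

print(diag_share)
print(top_diag)
```

With the real dataframe, `toy` would be replaced by `df`; `normalize="index"` makes each row sum to 100%, which is what allows shares to be compared between age groups of different sizes.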

    2. A third of all re-hospitalized patients have a diagnosis of Diabetes in their medical records.

    With repeated hospitalization and diabetes, the number of days spent in the hospital is greater. In most cases, recovery takes 2 to 7 days, with a median of 4 days. Without repeated hospitalization, recovery takes 2 to 6 days, and the median is also 4 days.

    Patients aged 40 to 60 spend, on average, a couple of days longer in the hospital during re-hospitalization. The remaining patients have approximately the same distribution of hospitalization time, whether or not there was a re-hospitalization.
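The length-of-stay ranges and medians can be summarized with grouped quantiles. A minimal sketch, assuming columns named `readmitted` (with "yes"/"no" values) and `time_in_hospital` (days); both names and the toy values are assumptions for illustration.

```python
import pandas as pd

# Toy rows for illustration only; column names and values are assumptions.
toy = pd.DataFrame({
    "readmitted": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
    "time_in_hospital": [2, 4, 6, 8, 1, 3, 5, 7],
})

# Lower quartile, median and upper quartile of the stay per readmission status.
stay_summary = (
    toy.groupby("readmitted")["time_in_hospital"]
    .quantile([0.25, 0.5, 0.75])
    .unstack()
)
stay_summary.columns = ["q25", "median", "q75"]

print(stay_summary)
```

The interquartile range (`q25` to `q75`) is one reasonable reading of "in most cases"; adding the age column to the `groupby` key would give the same summary per age group.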

    Most patients were not tested for glucose and A1C levels, but about 70% of patients were prescribed medications for diabetes.

    The doctors who suggested that diabetes affects the frequency of repeated hospitalizations were right. In every age group, patients diagnosed with diabetes have more re-hospitalizations than primary hospitalizations, even though re-hospitalizations are the minority in the data overall. The results split by whether glucose and A1C tests were done also confirm our assumptions.
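One way to check the diabetes effect is to compare re-hospitalization rates with and without a diabetes diagnosis inside each age group. A minimal sketch, assuming diagnosis columns `diag_1` and `diag_2`, a label `"Diabetes"`, and a `readmitted` column with "yes"/"no" values; all of these names and the toy rows are assumptions.

```python
import pandas as pd

# Toy rows for illustration only; column names and labels are assumptions.
toy = pd.DataFrame({
    "age": ["[60-70)"] * 4 + ["[70-80)"] * 4,
    "diag_1": ["Diabetes", "Circulatory", "Other", "Diabetes",
               "Other", "Diabetes", "Circulatory", "Other"],
    "diag_2": ["Other", "Diabetes", "Other", "Other",
               "Other", "Other", "Other", "Diabetes"],
    "readmitted": ["yes", "yes", "no", "no", "no", "yes", "no", "yes"],
})

# A patient counts as diabetic if any diagnosis column says "Diabetes".
diag_cols = ["diag_1", "diag_2"]
has_diabetes = toy[diag_cols].eq("Diabetes").any(axis=1)
toy["diabetes_group"] = has_diabetes.map({True: "diabetes", False: "no_diabetes"})
toy["readmitted_flag"] = toy["readmitted"].eq("yes")

# Share of re-hospitalized patients per age group and diabetes status.
readmit_rate = (
    toy.groupby(["age", "diabetes_group"])["readmitted_flag"]
    .mean()
    .unstack("diabetes_group")
)

print(readmit_rate)
```

Comparing rates rather than raw counts keeps the comparison fair when the diabetic and non-diabetic groups differ in size.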

    3. The hospital should focus its further efforts on the following patient groups:

    A diagnosis of diabetes mellitus is always associated with a high risk of re-hospitalization, regardless of the patient's age. As patients get older, the hospital should also look carefully at those with the following diagnoses:

    • Circulatory;
    • Other;
    • Respiratory;
    • Digestive;

    Patients who sought inpatient or outpatient care at least once during a calendar year also fall into the high-risk group for re-hospitalization.

    Depending on the age group, pay attention to the number of days spent in the hospital: for each age there is a critical number of days after which the risk of repeated hospitalization increases.
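A "critical number of days" can be located in several ways. The sketch below uses one possible definition, which is an assumption rather than the method behind the statement above: the smallest stay length whose re-hospitalization rate exceeds the overall rate of the same age group. The column names `age`, `time_in_hospital` and `readmitted` are also assumed, and the rows are toy data.

```python
import pandas as pd

# Toy rows for illustration only; column names and values are assumptions.
toy = pd.DataFrame({
    "age": ["[50-60)"] * 6,
    "time_in_hospital": [2, 2, 5, 5, 9, 9],
    "readmitted": ["no", "no", "no", "yes", "yes", "yes"],
})
toy["readmitted_flag"] = toy["readmitted"].eq("yes")

# Re-hospitalization rate and group size for every (age, days) combination.
rate_by_days = (
    toy.groupby(["age", "time_in_hospital"])["readmitted_flag"]
    .agg(rate="mean", patients="size")
    .reset_index()
)

# Overall rate per age group, used as the reference level.
baseline = toy.groupby("age")["readmitted_flag"].mean().rename("baseline")
rate_by_days = rate_by_days.join(baseline, on="age")

# Assumed definition: first stay length whose rate is above the baseline.
above = rate_by_days[rate_by_days["rate"] > rate_by_days["baseline"]]
critical_days = above.groupby("age")["time_in_hospital"].min()

print(rate_by_days)
print(critical_days)
```

The `patients` column is kept on purpose: a high rate computed from a handful of patients is weak evidence, so small groups deserve a second look before a threshold is read off.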

    The number of prescribed procedures also affects the risk of re-hospitalization. The effect differs by age, but the trend is clear: as the number of procedures increases, so does the number of re-hospitalized patients.

    I would also recommend conducting glucose and A1C tests for patients aged 60-90 years, at least for two diagnoses, diabetes and circulatory, since this can subsequently reduce the risk of re-hospitalization.

    Reducing hospital readmissions

    📖 Background

    You work for a consulting company helping a hospital group better understand patient readmissions. The hospital gave you access to ten years of information on patients readmitted to the hospital after being discharged. The doctors want you to assess if initial diagnoses, number of procedures, or other variables could help them better understand the probability of readmission.

    They want to focus follow-up calls and attention on those patients with a higher probability of readmission.

    import pandas as pd
    import numpy as np
    
    import matplotlib.pyplot as plt
    import seaborn as sns
    import plotly.express as px
    import itertools
    
    
    
    from sklearn.preprocessing import LabelEncoder
    from sklearn.preprocessing import StandardScaler 
    from sklearn.model_selection import train_test_split, cross_val_score
    from sklearn.metrics import accuracy_score, mean_squared_error as MSE, roc_auc_score
    from sklearn.metrics import roc_curve, auc
    import sklearn.metrics as metrics
    
    from sklearn.linear_model import SGDClassifier
    import xgboost as xgb
    from sklearn.ensemble import RandomForestClassifier
    
    import tensorflow as tf
    from tensorflow.keras import Input
    from tensorflow.keras.layers import Dense, Conv1D, MaxPool1D, Activation, Dropout, Flatten, Embedding, concatenate, LSTM, BatchNormalization
    from tensorflow.keras import regularizers
    from tensorflow.keras.callbacks import LearningRateScheduler
    from tensorflow.keras.callbacks import ReduceLROnPlateau,EarlyStopping
    
    import torch
    import torch.nn as nn
    import torch.optim as optim
    #from torch.utils.data import Dataset, DataLoader
    from sklearn.metrics import confusion_matrix, classification_report
    
    sns.set()
    
    # Data
    df = pd.read_csv('data/hospital_readmissions.csv')
    df.head()

    Part 1: Familiarity with the data

    Let's start our story with the standard data checks. I like this code for checking gaps in the data: its output makes it easy to assess the state of the whole dataset at a glance.

    def percent_hbar(df, old_threshold=None):
        """Plot the percentage of missing values per column as horizontal bars."""
        percent_of_nulls = (df.isnull().sum() / len(df) * 100).sort_values().round(2)
        # The mean share of missing values serves as the reference threshold.
        threshold = percent_of_nulls.mean()
        ax = percent_of_nulls.plot(kind='barh', figsize=(10, 14),
                                   title='% of NaN (from {} lines)'.format(len(df)),
                                   color='#86bf91', legend=False, fontsize=25)
        # Label every non-empty bar; values above the threshold are shown in red.
        for i, value in enumerate(percent_of_nulls):
            if value > 0:
                color = 'red' if value > threshold else 'blue'
                ax.text(value + 0.1, i, '{}%'.format(value), color=color,
                        va='center', fontweight='bold', fontsize='large')
        if old_threshold is not None:
            # Thin red line: previous threshold; thick green line: current threshold.
            ax.axvline(x=old_threshold, linewidth=1, color='r', linestyle='--')
            ax.text(old_threshold + 0.3, .10, '{0:.2%}'.format(old_threshold / 100),
                    color='r', fontweight='bold', fontsize='large')
            ax.axvline(x=threshold, linewidth=5, color='green', linestyle='--')
            ax.text(threshold + 0.3, .7, '{0:.2%}'.format(threshold / 100),
                    color='green', fontweight='bold', fontsize='large')
        else:
            ax.axvline(x=threshold, linewidth=3, color='r', linestyle='--')
            ax.text(threshold + 0.3, .7, '{0:.2%}'.format(threshold / 100),
                    color='r', fontweight='bold', fontsize='large')
        ax.set_xlabel('')
        return ax, threshold
    plot, threshold = percent_hbar(df)
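The same percentages can also be checked numerically, which helps when the chart has many columns. A minimal sketch on a toy dataframe (the columns `a` and `b` are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy dataframe with one missing value in column "a"; illustrative only.
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", "y", "z", "w"],
})

# Percentage of missing values per column.
null_percent = toy.isnull().mean().mul(100).round(2)

# Only the columns that actually contain gaps.
cols_with_gaps = null_percent[null_percent > 0]

print(null_percent)
print(cols_with_gaps)
```

An empty `cols_with_gaps` is the numeric equivalent of a chart with no bars.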
    
    The plot confirms that there are no gaps in our data. Let's look at the data itself in more detail.
    df = df.drop_duplicates()  # drop_duplicates() returns a new frame, so keep the result
    df.info()  # info() prints its summary itself and returns None
    display(df.describe())
    
    variables = pd.DataFrame(columns=['Variable','Number of unique values','Values'])
    
    for i, var in enumerate(df.columns):
        variables.loc[i] = [var, df[var].nunique(), df[var].unique().tolist()]
    variables.set_index('Variable', inplace=True)    
    variables
    We can see that the data is already preprocessed: there are no missing values and no visible errors. Part of the data is categorical, and we still need to think about how to make it easier to read and assign convenient values. In the next chapter I propose to take all the nuances of the dataframe into account, carry out the primary analysis, and start on the tasks set for this work.
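Assigning convenient values could look like the sketch below: the age brackets become an ordered categorical and a yes/no column becomes 0/1. The column names `age` and `readmitted` and their values are assumptions; the rows are toy data.

```python
import pandas as pd

# Toy rows for illustration only; column names and values are assumptions.
toy = pd.DataFrame({
    "age": ["[70-80)", "[40-50)", "[60-70)"],
    "readmitted": ["yes", "no", "yes"],
})

# Ordered categorical so that age brackets sort and compare correctly.
# Plain string sorting works here because every bracket starts with two digits.
age_order = sorted(toy["age"].unique())
toy["age"] = pd.Categorical(toy["age"], categories=age_order, ordered=True)

# Binary target as integers, convenient for aggregation and modelling.
toy["readmitted"] = toy["readmitted"].map({"no": 0, "yes": 1})

print(toy.dtypes)
print(toy.sort_values("age"))
```

If the data contains a bracket starting at 100 or above, the order should be written out by hand instead of relying on `sorted`.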