Hospital readmissions // To be sick or not to be sick is the question?

P.S. I used a picture of actors from a cool TV series of my youth as the cover image for this work.

Part 2: Exploratory Data Analysis

Since the age variable comes in 10-year steps, this categorical encoding of age is suitable for our analysis, so we will proceed. If it turns out along the way that adjustments are needed, we will return to this point.

1. What is the most common primary diagnosis by age group?

We have built three graphs:

From these graphs we can draw several conclusions. The most common primary diagnosis is "Circulatory"; among patients younger than 50 it is the second most common. Depending on the age group, it accounts for about 20-30% of all admissions. The second most common primary diagnosis is the "Other" group, and the third is respiratory diseases.
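The shares behind these conclusions can be sketched with a row-normalised cross-tabulation. This is only an illustration: the column names `age` and `diag_1` and the toy rows are assumptions, not the real data.

```python
import pandas as pd

# Toy rows standing in for the real data; column names and values are assumptions.
toy = pd.DataFrame({
    "age": ["[40-50)", "[40-50)", "[40-50)", "[70-80)", "[70-80)", "[70-80)"],
    "diag_1": ["Other", "Other", "Circulatory",
               "Circulatory", "Circulatory", "Respiratory"],
})

# Share of each primary diagnosis within an age group (every row sums to 1).
diag_share = pd.crosstab(toy["age"], toy["diag_1"], normalize="index")

# Most common primary diagnosis per age group.
top_diag = diag_share.idxmax(axis=1)

print(diag_share.round(2))
print(top_diag)
```

Normalising by row rather than over the whole table keeps large age groups from dominating the comparison.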

Half of all patients fall into two age groups: 70 to 80 years (6,837 patients) and 60 to 70 years (5,913 patients).

2. Some doctors believe diabetes might play a central role in readmission. Explore the effect of a diabetes diagnosis on readmission rates.

Patients are split almost evenly: 47% of them were readmitted to the hospital. Let's look at these patients in more detail in a separate dataframe.

To understand the question, we need to see why it contains two qualifying factors:

Why exactly does diabetes affect readmission, and what does readmission mean for the clinic?

These questions gave me food for thought. After reading several scientific articles, I began to understand why readmission matters so much to hospitals.

Now I will try to convey what I learned.

In one of the articles I found the above cost calculation:

These facts make it clear why these factors receive so much attention.

To answer the question properly, we probably need to consider both sets of patients, those who were readmitted and those who were not, and in each case evaluate how a diabetes diagnosis affects the number of patients.

Which diagnosis should we take as the basis: the first, the second, or perhaps the third? Or only the cases where all three diagnoses agree?

We counted a patient as diabetic if diabetes appeared in at least one of the three diagnosis columns, and then filtered the data to keep only readmitted patients. In the resulting dataframe, 35% of readmitted patients have diabetes as at least one of their three diagnoses.
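A minimal sketch of this filter, assuming the three diagnosis columns are called `diag_1`, `diag_2` and `diag_3`, that the readmission column is `readmitted` with values "yes" and "no", and that the diagnosis label is "Diabetes". The rows below are made up for illustration.

```python
import pandas as pd

# Made-up rows; column names and labels are assumptions about the dataset.
toy = pd.DataFrame({
    "diag_1": ["Diabetes", "Circulatory", "Other", "Respiratory"],
    "diag_2": ["Other", "Diabetes", "Other", "Circulatory"],
    "diag_3": ["Other", "Other", "Other", "Diabetes"],
    "readmitted": ["yes", "yes", "yes", "no"],
})

diag_cols = ["diag_1", "diag_2", "diag_3"]

# True where diabetes appears in at least one of the three diagnosis columns.
has_diabetes = toy[diag_cols].eq("Diabetes").any(axis=1)
is_readmitted = toy["readmitted"].eq("yes")

# Separate dataframe: readmitted patients with any diabetes diagnosis.
readmitted_diabetes = toy[is_readmitted & has_diabetes]

# Share of readmitted patients who have at least one diabetes diagnosis.
diabetes_share = has_diabetes[is_readmitted].mean()
print(f"{diabetes_share:.0%} of readmitted patients have a diabetes diagnosis")
```

Using `any(axis=1)` answers the "which diagnosis column?" question in the most inclusive way; replacing it with `all(axis=1)` would keep only patients for whom all three diagnoses agree.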

A third of all re-hospitalized patients have a diagnosis of Diabetes in their medical records.

The last graph gives a good picture of both diagnoses and readmission. For every diagnosis, the number of patients who were not readmitted is greater than the number who were, with one exception: "diabetes"!

So far we have only looked at the primary diagnosis. But what happens if the primary diagnosis is something else and the other diagnoses indicate diabetes? Let's take the data where at least one of the three diagnoses is diabetes and plot readmission against the other columns of our data.

And what if we look at which other diagnoses occur most often together with diabetes?

1. Let's describe the results. Patients with diabetes who were readmitted spend more days in the hospital: most stays last 2 to 7 days, with a median of 4 days. Without readmission, most stays last 2 to 6 days, also with a median of 4 days.

2. The number of procedures for each diagnosis is approximately the same whether or not the patient was readmitted. As I understand it, the doctors' actions and the tests they order depend almost entirely on the diagnosis, not on whether the patient is later readmitted.

3. Patients aged 40 to 60 spend, on average, a couple of days longer in the hospital when they are readmitted. For the remaining patients, the distribution of time in hospital is approximately the same with or without readmission.

4. Most patients were not tested for glucose or A1C levels, yet about 70% of patients were prescribed diabetes medication.

5. From the last row of graphs, we can conclude that there are four main diagnoses: circulatory, diabetes, other and respiratory. For all diagnoses but one, patients without readmission outnumber readmitted ones, in line with the dataset as a whole. And oh yes, the exception is DIABETES: although fewer patients carry this diagnosis, readmissions among them outnumber non-readmissions. We set out to understand how our data is distributed and, in the end, came across the answer to the second question.
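The length-of-stay ranges in the first point read like quartiles, so a grouped summary is one way to reproduce them. This is a sketch on invented numbers; the column names `time_in_hospital` and `readmitted` are assumptions.

```python
import pandas as pd

# Invented stays for patients with a diabetes diagnosis; column names are assumptions.
toy = pd.DataFrame({
    "time_in_hospital": [2, 4, 7, 5, 3, 4, 6, 2],
    "readmitted": ["yes", "yes", "yes", "yes", "no", "no", "no", "no"],
})

# Lower quartile, median and upper quartile of the stay, per readmission status.
stay_summary = (
    toy.groupby("readmitted")["time_in_hospital"]
    .describe()[["25%", "50%", "75%"]]
)
print(stay_summary)
```

Comparing the interquartile range rather than the mean keeps a few very long stays from distorting the picture.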

We have the answer to the second question, but let's plot this graph to see it more clearly.

Before we settle on this conclusion, let's see how the other diagnoses behave with and without readmission.

In this one age subgroup, readmitted patients outnumber non-readmitted ones for virtually all secondary diagnoses. Let's keep this result in mind: these patients are at risk and deserve attention.

We can relax, exhale and move on to the third question of our research.

3. On what groups of patients should the hospital focus their follow-up efforts to better monitor patients with a high probability of readmission?

We will therefore set further model development aside and try to answer the third question by analyzing our data.

Let's split our patients by age and try to give recommendations for each group. This produces a large number of similar graphs, but it shows what is distinctive about each age group.
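The per-group view can also be summarised as a table of readmission rates, which is easier to scan than many similar graphs. The sketch below uses invented rows and assumes the columns `age`, `diag_1` and `readmitted`.

```python
import pandas as pd

# Invented rows; column names and labels are assumptions about the dataset.
toy = pd.DataFrame({
    "age": ["[40-50)"] * 4 + ["[70-80)"] * 4,
    "diag_1": ["Diabetes", "Diabetes", "Other", "Other",
               "Circulatory", "Circulatory", "Circulatory", "Other"],
    "readmitted": ["yes", "yes", "no", "yes", "yes", "no", "no", "no"],
})
toy["readmitted_flag"] = toy["readmitted"].eq("yes").astype(int)

# Readmission rate and group size for every age group / primary diagnosis pair,
# listed from the highest rate to the lowest within each age group.
risk_table = (
    toy.groupby(["age", "diag_1"])["readmitted_flag"]
    .agg(rate="mean", patients="size")
    .reset_index()
    .sort_values(["age", "rate"], ascending=[True, False])
)
print(risk_table)
```

Keeping the `patients` column next to the rate matters: a high rate in a group of two patients says much less than the same rate in a group of thousands.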

Age group 40-50 years

We have described the youngest age subgroup. Let's write down the recommendations for it:

Age group 50-60 years

Therefore, they will not differ much from the previous recommendations.

Age group 60-70 years

Our recommendations.

Age group 70-80 years

Age group 80-90 years

Age group 90-100 years

Conclusion

Once again, I apologize for so many near-identical graphs; I lacked either the time or the knowledge to make them more varied.

So which patient groups should the hospital focus its next efforts on?

💪 Competition challenge

Analysis: Aleksey Schukin

https://www.linkedin.com/in/aleksey-schukin/

March 2023

Result

1. The most common primary diagnosis is "Circulatory"; among patients younger than 50 it is the second most common. Depending on the age group, it accounts for about 20-30% of all admissions. The second most common primary diagnosis is the "Other" group, and the third is respiratory diseases.

Half of all patients fall into two age groups:
70 to 80 years: 6,837 patients
60 to 70 years: 5,913 patients

2. A third of all re-hospitalized patients have a diagnosis of Diabetes in their medical records.

Patients with diabetes who were readmitted spend more days in the hospital: most stays last 2 to 7 days, with a median of 4 days. Without readmission, most stays last 2 to 6 days, also with a median of 4 days.

Patients aged 40 to 60 spend, on average, a couple of days longer in the hospital when they are readmitted. For the remaining patients, the distribution of time in hospital is approximately the same with or without readmission.

Most patients were not tested for glucose or A1C levels, yet about 70% of patients were prescribed diabetes medication.

The doctors who suggested that diabetes affects the frequency of readmission were RIGHT. In every age group, patients diagnosed with diabetes were readmitted more often than not, even though readmitted patients are the minority in the dataset as a whole. Whether or not glucose and A1C tests were performed also supports this conclusion.

3. The hospital should focus its further efforts on the following patient groups:

A diagnosis of diabetes mellitus is always associated with a high risk of readmission, regardless of the patient's age. With increasing age, the hospital should also watch closely patients with the following diagnoses:

Circulatory;
Other;
Respiratory;
Digestive.

Patients who received inpatient or outpatient care at least once during the calendar year also fall into the high-risk group for readmission.

The number of days spent in the hospital matters as well: each age group has a critical length of stay beyond which the risk of readmission increases.

The number of procedures also affects the risk of readmission. The effect differs by age, but the pattern is clear: as the number of procedures grows, so does the number of readmitted patients.

I would also recommend glucose and A1C tests for patients aged 60 to 90, at least for two diagnoses, diabetes and circulatory, since this may subsequently reduce the risk of readmission.
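One way to look for the critical length of stay in each age group is to tabulate the readmission rate by days in hospital, one column per age group. The sketch below runs on invented rows and assumes the columns `age`, `time_in_hospital` and `readmitted`.

```python
import pandas as pd

# Invented rows; column names and labels are assumptions about the dataset.
toy = pd.DataFrame({
    "age": ["[60-70)"] * 4 + ["[70-80)"] * 4,
    "time_in_hospital": [2, 2, 8, 8, 2, 2, 8, 8],
    "readmitted": ["no", "no", "yes", "no", "no", "yes", "yes", "yes"],
})
toy["readmitted_flag"] = toy["readmitted"].eq("yes").astype(int)

# Rows: days in hospital; columns: age group; values: share of readmitted patients.
rate_by_stay = (
    toy.groupby(["age", "time_in_hospital"])["readmitted_flag"]
    .mean()
    .unstack("age")
)
print(rate_by_stay)
```

Reading down a column shows where the rate starts to climb for that age group; on real data each cell should also be checked against its patient count before it is trusted.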

Reducing hospital readmissions

📖 Background

You work for a consulting company helping a hospital group better understand patient readmissions. The hospital gave you access to ten years of information on patients readmitted to the hospital after being discharged. The doctors want you to assess if initial diagnoses, number of procedures, or other variables could help them better understand the probability of readmission.

They want to focus follow-up calls and attention on those patients with a higher probability of readmission.

import itertools

import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px

from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, mean_squared_error as MSE, roc_auc_score
from sklearn.metrics import roc_curve, auc, confusion_matrix, classification_report
import sklearn.metrics as metrics

from sklearn.linear_model import SGDClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb

import tensorflow as tf
from tensorflow.keras import Input, regularizers
from tensorflow.keras.layers import (Dense, Conv1D, MaxPool1D, Activation, Dropout, Flatten,
                                     Embedding, concatenate, LSTM, BatchNormalization)
from tensorflow.keras.callbacks import LearningRateScheduler, ReduceLROnPlateau, EarlyStopping

import torch
import torch.nn as nn
import torch.optim as optim
#from torch.utils.data import Dataset, DataLoader

sns.set()

# Data
df = pd.read_csv('data/hospital_readmissions.csv')
df.head()

Part 1: Familiarity with the data

Let's begin our story with the standard data checks. I like the code below for checking for gaps in the data: its output gives a clear picture of what is missing in each column.

def percent_hbar(df, old_threshold=None):
    # Share of missing values per column, in percent, sorted ascending.
    percent_of_nulls = (df.isnull().sum() / len(df) * 100).sort_values().round(2)
    # The mean share across columns is used as the reference threshold.
    threshold = percent_of_nulls.mean()
    ax = percent_of_nulls.plot(kind='barh', figsize=(10, 14),
                               title='% of NaN (from {} lines)'.format(len(df)),
                               color='#86bf91', legend=False, fontsize=25)
    # Label each bar that has missing values; red marks columns above the threshold.
    # Bars of a barh plot sit at integer positions 0, 1, 2, ...
    for i, percent in enumerate(percent_of_nulls):
        if percent > 0:
            color = 'red' if percent > threshold else 'blue'
            ax.text(percent + 0.1, i + 0.09, str(percent) + '%', color=color,
                    fontweight='bold', fontsize='large')
    if old_threshold is not None:
        # Thin red line: threshold from a previous run; thick green line: current threshold.
        plt.axvline(x=old_threshold, linewidth=1, color='r', linestyle='--')
        ax.text(old_threshold + 0.3, .10, '{0:.2%}'.format(old_threshold / 100),
                color='r', fontweight='bold', fontsize='large')
        plt.axvline(x=threshold, linewidth=5, color='green', linestyle='--')
        ax.text(threshold + 0.3, .7, '{0:.2%}'.format(threshold / 100),
                color='green', fontweight='bold', fontsize='large')
    else:
        plt.axvline(x=threshold, linewidth=3, color='r', linestyle='--')
        ax.text(threshold + 0.3, .7, '{0:.2%}'.format(threshold / 100),
                color='r', fontweight='bold', fontsize='large')
    ax.set_xlabel('')
    return ax, threshold

plot, threshold = percent_hbar(df)
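The plot can be cross-checked numerically with a few pandas calls. The sketch below uses a small invented dataframe with deliberate gaps, so the column names and values are assumptions; on the real data the same calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Small invented dataframe with deliberate gaps, for illustration only.
toy = pd.DataFrame({
    "age": ["[40-50)", "[50-60)", None],
    "n_procedures": [1, np.nan, 3],
    "readmitted": ["yes", "no", "no"],
})

# Number of missing values per column.
missing_count = toy.isna().sum()

# The same information as a percentage of all rows.
missing_percent = (toy.isna().mean() * 100).round(2)

# Single yes/no answer: is anything missing at all?
has_gaps = bool(missing_count.any())

print(missing_count)
print(missing_percent)
print("Any gaps:", has_gaps)
```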

The plot shows a happy outcome: there are no gaps in our data. Let's look at the data itself in more detail.

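df = df.drop_duplicates()  # drop_duplicates returns a new dataframe, so keep the result
df.info()                  # info() prints its summary directly and returns None
display(df.describe())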

variables = pd.DataFrame(columns=['Variable','Number of unique values','Values'])

for i, var in enumerate(df.columns):
    variables.loc[i] = [var, df[var].nunique(), df[var].unique().tolist()]
variables.set_index('Variable', inplace=True)    
variables

We can see that the data has already been preprocessed: there are no missing values and no visible errors. We have also converted some columns to a categorical type, but we still need to think about how to make them easier to read and assign convenient values. In the next chapter I propose to take all the nuances of the dataframe into account, carry out the initial analysis, and start on the tasks set for this work.
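As a rough illustration of the categorical conversion and of "convenient values", the sketch below turns an age column into an ordered categorical and derives a more readable label. The column names, the bracketed age format and the rows are assumptions, not taken from the real file.

```python
import pandas as pd

# Invented rows; the bracketed age format is an assumption about the dataset.
toy = pd.DataFrame({
    "age": ["[70-80)", "[40-50)", "[60-70)"],
    "readmitted": ["no", "yes", "no"],
})

# Ordered categorical: age groups compare and sort from youngest to oldest.
age_order = ["[40-50)", "[60-70)", "[70-80)"]
toy["age"] = pd.Categorical(toy["age"], categories=age_order, ordered=True)

# Unordered categorical for a plain yes/no label.
toy["readmitted"] = toy["readmitted"].astype("category")

# More readable labels for plots, e.g. "[40-50)" becomes "40 to 50".
toy["age_label"] = toy["age"].cat.rename_categories(
    lambda label: label.strip("[)").replace("-", " to ")
)

print(toy.sort_values("age"))
print(toy.dtypes)
```

Declaring the order explicitly means sorting and plotting follow age rather than string order, which would otherwise misplace a group such as "[100-110)".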

‌
‌
‌
