Cracking the Turnover Analysis (Analysis and Modeling)


Can you help reduce employee turnover?

📖 Background

You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company.

The team needs to better understand the situation: which employees are more likely to leave, and why. Once it is clear which variables impact employee churn, you can present your findings along with your ideas on how to attack the problem.

💾 The data

The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.

"department" - the department the employee belongs to.
"promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
"review" - the composite score the employee received in their last evaluation.
"projects" - how many projects the employee is involved in.
"salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
"tenure" - how many years the employee has been at the company.
"satisfaction" - a measure of employee satisfaction from surveys.
"avg_hrs_month" - the average hours the employee worked in a month.
"left" - "yes" if the employee ended up leaving, "no" otherwise.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import balanced_accuracy_score
from scipy.stats import chi2_contingency

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import numpy as np

df = pd.read_csv('./data/employee_churn_data.csv')
df.head()

df.left = df["left"].map({"no": 0, "yes": 1})

df

💪 Competition challenge

Create a report that covers the following:

Which department has the highest employee turnover? Which one has the lowest?
Investigate which variables seem to be better predictors of employee departure.
What recommendations would you make regarding ways to reduce employee turnover?

🧑‍⚖️ Judging criteria

CATEGORY	WEIGHTING	DETAILS
Recommendations	35%	Clarity of recommendations - how clear and well presented the recommendation is. Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid? Number of relevant insights found for the target audience.
Storytelling	35%	How well the data and insights are connected to the recommendation. How the narrative and whole report connects together. Balancing making the report in-depth enough but also concise.
Visualizations	20%	Appropriateness of visualization used. Clarity of insight from visualization.
Votes	10%	Up voting - most upvoted entries get the most points.

✅ Checklist before publishing into the competition

Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
Remove redundant cells like the judging criteria, so the workbook is focused on your story.
Make sure the workbook reads well and explains how you found your insights.
Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

from scipy import stats

def calc_binary_corr(feature: str, df) -> pd.DataFrame:
    """Calculates the point-biserial correlation (quantitative vs. binary).
    Args:
        feature: the binary feature that will be used for the correlation
        df: the DataFrame holding the features
    Returns:
        data_co: DataFrame with the correlation between the binary feature and each quantitative feature
    """

    # NOTE: every column listed here must be numeric, so 'salary' and 'left'
    # have to be mapped to numbers before this function is called
    features = ['review', 'projects', 'tenure', 'satisfaction', 'bonus', 'avg_hrs_month', 'salary', 'left']
    correlations = []

    for i in features:
        # Calculating point biserial corr
        correlations.append(stats.pointbiserialr(df[i], df[feature]).correlation)

    data_co = pd.DataFrame({"feature": features, "correlations": correlations})
    return data_co

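def cat_segnificance():
    """Runs a chi-square test of independence between 'left' and each categorical feature of the global df."""
    seg = []
    non = []
    features = ['department', 'promoted', 'tenure', 'bonus', 'projects', 'salary']
    for i in features:

        # Contingency table of turnover against the current feature
        CrosstabResult = pd.crosstab(index = df['left'], columns = df[i])

        stat, p, dof, expected = chi2_contingency(CrosstabResult)

        # Store the p-value in the column matching the test outcome
        alpha = 0.05
        if p <= alpha:
            seg.append(p)
            non.append(None)
        else:
            seg.append(None)
            non.append(p)

    return pd.DataFrame({"Significant": seg, "Non Significant": non, "col" : features})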

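def plot_category(var, x_shift = 0.05, hei = 0.02):
    """Plots, for each value of the column var in the global df, the share of employees who left."""
    # Size of each (left, var) group divided by the size of its var group (a fraction between 0 and 1)
    new = (df.groupby(["left", var])[var].count()/df.groupby(var)[var].count()).reset_index(name = "_")
    ax = new[new.left == 1][[var, "_"]].plot(x = var, kind = "bar", width = 0.8, figsize = (10,5))
    plot_legends(ax, 'Percentage of Leaving Employees', var, x_shift, hei, string = "%")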

def plot_legends(ax, title, x_label, x_shift = 0.09, hei = 20, string = "", x = False):
    """Sets titles and removes unnecessary parts of the graph to maintain the data-to-ink ratio
    Args:
        ax: The axis instance to be plotted
        title: the title of the plot
        x_label: the label of the x axis
        x_shift: the amount by which each annotation is shifted along the x axis
        hei: the amount by which each annotation is shifted along the y axis
        string: suffix appended to each annotation (for example "%")
        x: if False, the left spine and the y ticks are hidden
    """
    # Add title and remove borders
    ax.set_title(title)
    ax.set_xlabel(x_label)
    # ax.set_yticks([])
    if not x:
        ax.spines["left"].set_visible(False)
        ax.set_yticks([])
    ax.spines["right"].set_visible(False)
    # ax.spines["bottom"].set_visible(False)
    ax.spines["top"].set_visible(False)

    # Adding text annotations at the upper part of the bar or line marker
    for p in ax.patches:
        ax.annotate(str(round(p.get_height(), 2)) + string, (p.get_x() + x_shift, p.get_height() + hei), fontsize = 14)
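As a quick sanity check of the two statistics the helpers above rely on, here is a minimal sketch on made-up data. The `toy` frame, its `group` column and the rule used to generate `left` are assumptions for illustration only; they are not the employee dataset.

```python
import numpy as np
import pandas as pd
from scipy import stats
from scipy.stats import chi2_contingency

# Toy data (hypothetical, NOT the employee dataset)
rng = np.random.default_rng(0)
n = 500
toy = pd.DataFrame({
    "satisfaction": rng.uniform(0, 1, n),
    "group": rng.choice(["a", "b", "c"], n),
})
# Assumed rule: the lower the satisfaction, the more likely the employee leaves
toy["left"] = (rng.uniform(0, 1, n) > toy["satisfaction"]).astype(int)

# Point-biserial correlation: binary variable vs. quantitative variable
r, p_r = stats.pointbiserialr(toy["left"], toy["satisfaction"])

# Chi-square test of independence: binary variable vs. categorical variable
table = pd.crosstab(index=toy["left"], columns=toy["group"])
chi2, p_chi, dof, expected = chi2_contingency(table)

print(f"point-biserial r = {r:.2f} (p = {p_r:.3g})")
print(f"chi-square = {chi2:.2f}, dof = {dof}, p = {p_chi:.3g}")
```

On this toy data the correlation comes out negative by construction, while `group` is generated independently of `left`, so the chi-square test should usually not flag it.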

Introduction

Employees may leave the company for many reasons, each of which increases the company's turnover. These reasons might include:

Bad management or under-recognition from the manager (promotion, bonus, bad team)
Salary relative to seniority level
Too many working hours, or more projects than the person can handle
Employee dissatisfaction with the company

All those reasons fall under three question groups:

Department-related turnover
Working style
Satisfaction based on (satisfaction feature, salar

In this report we will check every single reason, both independently and combined, and verify the findings statistically. Let's start our series of hypotheses.

General Summary:

The turnover rate is high, about 30% on average.
The average working hours per month for each employee is greater than 170 hours, which means that employees work roughly 8 hours or more a day, 5 days a week.
The average satisfaction rate is low across all departments, about 0.5 on average.
The average review score for employees is generally not high, about 0.65 on average.
The review score is higher at tenure levels 2, 3 and 4.
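The working-hours point above rests on simple arithmetic, sketched below. It assumes 52 working weeks a year with no holidays or leave, which is a simplification; the function name `implied_daily_hours` is hypothetical and 185 is used only as an illustrative input.

```python
HOURS_PER_DAY = 8
DAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 52   # assumption: no holidays or leave
MONTHS_PER_YEAR = 12

# Monthly hours of a plain 8-hours-a-day, 5-days-a-week schedule
baseline_hours_per_month = HOURS_PER_DAY * DAYS_PER_WEEK * WEEKS_PER_YEAR / MONTHS_PER_YEAR


def implied_daily_hours(avg_hrs_month: float) -> float:
    """Converts average monthly hours into hours per working day."""
    working_days_per_month = DAYS_PER_WEEK * WEEKS_PER_YEAR / MONTHS_PER_YEAR
    return avg_hrs_month / working_days_per_month


print(round(baseline_hours_per_month, 1))      # 173.3
print(round(implied_daily_hours(185), 2))      # 8.54
```

Under these assumptions the 8-hour baseline is about 173 hours a month, so monthly averages above that imply more than 8 hours per working day.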

Actions and Synthesis:

A machine learning model can be used to predict employee churn and hence address the problem before it happens.
Three main factors play an important role when combined: average hours, satisfaction and review.
The company should take measures to increase satisfaction levels.
Satisfaction should be broken down into team satisfaction, workplace satisfaction and salary satisfaction for future use.
Working hours should be restricted, as higher turnover is observed from 185 to 190 average working hours regardless of other features.
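To make the first action concrete, here is a minimal sketch of a churn classifier built on the three factors named above. The data is synthetic and the labelling rule is an assumption made up for illustration, so the score says nothing about the real dataset; `DecisionTreeClassifier` is just one possible model choice.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic features (hypothetical values, NOT the employee dataset)
rng = np.random.default_rng(42)
n = 1000
X = pd.DataFrame({
    "satisfaction": rng.uniform(0, 1, n),
    "review": rng.uniform(0.3, 1, n),
    "avg_hrs_month": rng.uniform(170, 200, n),
})
# Assumed labelling rule: long hours combined with low satisfaction lead to leaving
y = ((X["avg_hrs_month"] > 185) & (X["satisfaction"] < 0.5)).astype(int)

# Hold out a stratified test set so both classes appear in each split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)

# Balanced accuracy is less misleading than plain accuracy when classes are imbalanced
score = balanced_accuracy_score(y_test, model.predict(X_test))
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

print(f"balanced accuracy: {score:.2f}")
print(importances)
```

On the real data, the same pattern would apply after encoding the categorical columns, and the feature importances could then be compared with the statistical tests.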

H1: Is the turnover dependent on department?

If the team is not good, satisfaction should be low. Here I will explore whether some departments tend to have lower satisfaction rates than others.

H1 Summary:

There doesn't seem to be a relation between satisfaction and department, and there is no significant difference in turnover rates between departments either: the turnover percentage barely changes, staying between 27% and 31% across all departments.
There seems to be a bit of a problem in all departments, which needs to be investigated further.
The department feature is not statistically significant for the turnover rate.
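The comparison behind this summary can be sketched as follows: compute the turnover rate per department, then test whether department and turnover are independent. The department names and counts below are hypothetical, chosen only to mimic rates between 27% and 31%.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical (leavers, headcount) per department, NOT taken from the dataset
counts = {"dept_a": (31, 100), "dept_b": (29, 100), "dept_c": (27, 100)}

rows = []
for dept, (leavers, headcount) in counts.items():
    rows += [{"department": dept, "left": 1}] * leavers
    rows += [{"department": dept, "left": 0}] * (headcount - leavers)
toy = pd.DataFrame(rows)

# Turnover rate per department: the mean of the 0/1 'left' column
rates = toy.groupby("department")["left"].mean().sort_values(ascending=False)
highest, lowest = rates.idxmax(), rates.idxmin()

# Chi-square test of independence between turnover and department
chi2, p, dof, expected = chi2_contingency(pd.crosstab(toy["left"], toy["department"]))

print(rates)
print(f"highest: {highest}, lowest: {lowest}, p-value: {p:.3f}")
```

With differences this small the p-value stays well above 0.05, which is the pattern described in the summary: the rates differ slightly, but not significantly.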

data = pd.get_dummies(df, columns = ["department"])
df.salary = df["salary"].map({"low": 1, "medium": 2, "high": 3})
