Can you help reduce employee turnover?
📖 Background
You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company.
The team needs to better understand the situation: which employees are more likely to leave, and why. Once it is clear which variables affect employee churn, you can present your findings along with your ideas on how to attack the problem.
💾 The data
The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.
- "department" - the department the employee belongs to.
- "promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
- "review" - the composite score the employee received in their last evaluation.
- "projects" - how many projects the employee is involved in.
- "salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
- "tenure" - how many years the employee has been at the company.
- "satisfaction" - a measure of employee satisfaction from surveys.
- "avg_hrs_month" - the average hours the employee worked in a month.
- "left" - "yes" if the employee ended up leaving, "no" otherwise.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn import datasets
from sklearn import metrics
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import balanced_accuracy_score
from scipy.stats import chi2_contingency
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
import numpy as np
df = pd.read_csv('./data/employee_churn_data.csv')
df.head()
df.left = df["left"].map({"no": 0, "yes": 1})
df
💪 Competition challenge
Create a report that covers the following:
- Which department has the highest employee turnover? Which one has the lowest?
- Investigate which variables seem to be better predictors of employee departure.
- What recommendations would you make regarding ways to reduce employee turnover?
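The first question above comes down to a grouped mean once `left` is encoded as 0/1. The following is a minimal sketch on a tiny made-up table (the `sample` frame and its values are assumptions for illustration, not the competition data); on the real data the same `groupby` would be applied to `df`.

```python
import pandas as pd

# Tiny made-up sample; the real notebook would use the loaded `df` instead.
sample = pd.DataFrame({
    "department": ["IT", "IT", "sales", "sales", "sales", "retail", "retail", "retail"],
    "left":       [1,    0,    0,       0,       1,       1,        1,        0],
})

# With `left` encoded as 0/1, the group mean is the share of leavers.
turnover_by_dept = (
    sample.groupby("department")["left"].mean().sort_values(ascending=False)
)

highest = turnover_by_dept.idxmax()  # department with the largest share of leavers
lowest = turnover_by_dept.idxmin()   # department with the smallest share of leavers

print(turnover_by_dept)
print("highest:", highest, "| lowest:", lowest)
```

Sorting the result makes the ranking readable at a glance and feeds directly into a bar chart.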
🧑⚖️ Judging criteria
CATEGORY | WEIGHTING | DETAILS |
---|---|---|
Recommendations | 35% | |
Storytelling | 35% | |
Visualizations | 20% | |
Votes | 10% | |
✅ Checklist before publishing into the competition
- Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
- Remove redundant cells like the judging criteria, so the workbook is focused on your story.
- Make sure the workbook reads well and explains how you found your insights.
- Check that all the cells run without error.
⌛️ Time is ticking. Good luck!
from scipy import stats
def calc_binary_corr(feature: str, df) -> pd.DataFrame:
    """Calculates the point-biserial correlation (quantitative vs. binary).

    Args:
        feature: the binary feature that will be used for the correlation
        df: the DataFrame holding the columns; every listed column must already be numeric
    Returns:
        data_co: DataFrame with the correlation between the binary feature and each quantitative feature
    """
    # 'salary' and 'left' only work here after they have been mapped to numbers
    features = ['review', 'projects', 'tenure', 'satisfaction', 'bonus', 'avg_hrs_month', 'salary', 'left']
    correlations = []
    for i in features:
        # Calculating point biserial corr
        correlations.append(stats.pointbiserialr(df[i], df[feature]).correlation)
    data_co = pd.DataFrame({"feature": features, "correlations": correlations})
    return data_co
def cat_segnificance():
    # Chi-square test of independence between `left` and each categorical feature
    seg = []
    non = []
    features = ['department', 'promoted', 'tenure', 'bonus', 'projects', 'salary']
    alpha = 0.05
    for i in features:
        CrosstabResult = pd.crosstab(index = df['left'], columns = df[i])
        stat, p, dof, expected = chi2_contingency(CrosstabResult)
        # Store the p-value in the column that matches the test outcome
        if p <= alpha:
            seg.append(p)
            non.append(None)
        else:
            seg.append(None)
            non.append(p)
    return pd.DataFrame({"Segneficant": seg, "Non Segnificant": non, "col" : features})
def plot_category(var, x_shift = 0.05, hei = 0.02):
    # Share of employees who left within each level of `var` (a fraction between 0 and 1)
    new = (df.groupby(["left", var])[var].count()/df.groupby(var)[var].count()).reset_index(name = "_")
    ax = new[new.left == 1][[var, "_"]].plot(x = var, kind = "bar", width = 0.8, figsize = (10,5))
    plot_legends(ax, 'Percentage of Leaving Employees', var, x_shift, hei, string = "%")
def plot_legends(ax, title, x_label, x_shift = 0.09, hei = 20, string = "", x = False):
    """Sets titles and removes unnecessary parts of the graph to maintain the data-to-ink ratio
    Args:
        ax: The axis instance to be plotted
        title: the title of the plot
        x_label: the label of the x axis
        x_shift: the amount by which the bar or line marker is shifted in x axis
        hei: the amount by which the bar or line marker is shifted in the y axis
        string: suffix appended to each annotation (for example "%")
        x: if False or 0, the left spine and the y ticks are hidden
    """
    # Add title and remove borders
    ax.set_title(title)
    ax.set_xlabel(x_label)
    # ax.set_yticks([])
    if x == 0:
        ax.spines["left"].set_visible(False)
        ax.set_yticks([])
    ax.spines["right"].set_visible(False)
    # ax.spines["bottom"].set_visible(False)
    ax.spines["top"].set_visible(False)
    # Adding text annotations at the upper part of the bar or line marker
    for p in ax.patches:
        ax.annotate(str(round(p.get_height(), 2)) + string , (p.get_x() + x_shift, p.get_height() + hei), fontsize = 14)
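The `calc_binary_corr` helper above relies on the point-biserial correlation, which is numerically the same as a Pearson correlation in which one variable is coded 0/1. The sketch below illustrates this on made-up numbers (the `left` and `satisfaction` arrays are assumptions, not values from the dataset).

```python
import numpy as np
from scipy import stats

# Made-up toy data: 1 = employee left, 0 = employee stayed.
left = np.array([0, 0, 0, 0, 1, 1, 1, 1])
satisfaction = np.array([0.8, 0.7, 0.9, 0.6, 0.4, 0.5, 0.3, 0.6])

# Point-biserial correlation between the binary and the continuous variable.
r_pb, p_value = stats.pointbiserialr(left, satisfaction)

# The same number comes out of an ordinary Pearson correlation.
r_pearson = np.corrcoef(left, satisfaction)[0, 1]

print("point-biserial:", round(r_pb, 3), "| pearson:", round(r_pearson, 3))
```

A negative value here means the leavers in the toy sample have lower satisfaction on average than the stayers.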
Introduction
Many factors can push an employee to leave the company, which raises the company's turnover. These might include:
- Bad management or under-recognition from the manager (promotion, bonus, bad team)
- Salary relative to seniority level
- Too many working hours, or too many projects for one person
- Employee dissatisfaction with the company
All these reasons fall under three question groups:
- Department-related turnover
- Working style
- Satisfaction based on (satisfaction feature, salar
In the following report we will check every single reason, both independently and combined, and verify each one statistically. Let's start our series of hypotheses.
General Summary:
- The turnover rate is high, about 30% on average.
- The average working hours per month for each employee exceed 170, which corresponds to roughly 8 hours or more a day, 5 days a week.
- The average satisfaction rate is low across all departments, about 0.5.
- The average review score is generally not high, about 0.65.
- Review scores are higher at tenure levels 2, 3 and 4.
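The working-hours point can be checked with simple arithmetic. The sketch below converts monthly hours into hours per working day, assuming a 5-day week, 52 weeks a year and no holidays (these constants and the helper name `monthly_to_daily_hours` are assumptions for illustration).

```python
WORKING_DAYS_PER_WEEK = 5
WEEKS_PER_YEAR = 52
MONTHS_PER_YEAR = 12


def monthly_to_daily_hours(hours_per_month: float) -> float:
    """Convert average monthly hours into hours per working day."""
    # About 21.67 working days per month under the assumptions above.
    working_days_per_month = WORKING_DAYS_PER_WEEK * WEEKS_PER_YEAR / MONTHS_PER_YEAR
    return hours_per_month / working_days_per_month


# 170 is the average quoted above; 185 and 190 bound the range flagged in the recommendations.
for hours in (170, 185, 190):
    print(hours, "hours/month ->", round(monthly_to_daily_hours(hours), 2), "hours/day")
```

Under these assumptions 170 hours a month is a little under 8 hours a day, while 185 to 190 hours is clearly above it; accounting for holidays would push every figure higher.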
Actions and Synthesis:
- A machine learning model can be used to predict employee churn, so the problem can be addressed before it happens.
- Three main factors play an important role in combination: average hours, satisfaction and review.
- The company should take measures to increase satisfaction levels.
- Satisfaction should be broken down into team satisfaction, workplace satisfaction and salary satisfaction for future use.
- Working hours should be restricted, as higher turnover is observed from 185 to 190 average monthly working hours regardless of other features.
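To make the first action concrete, here is a minimal sketch of a churn classifier built with the libraries already imported in this notebook. It runs on synthetic data with a made-up leaving rule (both are assumptions for illustration), so the score says nothing about the real dataset; it only shows the shape of the workflow.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the three factors named above.
rng = np.random.default_rng(0)
n = 1000
synthetic = pd.DataFrame({
    "avg_hrs_month": rng.uniform(170, 200, n),
    "satisfaction": rng.uniform(0, 1, n),
    "review": rng.uniform(0.3, 1, n),
})
# Made-up rule purely so the toy model has something to learn.
synthetic["left"] = (
    (synthetic["avg_hrs_month"] > 185) & (synthetic["satisfaction"] < 0.5)
).astype(int)

X = synthetic[["avg_hrs_month", "satisfaction", "review"]]
y = synthetic["left"]

# Stratify so both splits keep the same share of leavers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A shallow tree keeps the learned rules easy to read and explain.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Balanced accuracy averages recall over both classes, which suits imbalanced churn data.
score = balanced_accuracy_score(y_test, model.predict(X_test))
print("balanced accuracy:", round(score, 3))
```

Balanced accuracy is used instead of plain accuracy because leavers are the minority class, and a model that always predicts "stays" would otherwise look deceptively good.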
H1: Is the turnover dependent on department?
If a team is not good, satisfaction should be low. Here I will explore whether some departments tend to have a lower satisfaction rate than others.
H1 Summary:
- There doesn't seem to be a relation between satisfaction and department, and there is no significant difference in turnover rates per department either: the turnover percentage stays between 27 and 31 across departments.
- There seems to be a bit of a problem in all departments, which needs to be investigated further.
- The department feature is not statistically significant for the turnover rate.
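The significance statement above rests on a chi-square test of independence between department and `left`. The sketch below runs that test on made-up counts with leave rates close to each other (the `counts` table is an assumption for illustration, not the real cross-tabulation).

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Made-up counts: leave rates of 30%, 28% and 31% in three departments.
counts = pd.DataFrame(
    {"stayed": [70, 72, 69], "left": [30, 28, 31]},
    index=["IT", "sales", "retail"],
)

# Null hypothesis: leaving is independent of department.
stat, p_value, dof, expected = chi2_contingency(counts)

alpha = 0.05
is_significant = p_value <= alpha

print("chi2:", round(stat, 3), "| p-value:", round(p_value, 3), "| dof:", dof)
print("significant at 5%:", is_significant)
```

When the rates differ by only a few points at this sample size, the p-value stays well above 0.05, so the test gives no evidence that turnover depends on department.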
data = pd.get_dummies(df, columns = ["department"])
df.salary = df["salary"].map({"low": 1, "medium": 2, "high": 3})