Dr. Stripplot or: How I Learned to Stop Worrying and Understand the Churn
Context
The scenario casts us as employees in the human capital department of a large corporation. The Board is worried about the relatively high turnover, and our team must look into ways to reduce the number of employees leaving the company.
The team needs to understand the situation better: which employees are more likely to leave, and why. The aim is to make clear what variables impact employee churn, and to present findings along with ideas on how to attack the problem.
Data structure
The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.
| variable | description |
|---|---|
| department | the department the employee belongs to. |
| promoted | 1 if the employee was promoted in the previous 24 months, 0 otherwise. |
| review | the composite score the employee received in their last evaluation. |
| projects | how many projects the employee is involved in. |
| salary | for confidentiality reasons, salary comes in three tiers: low, medium, high. |
| tenure | how many years the employee has been at the company. |
| satisfaction | a measure of employee satisfaction from surveys. |
| bonus | 1 if the employee received a bonus in the previous 24 months, 0 otherwise. |
| avg_hrs_month | the average hours the employee worked in a month. |
| left | "yes" if the employee ended up leaving, "no" otherwise. |
Objectives
- Which department has the highest employee turnover? Which one has the lowest?
- Investigate which variables seem to be better predictors of employee departure.
- What recommendations would you make regarding ways to reduce employee turnover?
Glossary
Before proceeding to the analysis, it sometimes takes a little extra effort to find terms that are both brief and clear. Even leaving aside the complicated vocabulary of science, one can have trouble picking a suitable pair of words to name binary classes. The target variable selected for this study is the `left` column, which has only `yes` and `no` as available values; these aren't informative and make poor class labels. A single word describing an employee who had left the company didn't come to mind quickly, so I referred to this article and decided to use *former employee*, or shortly *former*, for `yes`, and *current employee*, or *current*, for the rest.
Workspace setup
```python
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

%matplotlib inline

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Set up two-decimal format for floats in pandas objects
pd.set_option('display.float_format', '{0:.2f}'.format)
np.set_printoptions(precision=3, suppress=True)

# matplotlib rcParams for context
context = {'font.size': 11.0,
           'axes.labelsize': 'large',
           'axes.titlesize': 'large',
           'axes.linewidth': 0.1,
           'xtick.labelsize': 'large',
           'ytick.labelsize': 'large',
           'grid.linewidth': 0.1,
           'patch.linewidth': 0.1,
           'legend.fontsize': 'large',
           'legend.title_fontsize': 'large'}

# matplotlib rcParams for style
style = {'axes.facecolor': 'white',
         'axes.edgecolor': 'black',
         'axes.labelcolor': 'black',
         'figure.facecolor': 'white',
         'grid.color': 'gray',
         'grid.linestyle': '--',
         'text.color': 'black',
         'xtick.color': 'gray',
         'ytick.color': 'gray',
         'xtick.direction': 'out',
         'ytick.direction': 'out',
         'axes.grid': True,
         'axes.spines.right': False,
         'axes.spines.top': False,
         'font.family': ['sans-serif'],
         'font.sans-serif': ['Liberation Sans']}

# Set up global rcParams
sns.set_theme(context=context, style=style)
```
Feature engineering
Read the dataset
We start by reading `employee_churn_data.csv`, provided by the team, into a pandas DataFrame, `df`. It turns out `df` has a number of weighty `object` columns, as well as some variables stored as `int64` and `float64`. As a result, `df` uses up around 2.3 MB of memory, which invites optimisation. First, we can save some memory by downcasting `int64` to `int8`: the numbers in these columns are strictly single-digit, so an 8-bit integer is more than enough.
```python
# create filepath
filepath = r'data/employee_churn_data.csv'

# read the .csv file into a DataFrame
df = pd.read_csv(filepath)

# display 5 random rows from df
display(df.sample(n=5))

# evaluate memory usage in KB
print('\ninitial dataset precise memory usage:',
      (df.memory_usage(index=True, deep=True).values.sum() / 1000).round(2),
      'KB\n')
```

```python
dtypes = {'promoted': 'int8',
          'projects': 'int8',
          'tenure': 'int8',
          'bonus': 'int8'}
df = pd.read_csv(filepath, dtype=dtypes)
print(df.info())
print('\nprecise memory usage with int8:',
      (df.memory_usage(index=True, deep=True).values.sum() / 1000).round(2),
      'KB')
```
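As a sanity check on the downcast, the bounds of `int8` and the per-column savings can be illustrated on toy data (a stand-in Series, not the project's dataset):

```python
import numpy as np
import pandas as pd

# stand-in for a single-digit count column like 'projects'
s64 = pd.Series(np.random.randint(0, 10, size=10_000), dtype='int64')
s8 = s64.astype('int8')

# int8 spans -128..127, ample for single-digit values
print(np.iinfo('int8').min, np.iinfo('int8').max)

# 8 bytes vs 1 byte per value (index excluded)
print(s64.memory_usage(index=False), s8.memory_usage(index=False))
```

The downcast is lossless here, since every value fits comfortably inside the `int8` range.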
Dtypes revision
Converting to `int8` helped reduce memory usage by around 300 KB, which is only a minor part of the possible optimisation. Some of the `int8` variables hold simple 0/1 binary values and could easily be represented as `bool`, and thus treated as categorical. We can treat the other `int8` variables, which represent discrete features, as categoricals too, since they have very few unique values: the number of years at the company, or the number of projects an employee takes part in. In `projects`, employees having from 1 to 3 current projects form three distinct groups. In `tenure`, each number of years at the company forms a separate group of employees.
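A quick way to confirm that a discrete column is a good categorical candidate is to count its unique values. A minimal sketch on made-up data (the column names mirror the real ones, the values are illustrative):

```python
import pandas as pd

# toy frame mimicking the discrete columns
toy = pd.DataFrame({'projects': [1, 2, 3, 2, 1, 3],
                    'tenure':   [5, 6, 7, 5, 8, 6]})

# few unique values -> cheap, meaningful categories
print(toy.nunique())
```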
From the description, and by looking at the initial data structure, it became clear what the `object` columns should be converted to: `category` variables, if declared with proper category names and order, are human-readable, suit presentation better, and can be codified for easier computing. The `department` column contains department labels, and these categories aren't ordinal (i.e. it doesn't matter which number codifies a department), which means converting to the `category` dtype when reading the csv file is safe. This, however, doesn't apply to the `salary` column, where, if done by default, categories would be ordered alphabetically, codifying 'high' as 0, 'low' as 1, and so on, which scrambles the meaning of the labels. The order can be specified by declaring the variable as a `CategoricalDtype`.
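The alphabetical-ordering pitfall is easy to demonstrate on a throwaway Series (toy data, not the project's):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(['low', 'high', 'medium', 'low'])

# default conversion sorts categories alphabetically: 'high' gets code 0
default = s.astype('category')
print(list(default.cat.categories))  # ['high', 'low', 'medium']
print(list(default.cat.codes))       # [1, 0, 2, 1]

# an explicit ordered dtype keeps low < medium < high
tiers = CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
print(list(s.astype(tiers).cat.codes))  # [0, 2, 1, 0]
```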
The same step should first be taken to safely convert the `left` column to `category`: a computer can't tell whether a 'no' should be 0 or 1 (same with a 'yes'), so the column cannot be converted to boolean directly. Finally, there are variables containing survey scores and performance measures, which are stored as floats. Having explored the data structure of the original csv file, it is now safe to change the `df` column dtypes as discussed. After that, we can safely replace `left` with its category codes, where 'no' becomes 0 and 'yes' becomes 1.
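Another explicit way to resolve the yes/no ambiguity is a hand-written mapping (a minimal sketch; the cell below takes the equivalent route through category codes instead):

```python
import pandas as pd

s = pd.Series(['no', 'yes', 'yes', 'no'])

# spell out which label means True so nothing is left to guesswork
b = s.map({'no': False, 'yes': True}).astype('bool')
print(b.dtype, list(b))  # bool [False, True, True, False]
```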
These manipulations reduced memory usage from 2.3 MB to 298 KB, which is almost 8 times less.
```python
from pandas.api.types import CategoricalDtype

dtypes.update({'department': 'category',
               'salary': CategoricalDtype(categories=['low', 'medium', 'high'],
                                          ordered=True),
               'left': CategoricalDtype(categories=['no', 'yes'],
                                        ordered=True)})
df = df.astype(dtypes)
df['left'] = df['left'].cat.codes
display(df.sample(n=5))
print(df.info())
print('\nprecise memory usage after dtype optimisation:',
      (df.memory_usage(index=True, deep=True).values.sum() / 1000).round(2),
      'KB')
```