Dr. Stripplot or: How I Learned to Stop Worrying and Understand the Churn
Context
The scenario casts us as employees in the human capital department of a large corporation. The Board is worried about the relatively high turnover, and our team must look into ways to reduce the number of employees leaving the company.
The team needs to understand the situation better: which employees are more likely to leave, and why. The aim is to make clear what variables impact employee churn, and to present findings along with ideas on how to attack the problem.
Data structure
The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.
| variable | description |
|---|---|
| department | the department the employee belongs to. |
| promoted | 1 if the employee was promoted in the previous 24 months, 0 otherwise. |
| review | the composite score the employee received in their last evaluation. |
| projects | how many projects the employee is involved in. |
| salary | for confidentiality reasons, salary comes in three tiers: low, medium, high. |
| tenure | how many years the employee has been at the company. |
| satisfaction | a measure of employee satisfaction from surveys. |
| bonus | 1 if the employee received a bonus in the previous 24 months, 0 otherwise. |
| avg_hrs_month | the average hours the employee worked in a month. |
| left | "yes" if the employee ended up leaving, "no" otherwise. |
Objectives
- Which department has the highest employee turnover? Which one has the lowest?
- Investigate which variables seem to be better predictors of employee departure.
- What recommendations would you make regarding ways to reduce employee turnover?
Glossary
Before proceeding to the analysis, it sometimes takes a little extra effort to find terms that are both brief and clear. Even leaving aside the complicated vocabulary of science, one can have trouble picking a suitable pair of words to name binary classes. The target variable selected for this study is the `left` column, which has only `yes` and `no` as available values; these aren't informative and make poor class labels. A single word describing an employee who had left the company didn't come to mind quickly, so I referred to this article and decided to use *former employee*, or shortly *former*, for `yes`, and *current employee*, or *current*, for the rest.
Workspace setup
```python
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

%matplotlib inline

import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

# Set up two-decimal format for floats in pandas objects
pd.set_option('display.float_format', '{0:.2f}'.format)
np.set_printoptions(precision=3, suppress=True)

# matplotlib rcParams for context
context = {'font.size': 11.0,
           'axes.labelsize': 'large',
           'axes.titlesize': 'large',
           'axes.linewidth': 0.1,
           'xtick.labelsize': 'large',
           'ytick.labelsize': 'large',
           'grid.linewidth': 0.1,
           'patch.linewidth': 0.1,
           'legend.fontsize': 'large',
           'legend.title_fontsize': 'large'}

# matplotlib rcParams for style
style = {'axes.facecolor': 'white',
         'axes.edgecolor': 'black',
         'axes.labelcolor': 'black',
         'figure.facecolor': 'white',
         'grid.color': 'gray',
         'grid.linestyle': '--',
         'text.color': 'black',
         'xtick.color': 'gray',
         'ytick.color': 'gray',
         'xtick.direction': 'out',
         'ytick.direction': 'out',
         'axes.grid': True,
         'axes.spines.right': False,
         'axes.spines.top': False,
         'font.family': ['sans-serif'],
         'font.sans-serif': ['Liberation Sans']}

# Set up global rcParams
sns.set_theme(context=context, style=style)
```
Feature engineering
Read the dataset
We start by reading `employee_churn_data.csv`, provided by the team, into a pandas DataFrame, `df`. It turns out `df` has a number of weighty `object` columns, as well as some variables stored as `int64` and `float64`. As a result, `df` uses up around 2.3 MB of memory, which invites optimisation. First, we can save some memory by downcasting `int64` to `int8`: the numbers in these columns are strictly single-digit, so an 8-bit integer is more than enough.
```python
# create filepath
filepath = r'data/employee_churn_data.csv'

# read the .csv file into a DataFrame
df = pd.read_csv(filepath)

# display 5 random rows from df
display(df.sample(n=5))

# evaluate memory usage in KB
print('\ninitial dataset precise memory usage:',
      (df.memory_usage(index=True, deep=True).values.sum() / 1000).round(2),
      'KB\n')
```

```python
dtypes = {'promoted': 'int8',
          'projects': 'int8',
          'tenure': 'int8',
          'bonus': 'int8'}
df = pd.read_csv(filepath, dtype=dtypes)
print(df.info())
print('\nprecise memory usage with int8:',
      (df.memory_usage(index=True, deep=True).values.sum() / 1000).round(2),
      'KB')
```
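As a sanity check on the downcast, the bounds of `int8` and the per-column savings can be illustrated on toy data (a stand-in Series, not the project's dataset):

```python
import numpy as np
import pandas as pd

# stand-in for a single-digit count column like 'projects'
s64 = pd.Series(np.random.randint(0, 10, size=10_000), dtype='int64')
s8 = s64.astype('int8')

# int8 spans -128..127, ample for single-digit values
print(np.iinfo('int8').min, np.iinfo('int8').max)

# 8 bytes vs 1 byte per value (index excluded)
print(s64.memory_usage(index=False), s8.memory_usage(index=False))
```

The downcast is lossless here, since every value fits comfortably inside the `int8` range.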
Dtypes revision
Converting to `int8` helped reduce memory usage by around 300 KB, which is only a minor part of the possible optimisation. Some of the `int8` variables hold simple 0/1 binary values and could easily be represented as `bool`, and thus treated as categorical. We can treat the other `int8` variables, which represent discrete features, as categoricals too, since they have very few unique values: the number of years at the company, or the number of projects an employee takes part in. In `projects`, employees having from 1 to 3 current projects form three distinct groups. In `tenure`, each number of years at the company forms a separate group of employees.
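A quick way to confirm that a discrete column is a good categorical candidate is to count its unique values. A minimal sketch on made-up data (the column names mirror the real ones, the values are illustrative):

```python
import pandas as pd

# toy frame mimicking the discrete columns
toy = pd.DataFrame({'projects': [1, 2, 3, 2, 1, 3],
                    'tenure':   [5, 6, 7, 5, 8, 6]})

# few unique values -> cheap, meaningful categories
print(toy.nunique())
```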
From the description, and by looking at the initial data structure, it became clear what the `object` columns should be converted to: `category` variables, if declared with proper category names and order, are human-readable, suit presentation better, and can be codified for easier computing. The `department` column contains department labels, and these categories aren't ordinal (i.e. it doesn't matter which number codifies a department), which means converting to the `category` dtype when reading the csv file is safe. This, however, doesn't apply to the `salary` column, where, if done by default, categories would be ordered alphabetically, codifying 'high' as 0, 'low' as 1, and so on, which scrambles the meaning of the labels. The order can be specified by declaring the variable as a `CategoricalDtype`.
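The alphabetical-ordering pitfall is easy to demonstrate on a throwaway Series (toy data, not the project's):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

s = pd.Series(['low', 'high', 'medium', 'low'])

# default conversion sorts categories alphabetically: 'high' gets code 0
default = s.astype('category')
print(list(default.cat.categories))  # ['high', 'low', 'medium']
print(list(default.cat.codes))       # [1, 0, 2, 1]

# an explicit ordered dtype keeps low < medium < high
tiers = CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
print(list(s.astype(tiers).cat.codes))  # [0, 2, 1, 0]
```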
The same step should first be taken to safely convert the `left` column to `category`: a computer can't tell whether a 'no' should be 0 or 1 (same with a 'yes'), so the column cannot be converted to boolean directly. Finally, there are variables containing survey scores and performance measures, which are stored as floats. Having explored the data structure of the original csv file, it is now safe to change the `df` column dtypes as discussed. After that, we can safely replace `left` with its category codes, where 'no' becomes 0 and 'yes' becomes 1.
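Another explicit way to resolve the yes/no ambiguity is a hand-written mapping (a minimal sketch; the cell below takes the equivalent route through category codes instead):

```python
import pandas as pd

s = pd.Series(['no', 'yes', 'yes', 'no'])

# spell out which label means True so nothing is left to guesswork
b = s.map({'no': False, 'yes': True}).astype('bool')
print(b.dtype, list(b))  # bool [False, True, True, False]
```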
These manipulations reduced memory usage from 2.3 MB to 298 KB, which is almost 8 times less.
```python
from pandas.api.types import CategoricalDtype

dtypes.update({'department': 'category',
               'salary': CategoricalDtype(categories=['low', 'medium', 'high'],
                                          ordered=True),
               'left': CategoricalDtype(categories=['no', 'yes'],
                                        ordered=True)})
df = df.astype(dtypes)
df['left'] = df['left'].cat.codes
display(df.sample(n=5))
print(df.info())
print('\nprecise memory usage after dtype optimisation:',
      (df.memory_usage(index=True, deep=True).values.sum() / 1000).round(2),
      'KB')
```