Dr. Stripplot or: How I Learned to Stop Worrying and Understand the Churn

    Context

    The scenario casts us as employees in the human capital department of a large corporation. The Board is worried about the relatively high turnover, and our team must look into ways to reduce the number of employees leaving the company.

    The team needs to understand the situation better: which employees are more likely to leave, and why. The aim is to make clear what variables impact employee churn, and to present findings along with ideas on how to attack the problem.

    Data structure

    The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.

    variable       description
    department     the department the employee belongs to.
    promoted       1 if the employee was promoted in the previous 24 months, 0 otherwise.
    review         the composite score the employee received in their last evaluation.
    projects       how many projects the employee is involved in.
    salary         for confidentiality reasons, salary comes in three tiers: low, medium, high.
    tenure         how many years the employee has been at the company.
    satisfaction   a measure of employee satisfaction from surveys.
    avg_hrs_month  the average hours the employee worked in a month.
    left           "yes" if the employee ended up leaving, "no" otherwise.

    Objectives

    1. Which department has the highest employee turnover? Which one has the lowest?
    2. Investigate which variables seem to be better predictors of employee departure.
    3. What recommendations would you make regarding ways to reduce employee turnover?

    Glossary

    Before proceeding to the analysis, it sometimes takes a little extra effort to find terms that are both brief and clear. Even setting aside scientific vocabulary, picking a suitable pair of words to name binary classes can be tricky. The target variable in this study is the left column, whose only values are yes and no; these are uninformative and make poor class labels. No single word describing an employee who has left a company came to mind quickly, so, after consulting this article, I decided to use former employee (or simply former) for yes, and current employee (or current) for no.
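    As a minimal sketch of that relabeling (the toy Series and the mapping dict here are my illustration, not part of the provided code), the yes/no values can be mapped to the glossary terms for more readable plots and tables:

```python
import pandas as pd

# a toy Series standing in for the 'left' column
left = pd.Series(['yes', 'no', 'no', 'yes'])

# map the raw survey values to the glossary terms
labels = left.map({'yes': 'former', 'no': 'current'})
print(labels.tolist())  # ['former', 'current', 'current', 'former']
```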

    Workspace setup

    import os
    os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'
    %matplotlib inline
    
    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt
    # Set up two decimal format for floats in pandas objects
    pd.set_option('display.float_format', 
                  '{0:.2f}'.format)
    np.set_printoptions(precision=3, suppress=True)
    
    # matplotlib rcParams for context
    context = {'font.size': 11.0,
               'axes.labelsize': 'large',
               'axes.titlesize': 'large',
               'axes.linewidth': 0.1,
               'xtick.labelsize': 'large',
               'ytick.labelsize': 'large',
               'grid.linewidth': 0.1,
               'patch.linewidth': 0.1,
               'legend.fontsize': 'large',
               'legend.title_fontsize': 'large'}
    
    # matplotlib rcParams for style
    style = {'axes.facecolor': 'white',
             'axes.edgecolor': 'black',
             'axes.labelcolor': 'black',
             'figure.facecolor': 'white',
             'grid.color': 'gray',
             'grid.linestyle': '--',
             'text.color': 'black',
             'xtick.color': 'gray',
             'ytick.color': 'gray',
             'xtick.direction': 'out',
             'ytick.direction': 'out',
             'axes.grid': True,
             'axes.spines.right': False,
             'axes.spines.top': False,
             'font.family': ['sans-serif'],
             'font.sans-serif': ['Liberation Sans']}
    
    # Set up global rcParams
    sns.set_theme(context=context,
                  style=style)

    Feature engineering

    Read the dataset

    We start by reading employee_churn_data.csv, which has been provided by the team, into a pandas DataFrame: df. It appears that df has a number of weighty object columns, as well as some variables in int64 and float64. As a result, df uses around 2.3 MB of memory, which leaves room for optimisation. First, we can save some memory by downcasting int64 to int8: the numbers in these columns are strictly single-digit, so an 8-bit integer is more than enough.

    # create filepath
    filepath = r'data/employee_churn_data.csv'
    
    # read the .csv file into DataFrame
    df = pd.read_csv(filepath)
    
    # display 5 random rows from df
    display(df.sample(n=5))
    
    # evaluate memory usage in kB
    print('\ninitial dataset precise memory usage:', 
          (df.memory_usage(index=True, deep=True).values.sum() / 1000).round(2), 
          'KB\n')
    dtypes = {'promoted': 'int8',
              'projects': 'int8',
              'tenure': 'int8',
              'bonus': 'int8'}
    
    df = pd.read_csv(filepath, dtype=dtypes)
    
    print(df.info())
    
    print('\nprecise memory usage with int8:', 
          (df.memory_usage(index=True, deep=True).values.sum() / 1000).round(2), 
          'KB')

    Dtypes revision

    Converting to int8 helped reduce memory usage by around 300 KB, which is only a small part of the possible optimisation. Some of the int8 variables hold simple 0/1 binary values and could easily be unified as bool, and thus treated as categorical. Other int8 variables representing discrete features can also be treated as categoricals, since they have very few unique values: for example, the number of years at the company, or the number of projects an employee participates in. In projects, employees with 1 to 3 current projects form three distinct groups. In tenure, each number of years at the company forms a separate group of employees.
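    That low-cardinality argument is easy to check with nunique(); the toy frame below is my illustration, standing in for the int8 columns discussed above:

```python
import pandas as pd

# toy frame mimicking the discrete int8 columns
df_toy = pd.DataFrame({'projects': [1, 2, 3, 2, 1],
                       'tenure':   [2, 5, 7, 5, 2],
                       'promoted': [0, 1, 0, 0, 1]})

# a handful of unique values per column justifies a categorical treatment
print(df_toy.nunique())
```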

    From the description and a look at the initial data structure, it became clear what the object columns should be converted to: category variables, when declared with proper category names and order, are human-readable, suit presentation better, and can be codified for easier computing. The department column contains department labels, and these categories are not ordinal (i.e. it doesn't matter which number codifies a department), which means converting to the category dtype while reading the csv file is safe. This, however, doesn't apply to the salary column, where, by default, categories would be ordered alphabetically, codifying 'high' as 0, 'low' as 1, and so on, which scrambles the meaning of the labels. The order can be specified by declaring the variable with the CategoricalDtype class.
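    The alphabetical-ordering pitfall can be demonstrated on a toy salary Series (my illustration, not the project data):

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

salary = pd.Series(['low', 'high', 'medium'])

# default conversion sorts categories alphabetically: high < low < medium
default_codes = salary.astype('category').cat.codes.tolist()

# declaring the order explicitly keeps low < medium < high
ordered = CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
explicit_codes = salary.astype(ordered).cat.codes.tolist()

print(default_codes)   # [1, 0, 2] -- 'high' got code 0
print(explicit_codes)  # [0, 2, 1] -- 'low' is now the lowest tier
```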

    The same step should be taken to safely convert the left column to category: a computer cannot tell whether a 'no' should be 0 or 1 (likewise for a 'yes'), so the column cannot be converted to boolean directly. Finally, there are variables containing survey scores and performance measures that are stored as floats. Having explored the structure of the original csv file, it is now safe to change the df column dtypes as discussed, and afterwards to convert the left column to bool.

    These manipulations helped reduce memory usage from 2.3 MB to 298 KB, which is 7.5 times less.

    from pandas.api.types import CategoricalDtype
    
    dtypes.update({'department': 'category',
                   'salary': CategoricalDtype(categories=['low', 'medium', 'high'],
                                              ordered=True),
                   'left': CategoricalDtype(categories=['no', 'yes'],
                                            ordered=True)})
    
    df = df.astype(dtypes)
    # codify 'no'/'yes' as 0/1, then cast to bool as discussed above
    df['left'] = df['left'].cat.codes.astype(bool)
    
    display(df.sample(n=5))
    print(df.info())
    print('\nprecise memory usage after dtype optimisation:', 
          (df.memory_usage(index=True, deep=True).values.sum() / 1000).round(2), 
          'KB')