Isaac Awotwe

Can you help reduce employee turnover?

📖 Background

You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company.

The team needs to better understand the situation: which employees are more likely to leave, and why. Once it is clear which variables affect employee churn, you can present your findings along with your ideas on how to attack the problem.

💾 The data

The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.

  • "department" - the department the employee belongs to.
  • "promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
  • "review" - the composite score the employee received in their last evaluation.
  • "projects" - how many projects the employee is involved in.
  • "salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
  • "tenure" - how many years the employee has been at the company.
  • "satisfaction" - a measure of employee satisfaction from surveys.
  • "avg_hrs_month" - the average hours the employee worked in a month.
  • "left" - "yes" if the employee ended up leaving, "no" otherwise.

1. Inspecting the Data

# Import relevant packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import PercentFormatter

# Read in the data and see the first five rows
df = pd.read_csv('./data/employee_churn_data.csv')
df.head()

As we can see from our first glance at the data, the dataset has a mixture of numerical and non-numerical features. This can be fixed with some preprocessing, but before we do that, let's learn about the dataset a bit more to see if there are other dataset issues that need to be fixed.

#Check for data types and missing values
df.info()

The dataframe has 9,540 rows (employees) and 10 columns. The columns have the correct data types and no missing values.
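
The facts read off `df.info()` can also be checked explicitly. The following is a minimal sketch on a small made-up dataframe (`toy` and its values are illustrative assumptions, not rows from the real file); one missing value is planted on purpose so that the check has something to report. On the real data, the same two calls would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataframe; the values are made up for illustration.
toy = pd.DataFrame({
    "department": ["sales", "IT", "sales"],
    "review": [0.58, np.nan, 0.72],   # one missing value planted on purpose
    "tenure": [5.0, 6.0, 6.0],
})

# Number of rows and columns
n_rows, n_cols = toy.shape
print(f"{n_rows} rows, {n_cols} columns")

# Count of missing values per column; show only columns that have any
missing_per_column = toy.isna().sum()
print(missing_per_column[missing_per_column > 0])
```

If the second print shows an empty Series, no column has missing values, which is the situation described above for the real dataset.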

# Inspect the unique values of the non-float (integer and object) columns
nonfloat_columns = [column for column in df.columns if df[column].dtype in ['int64', 'object']]
for column in nonfloat_columns:
    print(column + ":", df[column].unique())

Object and integer type columns seem to contain the correct range of values. Our data looks good, and ready for analysis.

The dataset also contains a variable, "bonus", that is not mentioned in the data description. We will assume that "bonus" takes a value of 1 if the employee received a bonus in the last 24 months and 0 otherwise, in keeping with how variables of a similar nature were described.
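
The 0/1 coding that this assumption relies on can be checked in code, although the 24-month window itself cannot be verified from the data alone. Here is a minimal sketch on a made-up dataframe (`toy`, `flag_columns`, and the values are illustrative assumptions); on the real data, the same check would be run against `df`.

```python
import pandas as pd

# Toy stand-in for the real dataframe; the values are made up for illustration.
toy = pd.DataFrame({
    "promoted": [0, 1, 0, 0],
    "bonus": [0, 0, 1, 1],
    "left": ["no", "yes", "no", "yes"],
})

# A 0/1 indicator column should contain no values outside {0, 1}.
flag_columns = ["promoted", "bonus"]
is_binary = {col: bool(toy[col].isin([0, 1]).all()) for col in flag_columns}
print(is_binary)

# The mean of a 0/1 column is the share of employees flagged in that column.
print(toy[flag_columns].mean())
```

A `False` for either column would mean the indicator interpretation does not hold and the column needs a closer look.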

2. Further Data Exploration

In this section we view the data holistically using summary statistics and charts. We first compute summary statistics for the centre and spread of the variables. We then draw a scatter matrix for the continuous numeric variables, showing their distributions and the relationships between them. Lastly, we generate countplots that visualize the frequency distributions of the categorical variables.

2.1 Summary Statistics

#generate and inspect summary statistics
df.describe(include="all", percentiles=[0.5]).transpose()

One key takeaway from the descriptive statistics is that the data is imbalanced in terms of turnover, as might be expected. Of the 9,540 employees, 2,784 (about 29%) ended up leaving the organization.
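
The turnover share can be reproduced from the two counts quoted above. This is a minimal arithmetic sketch that uses only those figures; on the real dataframe, `df["left"].value_counts(normalize=True)` should give the same proportions, assuming the "left" column holds "yes"/"no" as described.

```python
# Figures quoted in the text above
total_employees = 9540
leavers = 2784

stayers = total_employees - leavers
turnover_rate = leavers / total_employees

print(f"Stayed: {stayers} ({1 - turnover_rate:.1%})")
print(f"Left:   {leavers} ({turnover_rate:.1%})")
```

Roughly seven in ten employees stayed, so any model or comparison built on "left" should account for this class imbalance.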
