Cracking the Turnover Analysis (Analysis and Modeling)

    Can you help reduce employee turnover?

    📖 Background

    You work for the human capital department of a large corporation. The Board is worried about the relatively high turnover, and your team must look into ways to reduce the number of employees leaving the company.

    The team needs to better understand the situation: which employees are more likely to leave, and why. Once it is clear which variables affect employee churn, you can present your findings along with your ideas on how to attack the problem.

    💾 The data

    The department has assembled data on almost 10,000 employees. The team used information from exit interviews, performance reviews, and employee records.

    • "department" - the department the employee belongs to.
    • "promoted" - 1 if the employee was promoted in the previous 24 months, 0 otherwise.
    • "review" - the composite score the employee received in their last evaluation.
    • "projects" - how many projects the employee is involved in.
    • "salary" - for confidentiality reasons, salary comes in three tiers: low, medium, high.
    • "tenure" - how many years the employee has been at the company.
    • "satisfaction" - a measure of employee satisfaction from surveys.
    • "avg_hrs_month" - the average hours the employee worked in a month.
    • "left" - "yes" if the employee ended up leaving, "no" otherwise.
    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    from sklearn.model_selection import train_test_split
    from sklearn import datasets
    from sklearn import metrics
    from sklearn.ensemble import ExtraTreesClassifier
    from sklearn.metrics import balanced_accuracy_score
    from scipy.stats import chi2_contingency
    
    from sklearn import tree
    from sklearn.tree import DecisionTreeClassifier
    import numpy as npp
    
    df = pd.read_csv('./data/employee_churn_data.csv')
    df.head()
    
    # Encode the target as 0/1 (run once: re-running would map the already numeric values to NaN)
    df["left"] = df["left"].map({"no": 0, "yes": 1})
    df

    💪 Competition challenge

    Create a report that covers the following:

    1. Which department has the highest employee turnover? Which one has the lowest?
    2. Investigate which variables seem to be better predictors of employee departure.
    3. What recommendations would you make regarding ways to reduce employee turnover?

    🧑‍⚖️ Judging criteria

    CATEGORY (WEIGHTING): DETAILS
    Recommendations (35%):
    • Clarity of recommendations - how clear and well presented the recommendation is.
    • Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid?
    • Number of relevant insights found for the target audience.
    Storytelling (35%):
    • How well the data and insights are connected to the recommendation.
    • How the narrative and whole report connects together.
    • Balancing making the report in-depth enough but also concise.
    Visualizations (20%):
    • Appropriateness of visualization used.
    • Clarity of insight from visualization.
    Votes (10%):
    • Up voting - most upvoted entries get the most points.

    ✅ Checklist before publishing into the competition

    • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
    • Remove redundant cells like the judging criteria, so the workbook is focused on your story.
    • Make sure the workbook reads well and explains how you found your insights.
    • Check that all the cells run without error.

    ⌛️ Time is ticking. Good luck!

    from scipy import stats
    
    def calc_binary_corr(feature: str, df) -> pd.DataFrame:
        """Calculate the point-biserial correlation (quantitative vs. binary).

        Args:
            feature: name of the binary column used for the correlation
            df: DataFrame holding the columns; every column listed in `features`
                must already be numeric (salary mapped to 1/2/3, left to 0/1)
        Returns:
            data_co: DataFrame with one correlation value per quantitative feature
        """
        features = ['review', 'projects', 'tenure', 'satisfaction', 'bonus', 'avg_hrs_month', 'salary', 'left']
        correlations = []

        for i in features:
            # pointbiserialr returns (correlation, p-value); keep only the correlation
            r, _ = stats.pointbiserialr(df[i], df[feature])
            correlations.append(r)

        data_co = pd.DataFrame({"feature": features, "correlations": correlations})
        return data_co
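    def cat_segnificance():
        """Chi-square test of independence between `left` and each categorical feature.

        Reads the global `df`. Returns a DataFrame with one row per feature: the
        p-value sits in the significant column when p <= 0.05, otherwise in the
        non-significant column.
        """
        alpha = 0.05
        seg = []
        non = []
        features = ['department', 'promoted', 'tenure', 'bonus', 'projects', 'salary']
        for i in features:
            # Contingency table of leavers vs. stayers for each level of the feature
            CrosstabResult = pd.crosstab(index = df['left'], columns = df[i])
            stat, p, dof, expected = chi2_contingency(CrosstabResult)

            if p <= alpha:
                seg.append(p)
                non.append(None)
            else:
                seg.append(None)
                non.append(p)

        return pd.DataFrame({"Significant": seg, "Non Significant": non, "col" : features})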
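    def plot_category(var, x_shift = 0.05, hei = 0.02):
        """Bar chart of the percentage of employees who left, per level of `var`."""
        # Leavers per level divided by headcount per level, expressed in percent
        new = (100 * df.groupby(["left", var])[var].count()/df.groupby(var)[var].count()).reset_index(name = "_")
        ax = new[new.left == 1][[var, "_"]].plot(x = var, kind = "bar", width = 0.8, figsize = (10,5))
        # hei is given in fraction units, so scale it to the percent axis
        plot_legends(ax, 'Percentage of Leaving Employees', var, x_shift, hei * 100, string = "%")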
    def plot_legends(ax, title, x_label, x_shift = 0.09, hei = 20, string = "", x = False):
        """Sets titles and removes unnecessary parts of the graph to keep a high data-ink ratio
        Args:
            ax: The axis instance to be plotted
            title: the title of the plot
            x_label: the label of the x axis
            x_shift: the amount by which the annotation is shifted along the x axis
            hei: the amount by which the annotation is shifted along the y axis
            string: suffix appended to every annotation (for example "%")
            x: if falsy (the default), hide the left spine and the y ticks
        """
        # Add title and remove borders
        ax.set_title(title)
        ax.set_xlabel(x_label)
        if not x:
            ax.spines["left"].set_visible(False)
            ax.set_yticks([])
        ax.spines["right"].set_visible(False)
        ax.spines["top"].set_visible(False)

        # Annotate each bar with its rounded height, slightly above the top of the bar
        for p in ax.patches:
            ax.annotate(str(round(p.get_height(), 2)) + string, (p.get_x() + x_shift, p.get_height() + hei), fontsize = 14)
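The point-biserial correlation used in calc_binary_corr is the Pearson correlation computed with one 0/1 variable. The sketch below illustrates this on synthetic data; the variable names and the simulated effect size are assumptions made for illustration and are not taken from the employee data.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (hypothetical values, not the employee file)
rng = np.random.default_rng(0)
left = rng.integers(0, 2, size=200)                              # binary 0/1 flag
satisfaction = 0.6 - 0.2 * left + rng.normal(0, 0.1, size=200)   # lower for leavers

# Both results unpack as (correlation, p-value)
r_pb, p_pb = stats.pointbiserialr(left, satisfaction)
r_pearson, _ = stats.pearsonr(left, satisfaction)

print(f"point-biserial r = {r_pb:.3f}, p-value = {p_pb:.3g}")
print(f"Pearson r        = {r_pearson:.3f}")
```

A negative value here means the quantitative variable tends to be lower when the flag is 1, which is how the sign should be read for `left`.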

    Introduction

    Many reasons can push an employee to leave the company, and each departure raises the company's turnover. These reasons might include:

    • Bad management or under-recognition from the manager (promotion, bonus, bad team)
    • Salary that does not match seniority level
    • Too many working hours, or too many projects for one person
    • Employee dissatisfaction with the company

    All of these reasons fall under three groups of questions:

    1. Department-related turnover
    2. Working style
    3. Satisfaction based on (satisfaction feature, salar

    In the report that follows, we check every single reason independently and in combination, and verify each one statistically. Let's start our series of hypotheses.

    General Summary:
    • The turnover rate is high, about 30% on average.
    • Each employee averages more than 170 working hours per month. Counting a month as four working weeks (160 hours), this means more than 8 hours a day, 5 days a week.
    • The average satisfaction score is low across all departments, about 0.5.
    • The average review score is generally not high, about 0.65.
    • Review scores are higher at tenure levels 2, 3 and 4.
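The headline figures in this summary are plain column means once `left` is encoded as 0/1. A minimal sketch follows, using a small made-up frame in place of the real file; the helper name `headline_figures` and all toy values are hypothetical.

```python
import pandas as pd

# Tiny synthetic stand-in for employee_churn_data.csv (values are made up)
toy = pd.DataFrame({
    "left": [0, 1, 0, 0, 1, 0, 1, 0, 0, 0],
    "avg_hrs_month": [180, 188, 176, 183, 190, 179, 186, 182, 185, 181],
    "satisfaction": [0.55, 0.40, 0.60, 0.50, 0.35, 0.58, 0.45, 0.52, 0.49, 0.56],
    "review": [0.62, 0.70, 0.60, 0.66, 0.72, 0.61, 0.69, 0.64, 0.65, 0.63],
    "tenure": [2, 3, 5, 4, 3, 6, 2, 7, 4, 5],
})

def headline_figures(frame: pd.DataFrame) -> pd.Series:
    """Return the turnover rate and the column means quoted in a summary."""
    return pd.Series({
        "turnover_rate": frame["left"].mean(),       # share of rows with left == 1
        "mean_hours": frame["avg_hrs_month"].mean(),
        "mean_satisfaction": frame["satisfaction"].mean(),
        "mean_review": frame["review"].mean(),
    })

summary = headline_figures(toy)
# Average review score per tenure level, to compare levels with each other
review_by_tenure = toy.groupby("tenure")["review"].mean()

print(summary.round(3))
print(review_by_tenure.round(3))
```

Passing the real `df` instead of `toy` should give the corresponding figures for the full data set, provided `left` has already been mapped to 0/1.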
    Actions and Synthesis:
    • A machine learning model can be used to predict employee churn and so address the problem before it happens.
    • Three main factors play an important role in combination: average hours, satisfaction and review.
    • The company should take measures to increase satisfaction levels.
    • Satisfaction should be broken down into team satisfaction, workplace satisfaction and salary satisfaction for future use.
    • Working hours should be restricted, as higher turnover is observed between 185 and 190 average monthly working hours regardless of other features.
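One way to check the working-hours observation is to bin `avg_hrs_month` and compare the share of leavers per band. This is a sketch under assumptions: the band edges, the helper name `turnover_by_hours` and the toy values are all hypothetical, and the real notebook may have examined the pattern differently.

```python
import pandas as pd

# Synthetic stand-in data (made-up values)
toy = pd.DataFrame({
    "avg_hrs_month": [172, 178, 181, 184, 186, 187, 189, 189.5, 192, 196],
    "left":          [0,   0,   0,   1,   1,   1,   0,   1,     0,   1],
})

def turnover_by_hours(frame: pd.DataFrame, edges: list) -> pd.DataFrame:
    """Share of leavers and headcount per band of average monthly hours."""
    # right=False makes each band closed on the left: [170, 175), [175, 180), ...
    bands = pd.cut(frame["avg_hrs_month"], bins=edges, right=False)
    return frame.groupby(bands, observed=False)["left"].agg(rate="mean", n="size")

by_band = turnover_by_hours(toy, edges=[170, 175, 180, 185, 190, 195, 200])
print(by_band)
```

Reporting the headcount `n` next to the rate helps avoid over-reading a band that contains only a handful of employees.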

    H1: Is the turnover dependent on department?

    If a team is not good, satisfaction should be low, so here I explore whether some departments tend to have lower satisfaction than others.

    H1 Summary:

    • There doesn't seem to be a relation between satisfaction and department, and turnover rates do not differ significantly between departments either: the turnover percentage stays between 27 and 31 across all departments.

    • All departments seem to share the same problem, which needs to be investigated further.

    • Department is not statistically significant for the turnover rate.
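The department comparison can be sketched as a turnover rate per department followed by a chi-square test of independence. The toy frame below is invented for illustration (three departments, made-up counts), so its numbers say nothing about the real data.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic stand-in data: 10 employees per department, made-up outcomes
toy = pd.DataFrame({
    "department": ["sales"] * 10 + ["IT"] * 10 + ["support"] * 10,
    "left": [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
          + [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
          + [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
})

# Turnover rate per department, highest first
rates = toy.groupby("department")["left"].mean().sort_values(ascending=False)

# Contingency table (department x left) and chi-square test of independence
table = pd.crosstab(toy["department"], toy["left"])
chi2, p_value, dof, expected = chi2_contingency(table)

print(rates)
print(f"chi2 = {chi2:.3f}, dof = {dof}, p-value = {p_value:.3f}")
```

A p-value above the chosen threshold (0.05 in cat_segnificance) means the test does not detect a dependence between department and leaving; it does not prove the rates are identical.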

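    # One-hot encode department (`data` keeps the original salary labels)
    data = pd.get_dummies(df, columns = ["department"])
    # Encode the ordered salary tiers as 1 < 2 < 3
    df["salary"] = df["salary"].map({"low": 1, "medium": 2, "high": 3})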