The Real Reason Why Employees are Leaving
  • AI Chat
  • Code
  • Report
  • Beta
    Spinner

    Employee Turnover

    During the current economic times, there is a significant cultural shift occurring in the workplace due to work from home and lock down orders, the pandemic, and the global supply chain shock causing an elevated inflation rate, it is critical that a company can understands why employees are leaving and how to reduce the turnover rate. Companies that struggle to maintain their valuable employees will lose invaluable experience within the company and will have to spend more resources training new hires to perform identical tasks with less efficiency as the new employees get accustomed to the new environment. Keeping this in mind, as we explore the given dataset, we will be able to reduce company costs hiring new employees, prevent company brain drain (the loss of highly skilled and high level workers), and keep the company operating at full bore. Let's Go!

    install.packages(c("ggcorrplot", "vip", "xgboost", "doParallel", "glmnet"))
    
    # Processing and visualization
    library(tidyverse)
    library(ggplot2)
    library(ggcorrplot)
    library(scales)
    library(kableExtra)
    library(vip)
    
    #Tidy Models
    library(rsample)
    library(recipes)
    library(parsnip)
    library(tune)
    library(dials)
    library(workflows)
    library(yardstick)
    
    # library(vip)
    
    df <- readr::read_csv('./data/employee_churn_data.csv')
    head(df)

    Understanding the Variables

    Turnover Rate By Department

    # 
    prop_df <- df %>%
      group_by(department, left) %>%
      summarize(Count = n()) %>%
      ungroup() %>%
      pivot_wider("department", names_from = left, values_from = Count) %>%
      mutate(turnover = percent(yes/(no + yes))) %>%
      select(-c(yes, no)) %>%
      arrange(desc(turnover))
    
    mean_df <- df %>% 
      select(department, review, avg_hrs_month, satisfaction) %>%
      group_by(department) %>%
      summarize(avg_review = mean(review),
                avg_hrs_dep_month = mean(avg_hrs_month),
                avg_satisfaction = mean(satisfaction))
    
    sal_df <- df %>%
      group_by(department, salary) %>%
      summarize(Count = n())  %>% 
      ungroup() %>%
      pivot_wider("department", names_from = salary, values_from = Count) %>%
      rowwise() %>%
      mutate(high_perc = percent(low/(low + medium + high), accuracy = 0.01),
             medium_perc = percent(medium/(low + medium + high), accuracy = 0.01),
             low_perc = percent(high/(low + medium + high),  accuracy = 0.01)) %>%
      select(-c(low, medium, high))
    
    prop_df %>%
      left_join(mean_df, on = c("departments"), keep = FALSE) %>%
      left_join(sal_df, on = c("departments"), keep = FALSE) %>%
    kbl(col.names = c("Department", "Turnover Percent", "Mean Reviews", "Mean Hours Per Month", "Mean Satisfaction", "High Salary (%)", "Medium Salary (%)", "Low Salary (%)")) %>%
    kable_material_dark(c("striped", "hover"))

    In the above table information about each department is displayed from the percentage of turnover to the percentage of each salary group by department. The IT department shows the highest turnover rate with the logistic department following close behind it. Given how close the means and proportions are to one another, it is difficult to get greater meaning from this and other clues to explore.

    # Correlation Plot using ggcorrplot
    df %>%
      select(-c(promoted, bonus)) %>%
      select_if(is.numeric) %>%
      cor() %>%
      round(4) %>%
      ggcorrplot(hc.order = FALSE, 
               type = "lower", lab = TRUE) +
      labs(x = "", y = "", title = "Correlation Matrix")

    When looking at the correlation plot, average hours worked per month is strongly correlated with tenure, almost perfectly linearly correlated and thus, one of the two presented variable will be removed in the analysis. Moreover, we see that there is a weak negative correlation between satisfaction and tenure as well as satisfaction and review. This could be due to the structure and method of the review which is possibly causing a negative emotional response such as discouragement, doubt, rejection, unappreciated, etc. In addition, we see the negative correlation between tenure and the review which could mean that employees who work there for longer periods are performing worse. But I suspect that since people who have high tenure also work more hours per month on average that these individuals are overworked and thus performing less optimally.

    df %>%
      ggplot(aes(x = tenure, y = review)) +
      geom_point(color = "#E69F00") +
      geom_smooth(method = "lm", se = FALSE) +
      labs(x = "Tenure", y = "Review", title = "Employee Tenure Review Score") +
      theme_classic()

    Further exploring the relationship between tenure and average hours worked, it can be seen that there is a negative correlation between a person's tenure and review score. Since, there is a confounding variable, average hours worked per month, it is likely that people are working harder (this could be working on difficult projects, working on more projects, etc.) for the company and are receiving negative reviews.

    df %>%
      ggplot(aes(x = avg_hrs_month, y = satisfaction, color = left)) +
      geom_point() + 
      scale_color_manual(values=c("#999999", "#E69F00")) +
      labs(x = "Average Hours Worked in a Month", y = "Satisfaction", title = "Employee Satisfaction Against Average Hours Worked Per Month") +
      theme_classic()
    
    
    df %>%
      filter(left == "yes") %>%
      ggplot(aes(x = avg_hrs_month, y = satisfaction)) +
      geom_point(color = "#E69F00") +
      geom_smooth(se = FALSE) +
      labs(x = "Average Hours Worked in a Month", y = "Satisfaction", title = "Ex-Employees Satisfaction against Mean Hours per Month")  +
      theme_classic()
    
    df %>%
      filter(left == "no") %>%
      ggplot(aes(x = avg_hrs_month, y = satisfaction)) +
      geom_point() +
    #  geom_point(color = "#999999") +
      geom_smooth(se = FALSE) +
      labs(x = "Average Hours Worked in a Month", y = "Satisfaction", title = "Current Employees Satisfaction against Mean Hours per Month") +
      theme_classic()

    In the above graphs there is an interesting occurrence happening, there are people who have greater satisfaction when working more hours and less satisfaction when working less hours. In the graph for ex-employees we see that there is a hump, so a significant amount of people enjoy working around 180 hours per month or a little more. This means that on average people do enjoy working around 40 hours per week or a little less, but people have a lower satisfaction score beyond 190 hours on average per month. In the graph for current employees the graph is another curvilinear relationship with a big decrease in employee satisfaction after 180 hours worked and then what appears to be a positive relationship beyond 190 hours worked on average per month. This is probably related to the type of work people are doing such, their individual labor-leisure curves, and life situations such as family.

    Best Predictors of Employee Turnover

    set.seed(0)
    
    df <- df %>%
      mutate(promoted = ifelse(promoted == 1, "yes", "no"),
             bonus = ifelse(bonus == 1, "yes", "no"))
    
    # Prepping the Kitchen
    df_split <- initial_split(df, prop = .8, strata = left)
    df_train <- df_split %>% 
      training()
    df_test <- df_split %>%
      testing()
    
    df_normalization_rep <- recipe(formula = left ~ ., data = df_train) %>% 
      step_corr(all_numeric(), threshold = .9) %>%
      step_normalize(all_numeric()) %>%
      step_dummy(all_nominal(), -all_outcomes()) %>%
      step_interact(~ tenure:review + satisfaction:review)
    
    ## Training Folds
    df_folds <- df_train %>%
      vfold_cv(v = 3, repeats = 2, strata = left)
    
    # Heating the Ovens and Starting the burners
    elast <- logistic_reg(penalty = 0.01,
      mixture = 0.5) %>%
      set_engine(engine = "glmnet")
    
    xg_model <- boost_tree(
      trees = 1000, 
      tree_depth = tune(), 
      min_n = tune(), 
      loss_reduction = tune(),                
      sample_size = tune(), 
      mtry = tune(),
      learn_rate = tune()) %>% 
    set_engine("xgboost") %>% 
    set_mode("classification")
    
    xgb_grid <- grid_latin_hypercube(
      tree_depth(),
      min_n(),
      loss_reduction(),
      sample_size = sample_prop(),
      finalize(mtry(), df_train),
      learn_rate(),
      size = 10
    )
    
    metrics <- metric_set(roc_auc, sens, spec)
    
    # Mise En Place
    xg_wf <- workflow() %>% 
      add_model(xg_model) %>%
      add_recipe(df_normalization_rep)
    
    glmnet_wf <- workflow() %>%
      add_model(elast) %>%
      add_recipe(df_normalization_rep)