David Asogwa/

Employee Network Analysis


Improving Company's Collaboration

An analysis of a company's six months of information exchange between employees and departments. The communications were between six departments (Sales, Operations, IT, Admin, Marketing and Operations) and a total of 664 employee's in all the departments. The analysis involved creating a network visualization of the messages sent by each employee from a department to another employee in the same or different department. This was to observe the collaborations between employees/departments and also find out areas to improve collaboration within the department or with another department.

Extensive analysis was performed to determine the most active department when sending and receiving messages, in which the Sales department was the most active in both scenarios with message count of 1549 and 1227 when sending and receiving messages for the six months period, respectively, while the Marketing department was the least active on both scenarios with message count of 16 and 140 when sending and receiving messages, respectively, for the period under study.

The results also showed that the employee with id 598, aside being among the top five most influential employee (including id's 128, 605 and 586) also has the most connections. Whilst the Sales department is also the most influential department, more collaborative measures should be implemented by the HR to improve collaboration in the IT, Marketing and also the Engineering departments.

The visualization of the messages sent and received per department within the six months period under study showed a huge decline as the month progressed as shown in the trend plot. More messages were shared between departments in the 6th month than in other months, while the 11th month had the least messages.

Below is a description of the dataset used for this study.

Messages has information on the sender, receiver, and time.
  • "sender" - represents the employee id of the employee sending the message.
  • "receiver" - represents the employee id of the employee receiving the message.
  • "timestamp" - the date of the message.
  • "message_length" - the length in words of the message.
Employees has information on each employee;
  • "id" - represents the employee id of the employee.
  • "department" - is the department within the company.
  • "location" - is the country where the employee lives.
  • "age" - is the age of the employee.
# Import modules
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
import networkx as nx
import warnings

%matplotlib inline
sb.set_theme(style = 'whitegrid')

# Loading datasets
messages = pd.read_csv('data/messages.csv', parse_dates = ['timestamp'])
employees = pd.read_csv('data/employees.csv')

# Copying data
messages_sent = messages.copy()
employees_data = employees.copy()

# Checking for missing values
print('Message data info:')
print('\nEmployee data info:')

The messages_sent and employees_data have no missing values as observed from the results obtained above. The columns also have appropriate data types assigned to them.

One issue arises, though trivial but important, sender and receiver in the messages_sent data both indicate the sender and receiver of the message(s) respectively, but are actually the id of both, thus the columns will be renamed to sender_id and receiver_id respectively. This is to properly communicate what the feature (column) actually contains.

# Renaming columns
messages_sent.rename(columns = {'sender':'sender_id', 'receiver':'receiver_id'}, inplace = True)
Checking for duplicates

Exploring further on duplicated values

# Subset duplicated values
duplicate_values = messages_sent.duplicated(keep = False)

From all indications, the values are duplicated and thus will be dropped. This decision was taken because a particular sender (sender_id) cannot send the same exact message with the same message length, at the same date and same exact timestamp, to the same receiver (receiver_id). In real time, there must have been a difference in the time in seconds.

# Drop duplicates
messages_sent = messages_sent.drop_duplicates()

# check

Merging Datasets

The messages_sent and employee_data will be merged using the sender_id, receiver_id and id. In order to achieve a successful merge, the dataframe's must have a common key, in this case, sender_id, receiver_id and id.

The merge will use a left join as this will filter all the messages sent by each employee using the sender_id, including their details and receiver_id too.

# Merging data frames using 'left join' on 'sender_id' and 'id'
df_x = pd.merge(messages_sent, employees_data, how = 'left', left_on = 'sender_id', right_on = 'id')

# Merging data frames using 'left join' on 'receiver_id' and 'id'
df = pd.merge(df_x, employees_data, how = 'left', left_on = 'receiver_id', right_on = 'id')

# Check for null values
print('Null values summary:\n',df.isnull().sum())

# Check for duplicated values
print('\n\nThere are {} duplicated values\n\n'.format(df.duplicated().sum()))

# First five rows of merged data frame

The following modifications will be made to make the dataframe suitable for our analysis:

  • department_x and department_y will be renamed to sender_department and receiver_department respectively
  • location_x and location_y will be renamed to sender_location and receiver_location respectively
  • age_x and age_y will be renamed to sender_age and receiver_age respectively
  • A new column 'month' will be created from the timestamp so the trend of messages sent and received by each department can be visualized
# Renaming dataframe columns:
df.rename(columns = {'department_x':'sender_department', 'department_y':'receiver_department',
                     'location_x':'sender_location', 'location_y':'receiver_location',
                     'age_x':'sender_age', 'age_y':'receiver_age'}, inplace = True)

# Extract the month form the timestamp
df['month'] = df['timestamp'].dt.month

# First 2 rows of the new data frame