Agness Manga/

Project: Consolidating Employee Data


You just got hired as the first and only data practitioner at a small business experiencing exponential growth. The company needs more structured processes, guidelines, and standards. Your first mission is to structure the human resources data. The data is currently scattered across teams and files and comes in various formats: Excel files, CSVs, JSON files...

You'll work with the following data in the datasets folder:

  • Office addresses are currently saved in office_addresses.csv. If the value for office is NaN, then the employee is remote.
  • Employee addresses are saved on the first tab of employee_information.xlsx.
  • Employee emergency contacts are saved on the second tab of employee_information.xlsx; this tab is called emergency_contacts. However, this sheet was edited at some point, and the headers were removed! The HR manager let you know that they should be: employee_id, last_name, first_name, emergency_contact, emergency_contact_number, and relationship.
  • Employee roles, teams, and salaries have been exported from the company's human resources management system into a JSON file titled employee_roles.json. Here are the first few lines of that file:
{"A2R5H9": { "title": "CEO", "monthly_salary": "$4500", "team": "Leadership" }, ... }

Task Instructions

  • Create a single DataFrame called employees_final containing: Index: employee_id,

Columns: first_name, last_name, employee_country, employee_city, employee_street, employee_street_number, emergency_contact, emergency_contact_number, relationship, monthly_salary, team, title, office, office_country, office_city, office_street, office_street_number.

  • Change any missing values in column names starting with office to the word "Remote".
import pandas as pd
# Start coding here... 
#Import office addresses file 
office_addresses = pd.read_csv('datasets/office_addresses.csv')
#Check the contents of the dataframe

Read Excel

For merge purposes, i assumed that an employee lives in the City where the office city is located as well, it only makes sense right?

I noticed that the value of office city for US is 'New York City' in the office_addresses table but it's 'New-York' in the employee address table, so i had to fix the value in this table since it will be merged on

#Import employee addresses from the first worksheet of employee_information.xlsx
employee_addresses=pd.read_excel('datasets/employee_information.xlsx', sheet_name=0)

#Set the name of employee city to match the one in office city
employee_addresses.loc[employee_addresses['employee_city'] == 'New-York', 'employee_city'] = 'New York City'

# Import employees emergency contacts from the second worksheet of employee_information.xlsx
column_names = ["employee_id", "last_name", "first_name", "emergency_contact", "emergency_contact_number", "relationship"]

sheet= 'emergency_contacts'

employee_emergency_contacts= pd.read_excel('datasets/employee_information.xlsx', sheet_name=sheet, header=None, names=column_names)



#Read the json file
employee_roles = pd.read_json('datasets/employee_roles.json', orient='index')

#Make a column(employee_id) out of the index
employee_roles.rename(columns={'index': 'employee_id'}, inplace=True)

# Check the contents of the file

Merge Dataframes

# merge the tables
#Merge employee adresses with employee emergency contacts
addresses_emergency_contacts = employee_addresses.merge(employee_emergency_contacts, how='left', on='employee_id')

#Merge the result of the merge above with employee roles table
addresses_emergency_role= addresses_emergency_contacts.merge(employee_roles, how='left', on='employee_id')

#Merge the resulting df from above with office addresses
merged_dfs = addresses_emergency_role.merge(office_addresses, how='left', left_on='employee_country', right_on='office_country')

#Check the contents of the merged datasets

Filter required Columns

The last action after merging is to get the required dataframe with selected columns.

#create a list of the columns i want the final dataframe to have
columns = ['employee_id',
    'first_name', 'last_name', 'employee_country', 'employee_city', 'employee_street',
    'employee_street_number', 'emergency_contact', 'emergency_contact_number',
    'relationship', 'monthly_salary', 'team', 'title', 'office', 'office_country',
    'office_city', 'office_street', 'office_street_number'

#Filter the merged dataframe on the columns
employees_final= merged_dfs[columns]

#Set the employee id column as the index for rows
employees_final.set_index('employee_id', inplace=True)

#Check for any missing values in the column names starting with 'office'
office_columns = ['office', 'office_country', 'office_city', 'office_street', 'office_street_number']
#Returns 1 row with missing values

#Fill missing values in columns starting with 'office' with 'Remote'
employees_final[office_columns] = employees_final[office_columns].fillna('Remote')

#Check the final  dataframe
  • AI Chat
  • Code