Project: Consolidating Employee Data
  • AI Chat
  • Code
  • Report
  • Beta
    Spinner

    You just got hired as the first and only data practitioner at a small business experiencing exponential growth. The company needs more structured processes, guidelines, and standards. Your first mission is to structure the human resources data. The data is currently scattered across teams and files and comes in various formats: Excel files, CSVs, JSON files...

    You'll work with the following data in the datasets folder:

    • Office addresses are currently saved in office_addresses.csv. If the value for office is NaN, then the employee is remote.
    • Employee addresses are saved on the first tab of employee_information.xlsx.
    • Employee emergency contacts are saved on the second tab of employee_information.xlsx; this tab is called emergency_contacts. However, this sheet was edited at some point, and the headers were removed! The HR manager let you know that they should be: employee_id, last_name, first_name, emergency_contact, emergency_contact_number, and relationship.
    • Employee roles, teams, and salaries have been exported from the company's human resources management system into a JSON file titled employee_roles.json. Here are the first few lines of that file:
    {"A2R5H9": { "title": "CEO", "monthly_salary": "$4500", "team": "Leadership" }, ... }
    import pandas as pd
    # clean the addresses
    office_addr = pd.read_csv('datasets/office_addresses.csv')
    
    office_addr
    import pandas as pd
    
    # Read the excel file
    employee_info_df = pd.read_excel('datasets/employee_information.xlsx', sheet_name=0)
    
    # Get the columns related to addresses from the first sheet
    employee_df.head()
    employee_addr = employee_df[['employee_id', 'employee_country', 'employee_city', 'employee_street', 'employee_street_number']]
    
    # Merge two dataframes on employee_country and office_country
    merged_df = pd.merge(employee_addr, office_addr, left_on='employee_country', right_on='office_country', how='left')
    
    # Change any NaN values in office to "Remote"
    office_cols = [col for col in merged_df.columns if col.startswith('office')]
    merged_df[office_cols] = merged_df[office_cols].fillna('Remote')
    
    # Check the merged dataframe
    merged_df.head()
    # Read the excel file
    employee_emergency_info_df = pd.read_excel('datasets/employee_information.xlsx',
                    sheet_name=1,
                    header=None,
                    names=['employee_id', 'last_name', 'first_name', 'emergency_contact', 'emergency_contact_number', 'relationship'])
    
    # employee_emergency_info_df.head()
    merged_df = pd.merge(merged_df, employee_emergency_info_df, on='employee_id', how='left')
    merged_df
    # Read the json file
    employee_roles_df = pd.read_json('datasets/employee_roles.json')
    
    # Transpose the dataframe
    employee_roles_df = employee_roles_df.transpose()
    
    # Reset the index and rename the first column
    employee_roles_df.reset_index(inplace=True)
    employee_roles_df.rename(columns={'index': 'employee_id'}, inplace=True)
    # employee_roles_df
    merged_df = pd.merge(employee_roles_df, merged_df, on='employee_id', how='left')
    merged_df
    merged_df.index
    merged_df
    merged_df.index
    # employees_final = merged_df.set_index('employee_id')
    employees_final = merged_df[['employee_id', 'first_name', 'last_name', 'employee_country', 'employee_city', 'employee_street', 'employee_street_number', 'emergency_contact', 'emergency_contact_number', 'relationship', 'monthly_salary', 'team', 'title', 'office', 'office_country', 'office_city', 'office_street', 'office_street_number']]
    employees_final.set_index('employee_id', inplace=True)
    employees_final