Competition - high fatality accidents

    Background and problem statement understanding

    1. Major incidents are accidents with 3 or more casualties.
    2. We would like to identify the characteristics of these major incidents so that the number of deaths can be lowered.
    3. What time of day and day of the week do most major incidents happen?
    4. Are there any patterns in the time of day and day of the week when major incidents occur?
    5. Which characteristics stand out in major incidents compared with other accidents?
    6. On which areas should the planning team focus their brainstorming efforts to reduce major incidents?

    Let's start by importing the required packages!

    import pandas as pd
    import seaborn as sns
    import numpy as np
    import matplotlib.pyplot as plt
    from datetime import datetime
    
    # Set the seaborn style to darkgrid
    sns.set_theme(style="darkgrid")

    Let's load and explore the given data.

    accident_data = pd.read_csv("data/accident-data.csv")
    lookup = pd.read_csv("data/road-safety-lookups.csv")
    
    print(accident_data.info())
    accident_data.isna().sum().plot(kind='bar', figsize=(10,8))

    The data has 91199 rows and 27 columns.
    Longitude and latitude have some missing values; the other columns look complete.
    Columns that will be useful in determining timeframe - [ accident_year, date, day_of_week, time ]
    Columns that will be useful in determining the severity of accidents - [ accident_severity, number_of_casualties, number_of_vehicles ]
    Columns that will be useful to understand the road, location, weather conditions and other features - [ longitude, latitude, light_conditions, weather_conditions, road_type, speed_limit, junction_detail, road_surface_conditions, urban_or_rural_area, carriageway_hazards, special_conditions_at_site, junction_control, pedestrian_crossing_human_control, pedestrian_crossing_physical_facilities]
    Columns that we may drop - [ accident_index, first_road_class, first_road_number, second_road_class, second_road_number]
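
The column grouping above can be written down once so that later steps select from named lists instead of retyping column names. The following is a minimal sketch on a toy frame, not part of the original notebook: `toy`, `TIME_COLS`, `SEVERITY_COLS`, and `DROP_COLS` are hypothetical names, and the toy values are made up for illustration.

```python
import pandas as pd

# Column groups copied from the notes above (the list names are illustrative).
TIME_COLS = ["accident_year", "date", "day_of_week", "time"]
SEVERITY_COLS = ["accident_severity", "number_of_casualties", "number_of_vehicles"]
DROP_COLS = ["accident_index", "first_road_class", "first_road_number",
             "second_road_class", "second_road_number"]

# Toy stand-in for accident_data with only a few columns and made-up values.
toy = pd.DataFrame({
    "accident_index": ["a1", "a2"],
    "accident_year": [2020, 2020],
    "number_of_casualties": [1, 4],
    "first_road_class": [3, 6],
})

# errors="ignore" skips any listed column that is absent from the frame.
trimmed = toy.drop(columns=DROP_COLS, errors="ignore")
print(trimmed.columns.tolist())
```

If this were applied to the real data, any check that relies on accident_index (such as a duplicate check) would need to run before the drop.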

    Let's get a better understanding of the accidents.

    print(accident_data.accident_year.value_counts())
    print(len(accident_data[accident_data.duplicated(['accident_index'])]))
    
    total_accidents = accident_data.shape[0]
    
    # we know that we are dealing with 3 accident severities
    print(accident_data['accident_severity'].value_counts())  

    All incidents in the given dataset happened during the year 2020.
    There are no duplicate entries in the given dataset.
    Accidents by severity:

    1. Fatal accidents : 1391
    2. Serious Accidents : 18355
    3. Slight Accidents : 71453
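
As a quick arithmetic check, the three severity counts can be turned into shares of all accidents. This is only a sketch that re-enters the counts reported above by hand; `severity_counts` and `severity_share` are illustrative names, and in the notebook the same numbers would come from `accident_data['accident_severity'].value_counts()`.

```python
import pandas as pd

# Counts as reported above for 2020 (typed in by hand for this sketch).
severity_counts = pd.Series(
    {"Fatal": 1391, "Serious": 18355, "Slight": 71453}, name="accidents"
)

# The three counts should add up to the number of rows in the dataset.
total = int(severity_counts.sum())

# Share of each severity as a percentage of all accidents, rounded to 2 decimals.
severity_share = (severity_counts * 100 / total).round(2)

print(total)
print(severity_share)
```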

    According to the background information, we are most interested in major incidents, i.e. incidents with 3 or more casualties. Let's create a subset and call it serious_accidents. Note that this name refers to the casualty count, not to the "Serious" severity category above.

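    # .copy() gives an independent frame, so later column assignments do not raise SettingWithCopyWarning
    serious_accidents = accident_data.loc[accident_data['number_of_casualties'] >= 3].copy()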
    print(serious_accidents.shape[0], (serious_accidents.shape[0] * 100 / total_accidents))
    total_serious_accidents = serious_accidents.shape[0]
    total_non_serious_accidents = accident_data.shape[0] - total_serious_accidents

    In 2020, 4817 incidents (5.28% of all accidents) were major, meaning they involved 3 or more casualties. Before finding out when these incidents happened, we need the lookup table to translate the day-of-week codes in our subset into labels. To make this easy, let's define a function that builds a dictionary mapping each code/format value to its label for a given field.

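    def lookup_to_dict(df, field_name):
        """Return a {code: label} dictionary for one field of the lookup table."""
        # .copy() avoids modifying a slice of the original lookup frame
        lookup_dict = df.loc[df['field name'] == field_name, ['code/format', 'label']].copy()
        lookup_dict['code/format'] = lookup_dict['code/format'].astype(int)
        return lookup_dict.set_index('code/format')['label'].to_dict()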
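
To see what the helper returns, here is a small self-contained sketch that runs it on a toy lookup table. The column names ('field name', 'code/format', 'label') follow the function above; the three rows in `toy_lookup` are made up for illustration and are not taken from road-safety-lookups.csv.

```python
import pandas as pd

def lookup_to_dict(df, field_name):
    """Return a {code: label} dictionary for one field of the lookup table."""
    rows = df.loc[df["field name"] == field_name, ["code/format", "label"]].copy()
    rows["code/format"] = rows["code/format"].astype(int)
    return rows.set_index("code/format")["label"].to_dict()

# Toy lookup table; codes are stored as strings, as they often are in a CSV
# column that mixes numeric codes with format descriptions.
toy_lookup = pd.DataFrame({
    "field name": ["day_of_week", "day_of_week", "accident_severity"],
    "code/format": ["1", "2", "1"],
    "label": ["Sunday", "Monday", "Fatal"],
})

# Only the rows for the requested field end up in the dictionary.
day_map = lookup_to_dict(toy_lookup, "day_of_week")
print(day_map)

# The dictionary can then translate a column of integer codes into labels.
codes = pd.Series([2, 1, 2])
labels = codes.map(day_map)
print(labels.tolist())
```

Filtering by field name first matters because the same code (here 1) can mean different things for different fields.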

    Let's replace the day-of-week codes with labels and plot a chart to see on which days of the week major accidents happen. We also compute the percentage of major accidents that falls on each day.

    day_of_week_lookup = lookup_to_dict(lookup, 'day_of_week')
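    # Assign the result back; an in-place replace on a selected column may not update the frame
    serious_accidents['day_of_week'] = serious_accidents['day_of_week'].replace(day_of_week_lookup)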
    print(serious_accidents['day_of_week'])
    print(serious_accidents['day_of_week'].value_counts())
    print((serious_accidents['day_of_week'].value_counts() * 100 / total_serious_accidents))
    
    # Increase chart size for better readability
    sns.set(rc={'figure.figsize':(12,8)})
    sns.countplot(    
        x=serious_accidents.day_of_week, 
        data=serious_accidents, 
        order = serious_accidents.day_of_week.value_counts().index)
    plt.xlabel('Day of Week')
    plt.ylabel('Accidents Count')
    plt.title('Serious Accidents(by Day of the Week)')
    plt.show()

    The largest number of major incidents (3 or more casualties) happened on Saturday, followed by Friday and Sunday.
    Together, these three days account for about 48% (17.02 + 16.56 + 14.26) of all major incidents.

    1. Saturday = 820 Accidents (17.02% of total serious accidents)
    2. Friday = 798 Accidents (16.56% of total serious accidents)
    3. Sunday = 687 Accidents (14.26% of total serious accidents)

    The number of accidents on Monday and Tuesday is slightly lower than on the other days of the week.
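
The combined share of the three busiest days can be re-derived from the counts listed above. This is a plain arithmetic sketch with the reported counts typed in by hand; `counts`, `shares`, and `combined` are illustrative names. Individual percentages may differ from the listed ones in the last digit depending on whether values are rounded or truncated.

```python
# Counts for the three busiest days, as listed above.
counts = {"Saturday": 820, "Friday": 798, "Sunday": 687}

# Total number of accidents with 3 or more casualties, as reported earlier.
total_major = 4817

# Percentage of all major incidents that falls on each of the three days.
shares = {day: n * 100 / total_major for day, n in counts.items()}

# Combined share, computed from the summed counts to avoid stacking rounding errors.
combined = sum(counts.values()) * 100 / total_major

for day, share in shares.items():
    print(f"{day}: {share:.2f}%")
print(f"Combined: {combined:.2f}%")
```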

    Let's find out the time of day in which these major incidents happen.

    serious_accidents['time'].head()
    # The time column holds 'HH:MM' strings, so parse it, extract the hour,
    # and add the hourly bin to our subset as "time_bins".
    # right=False makes each bin left-closed, e.g. [8, 9) covers 08:00 to 08:59.
    serious_accidents['time_bins'] = pd.cut(
        x=pd.to_datetime(serious_accidents['time'], format='%H:%M').dt.hour,
        bins=list(range(0, 25)),
        right=False)
    
    print(serious_accidents['time_bins'].value_counts().head())
    print(serious_accidents['time_bins'].value_counts().head().sum())
    print(serious_accidents['time_bins'].value_counts().head().sum() * 100 / total_serious_accidents)
    
    sns.set(rc={'figure.figsize':(12,8)})
    sns.countplot(    
        x=serious_accidents.time_bins, 
        data=serious_accidents, 
        order = serious_accidents.time_bins.value_counts().index
    )
    plt.xlabel('Time of Day')
    plt.ylabel('Accidents Count')
    plt.title('Serious Accidents(by Time of Day)')
    plt.xticks(rotation=90)
    plt.show()
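
To make the hourly binning concrete, the sketch below applies the same `pd.cut` call to a handful of made-up times. The values in `times` are illustrative only; `hours` and `hour_bins` are hypothetical names. It assumes the time column holds 'HH:MM' strings, as the '%H:%M' format in the cell above suggests.

```python
import pandas as pd

# Made-up sample times in the same 'HH:MM' format as the time column.
times = pd.Series(["00:05", "08:59", "09:00", "17:30", "23:59"])

# Parse the strings and keep only the hour of day (0 to 23).
hours = pd.to_datetime(times, format="%H:%M").dt.hour

# 25 edges (0..24) give 24 one-hour bins; right=False makes each bin
# left-closed, so 09:00 falls into [9, 10) rather than [8, 9).
hour_bins = pd.cut(hours, bins=list(range(0, 25)), right=False)

# Left edge of the bin assigned to each sample time.
left_edges = [interval.left for interval in hour_bins]
print(left_edges)
```

With `right=True` (the default), hour 0 would fall outside the first bin (0, 1] and become missing, which is why the flag is turned off here.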