Maximiliano Castelli/

Competition - high fatality accidents-MRC


Reducing the number of high fatality accidents

📖 Background

You work for the road safety team within the department of transport and are looking into how they can reduce the number of major incidents. The safety team classes major incidents as fatal accidents involving 3+ casualties. They are trying to learn more about the characteristics of these major incidents so they can brainstorm interventions that could lower the number of deaths. They have asked for your assistance with answering a number of questions.

💾 The data

The reporting department have been collecting data on every accident that is reported. They've included this along with a lookup file for 2020's accidents.

Published by the department for transport. Contains public sector information licensed under the Open Government Licence v3.0.

import pandas as pd
accidents = pd.read_csv(r'./data/accident-data.csv')

lookup = pd.read_csv(r'./data/road-safety-lookups.csv')


#lookup[lookup['field name'].isin(['accident_severity'])][['code/format','label']]
#lookup[lookup['field name'].isin(['day_of_week'])]
#lookup[lookup['field name'].isin(['day_of_week'])][['code/format','label']]
#lookup[lookup['field name'].isin(['urban_or_rural_area'])]
#lookup[lookup['field name'].isin(['pedestrian_crossing_human_control'])]
#lookup[lookup['field name'].isin(['light_conditions'])]
#lookup[lookup['field name'].isin(['first_road_class'])]
#lookup[lookup['field name'].isin(['junction_control'])]
#lookup[lookup['field name'].isin(['second_road_class'])]
#lookup[lookup['field name'].isin(['pedestrian_crossing_physical_facilities'])]
#lookup[lookup['field name'].isin(['road_type'])]
#lookup[lookup['field name'].isin(['road_surface_conditions'])]
#lookup[lookup['field name'].isin(['carriageway_hazards'])]
#lookup[lookup['field name'].isin(['special_conditions_at_site'])]
#lookup[lookup['field name'].isin(['weather_conditions'])]
#lookup[lookup['field name'].isin(['junction_detail'])]

#print(lookup[lookup['field name'].isin(['second_road_number'])]['label'])

#lookup['field name'].value_counts()

💪 Competition challenge

Create a report that covers the following:

  1. What time of day and day of the week do most major incidents happen?
  2. Are there any patterns in the time of day/ day of the week when major incidents occur?
  3. What characteristics stand out in major incidents compared with other accidents?
  4. On what areas would you recommend the planning team focus their brainstorming efforts to reduce major incidents?

🧑‍⚖️ Judging criteria

  • Clarity of recommendations - how clear and well presented the recommendation is.
  • Quality of recommendations - are appropriate analytical techniques used & are the conclusions valid?
  • Number of relevant insights found for the target audience.
  • How well the data and insights are connected to the recommendation.
  • How the narrative and whole report connects together.
  • Balancing making the report in depth enough but also concise.
  • Appropriateness of visualization used.
  • Clarity of insight from visualization.
  • Up voting - most upvoted entries get the most points.

✅ Checklist before publishing into the competition

  • Rename your workspace to make it descriptive of your work. N.B. you should leave the notebook name as notebook.ipynb.
  • Remove redundant cells like the judging criteria so the workbook is focused on your story.
  • Make sure the workbook reads well and explains how you found your insights.
  • Check that all the cells run without error.

⌛️ Time is ticking. Good luck!

import pandas as pd
import numpy as np
import seaborn as sns  
import matplotlib.pyplot as plt
from datetime import datetime

#My User Defined Functions
def addLookupField(df,lk_fieldname,lk_newfieldname):
    '''Gets corresponding mappings from lookup Dataframe and attach its description as a new lk_ field '''
    categ= lookup[lookup['field name'].isin([lk_fieldname])][['code/format','label']]
    categ['code/format'] = categ['code/format'].astype(int)
    categ['label'] = categ['label'].astype(str)

    categ_ind = categ.set_index('code/format')
    mapping_lbl = categ_ind['label'].to_dict()

    acc_copy[lk_newfieldname] = acc_copy[lk_fieldname].replace(mapping_lbl)
    return df

accidents = pd.read_csv(r'./data/accident-data.csv')
lookup = pd.read_csv(r'./data/road-safety-lookups.csv')

#get a safe copy of original's dataframe
acc_copy = accidents.copy(deep=True)

#get all description lookup as new fields
acc_copy = addLookupField(acc_copy,'accident_severity','lk_severity')
acc_copy = addLookupField(acc_copy,'day_of_week','lk_weekday')
acc_copy = addLookupField(acc_copy,'junction_detail','lk_junction_detail')
acc_copy = addLookupField(acc_copy,'weather_conditions','lk_weather_conditions')
acc_copy = addLookupField(acc_copy,'special_conditions_at_site','lk_special_conditions_at_site')
acc_copy = addLookupField(acc_copy,'carriageway_hazards','lk_carriageway_hazards')
acc_copy = addLookupField(acc_copy,'road_surface_conditions','lk_road_surface_conditions')
acc_copy = addLookupField(acc_copy,'road_type','lk_road_type')
acc_copy = addLookupField(acc_copy,'pedestrian_crossing_physical_facilities','lk_pedestrian_crossing_physical_facilities')
acc_copy = addLookupField(acc_copy,'second_road_class','lk_second_road_class')
acc_copy = addLookupField(acc_copy,'junction_control','lk_junction_control')
acc_copy = addLookupField(acc_copy,'first_road_class','lk_first_road_class')
acc_copy = addLookupField(acc_copy,'light_conditions','lk_light_conditions')
acc_copy = addLookupField(acc_copy,'pedestrian_crossing_human_control','lk_pedestrian_crossing_human_control')
acc_copy = addLookupField(acc_copy,'urban_or_rural_area','lk_urban_or_rural_area')
#acc_copy = addLookupField(acc_copy,'speed_limit','lk_speed_limit')

#add to Dataframe all related time fields that are relevant data to analyze further on
acc_copy['etime'] = pd.to_datetime(acc_copy['date']+' '+acc_copy['time'])
acc_copy['htime'] = acc_copy['etime'].dt.strftime('%H')
acc_copy['month'] = acc_copy['etime'].dt.strftime('%B')
acc_copy['nmonth'] = acc_copy['etime'].dt.strftime('%m')
acc_copy['smonth'] = acc_copy['etime'].dt.strftime('%b')

#acc_copy major when 3+ casualties occured
#acc_copy_major = acc_copy[(acc_copy['number_of_casualties']>=3) & (acc_copy['lk_severity']=='Fatal')]
acc_copy_major = acc_copy[(acc_copy['number_of_casualties']>3)]

acc_copy_major['xtype'] = 'major'

#set_acc_ind = set(acc_copy_major['accident_index'])
#inc = acc_copy['accident_index'].isin(set_acc_ind)
#acc_copy_others = acc_copy[~inc]
acc_copy_others = acc_copy[(acc_copy['number_of_casualties']<=3)]

acc_copy_others['xtype'] = 'other'

acc_copy_full= pd.concat([acc_copy_major,acc_copy_others])

Time of day and Day of the week most major incidents happen

Following displays count of those accidents with casualties >=3 by Hour of the day and Day of the week, some patterns can be noticed, as during afternoon hours is when most major incidents happen and also during weekends, specially on Friday, Saturday and Sunday evenings.
Thus could be related to hours when most people move around and go out for fun, moving from one location to another using their vehicles.
Perhaps a good recommendation is to setup transit controls on strategic areas in these days/hours, and locations such as in/outs from the cities or major highway exits or even city entertainment areas, and include driver alcohol testing in these controls.

r = acc_copy_major.groupby(['day_of_week','lk_weekday','htime'],as_index=False)\

acc_major_by_day_hour = pd.crosstab(index=r['lk_weekday'],columns=r['htime'],values=r['accident_index'],\

#Heatmap of Time of day and Day of the week most major incidents happen
fig, ax = plt.subplots()
d = sns.heatmap(data=acc_major_by_day_hour,annot=True, fmt="d",cmap="OrRd", cbar=True, \
            linewidths=.5,annot_kws={"size": 12})
d.axes.set_title('Time of day and Day of the week most major incidents happen',fontsize=20)
d.set_xlabel("Hour of the Day",fontsize=14)
d.set_ylabel("Day of the Week",fontsize=14)
r = acc_copy_major.groupby(['day_of_week','lk_weekday','nmonth','smonth'],as_index=False)\

acc_major_by_month_day = pd.crosstab(index=[r['nmonth'],r['smonth']],columns=[r['day_of_week'],r['lk_weekday']],values=r['accident_index'],\

#Heatmap of Time of day and Day of the week most major incidents happen
fig, ax = plt.subplots()
d = sns.heatmap(data=acc_major_by_month_day,annot=True, fmt="d",cmap="YlGnBu", cbar=True, \
            linewidths=.5,annot_kws={"size": 12})
d.axes.set_title('Month and Day of the week most major incidents happen',fontsize=20)
d.set_xlabel("Day of the week",fontsize=14)

Months where most major incidents with 3+ casualties occur

Following is how major accidents happened throughout the year, seems first months of the year are significative when comparing to rest of the year, and during Summer till mid Autumn seems to be very high season of major incidents too.

#Barplot on Months where most Fatal casualties do occurr
acc_on_month = acc_copy_major.groupby(['smonth','nmonth','month'],as_index=False)\

e = sns.barplot(data=acc_on_month,x='smonth',y='accident_index',palette='autumn')
e.axes.set_title('Major incidents by Month',fontsize=14)
e.set_ylabel('No. of major incidents 3+ casualties',fontsize=12)

Days with most No. of major incidents with 3+ casualties

acc_on_day = acc_copy_major.groupby(['day_of_week','lk_weekday'],as_index=False)\
fig, ax = plt.subplots()
#Barplot on Days where most Fatal casualties do occurr
plt.title('Days with most No. of major incidents with 3+ casualties', fontdict={'size': 14})
plt.ylabel('No. of major incidents 3+ casualties',fontdict={'size': 12})
plt.xlabel('Day of the week',fontdict={'size': 12})

Hours when most major incidents occur

acc_on_hour = acc_copy_major.groupby(['htime'],as_index=False)\
fig, ax = plt.subplots()
plt.title('Hours when most major incidents occur', fontdict={'size': 14})
plt.ylabel('No. of major incidents 3+ casualties',fontdict={'size': 12})
plt.xlabel('Hours of the Day',fontdict={'size': 12})

Characteristics comparing major incidents with other accidents

Regarding severity categories defined in data, we may find that there are major incidents with 3+ casualties tagged as "Slight" or "Serious", which may lead to confusion when defining these categories.
One possible solution to this could be to track the number of deaths that occur instantly in an accident, or even derived deaths from injured.
As a result of this, such categories could be redefined in order to include these death indicators

r = acc_copy_full.groupby(['xtype','lk_severity','number_of_casualties','number_of_vehicles'],as_index=False)\

hue_colors = {'major':'red','other':'black'}
c_order= ['Slight','Serious','Fatal']
g = sns.relplot(data=r,x='number_of_vehicles',y='number_of_casualties',kind='scatter',\

g.set(xlim=(0, None))
g.set_axis_labels('No. of Vehicles', 'No. of Casualties',fontdict={'size': 14})
g._legend.set(title='Incident Type')

Now when evaluating location area of major accidents, we may find that larger number of casualties occur in Rural locations rather than Urban areas, whereas a larger number of vehicles seem to be involved in the these.

r = acc_copy_full.groupby(['xtype','lk_urban_or_rural_area','number_of_casualties','number_of_vehicles'],as_index=False)\

hue_colors = {'major':'red','other':'black'}
#c_order= ['Slight','Serious','Fatal']
g = sns.relplot(data=r,x='number_of_vehicles',y='number_of_casualties',kind='scatter',\

g.set(xlim=(0, None))
g.set_axis_labels('No. of Vehicles', 'No. of Casualties',fontdict={'size': 14})
g._legend.set(title='Incident Type')