Ethan Weiland/

Mercy Rule? Predicting the Full Time Result from the Half Time Result in the English Premier League

rm(list=ls()) # Clearing environment

ipak <- function(pkg){ # Function for installing and loading packages
    new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
    if (length(new.pkg))
      install.packages(new.pkg, dependencies = TRUE)
    sapply(pkg, require, character.only = TRUE)
}
packages <- c("tidyverse",

soccer <- read_csv('data/soccer18-19.csv.gz', show_col_types = FALSE) # Loading in data

1. Executive Summary

A team in the English Premier League has signed some young players. To optimize their playing time, the coach wants to predict the full time result at half time, so that if the game is already won (or lost) the young players can see the field. To determine how predictable the full time result is at half time, this research analyzes data from every match of the 2018/2019 Premier League season. Cross-tabulations, data visualization, and statistical inference reveal that the full time result is indeed related to the half time result, that the home team wins more often than not, and that the point in the season does not influence the probability of a given full time result. A machine learning model (a decision tree for classification) is trained to combine these findings with other match statistics to best predict the full time result. The model can be visualized as a flowchart that the coach can easily trace based on match characteristics; it predicts home wins well, away wins moderately well, and draws poorly. Overall, the findings of this analysis, combined with the coach's subject matter expertise, will help optimize the amount of development time for the new young players.

2. Introduction

The English Premier League is the most widely watched soccer league in the world. Each year, the twenty best teams in England and Wales play a 38-game round-robin, and the best performing team over these 38 games is crowned champion. A team in the league has signed some younger players and wants to know how predictable the full time result is at half time. If the coach can be confident that the game is won (or lost) by half time, they can give playing time to these younger players, helping them to develop.

In this project, I determine how predictable the full time result is from the half time result and other match statistics, using data from the 2018/2019 English Premier League season. I first explore the association between full time result and various features using descriptive statistics, data visualization, and statistical inference. I then build a decision tree for classification, using all of the information from the features to make the best prediction of full time result.

3. Data

Data from the 2018/2019 English Premier League season is used for this analysis. It's important to note that the 2018/2019 season is the last season of the Premier League before the COVID-19 pandemic upended the league (and world) in March 2020. The data is from DataCamp Workspace, originally from Each row in the dataset is a match, with the following variables available:

Variable   Description
Div        Division the game was played in
Date       The date the game was played
HomeTeam   The home team
AwayTeam   The away team
FTHG       Full time home goals
FTAG       Full time away goals
FTR        Full time result
HTHG       Half time home goals
HTAG       Half time away goals
HTR        Half time result
Referee    The referee of the game
HS         Number of shots taken by home team
AS         Number of shots taken by away team
HST        Number of shots taken by home team on target
AST        Number of shots taken by away team on target
HF         Number of fouls made by home team
AF         Number of fouls made by away team
HC         Number of corners taken by home team
AC         Number of corners taken by away team
HY         Number of yellow cards received by home team
AY         Number of yellow cards received by away team
HR         Number of red cards received by home team
AR         Number of red cards received by away team

# Converting Date to Month (month() and date() come from lubridate)
soccer <- soccer %>%
	mutate(Month = month(date(Date)))

# Converting full time match statistics to half time match statistics
soccer <- soccer %>%
	mutate(HS = HS / 2,
		  AS = AS / 2,
		  HST = HST / 2,
		  AST = AST /2,
		  HF = HF / 2,
		  AF = AF / 2,
		  HC = HC / 2,
		  AC = AC / 2,
		  HY = HY / 2,
		  AY = AY / 2,
		  HR = HR / 2,
		  AR = AR / 2)

# Factoring appropriate variables
soccer <- soccer %>%
	mutate(HomeTeam = factor(HomeTeam),
		  AwayTeam = factor(AwayTeam),
		  FTR = factor(FTR),
		  HTR = factor(HTR),
		  Referee = factor(Referee))

# Selecting variables used in analysis
soccer <- soccer %>%

A few adjustments were made to the variables. First, the "Date" variable was converted to "Month". This aggregation reduces the number of unique values to 10. Additionally, "Month" can be used for predicting future matches, whereas "Date" includes a year value that will not recur. Second, all of the match statistics (home shots, away yellow cards, etc.) are full time match statistics. To approximate their values at half time, each of these statistics was divided by 2, which assumes that events are spread evenly across the two halves. Finally, the "Div" variable was dropped because all of these matches have the same value for this variable: English Premier League.
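As a rough illustration of the Date-to-Month step, the sketch below parses a few hypothetical date strings with an explicit format in base R. The day/month/year format and the example values are assumptions, not taken from the dataset; the names `raw_dates`, `parsed`, `month_number`, and `season_months` are illustrative only.

```r
# Hypothetical date strings; the day/month/year format is an assumption
raw_dates <- c("10/08/2018", "01/01/2019", "12/05/2019")

# Parse with an explicit format so day and month cannot be swapped silently
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Extract the calendar month as an integer (1 = January, ..., 12 = December)
month_number <- as.integer(format(parsed, "%m"))

# A season running from August through May covers 10 distinct months
season_months <- c(8:12, 1:5)
length(unique(season_months))
```

Stating the format explicitly is a defensive choice: if the strings were in a different layout, `as.Date` would return `NA` rather than a plausible but wrong month.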

4. Analysis

4.1 Simple Association Between Half Time Result and Full Time Result

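# Cross-tabulating half time result (rows) against full time result (columns)
table <- table(soccer$HTR, soccer$FTR, dnn=c("Half Time", "Full Time"))
# Share of matches whose full time result equals the half time result
sum(as.vector(diag(table))) / sum(as.vector(table))
# Chi-squared test of independence between half time and full time result
chisq.test(table(soccer$HTR, soccer$FTR))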
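To make the diagonal calculation concrete, here is a minimal sketch on a small, made-up set of eight matches (the values are hypothetical, not results from the season). It computes the same agreement share as above and, as one possible extension, the row-wise proportions that estimate the probability of each full time result given the half time result.

```r
# Toy half time / full time results (hypothetical values, not the season data)
result_levels <- c("A", "D", "H")
half_time <- factor(c("H", "H", "D", "A", "D", "H", "A", "D"), levels = result_levels)
full_time <- factor(c("H", "H", "D", "A", "H", "D", "A", "A"), levels = result_levels)

toy_table <- table(half_time, full_time, dnn = c("Half Time", "Full Time"))

# Diagonal cells are matches whose result did not change after half time
agreement <- sum(diag(toy_table)) / sum(toy_table)

# Row-wise proportions: share of each full time result within a half time result
conditional <- prop.table(toy_table, margin = 1)

agreement
round(conditional, 2)
```

In this toy data, 5 of the 8 matches keep their half time result, so `agreement` is 0.625. The row-wise view is often more useful to a coach than the single agreement figure, since it separates how safe a half time lead is from how often a half time draw stays a draw.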
