1. 2022 World Cup
The 2022 FIFA World Cup was the 22nd edition of the tournament. Argentina beat the reigning world champions, France, in the final to win their third World Cup. The World Cup is held every four years and features 32 teams from around the world that qualify through their respective regions.
This was the second World Cup held in Asia, with 64 matches played over 29 days in November and December. The 32 teams were divided into eight groups of four. Each team played every other team in its group once, and the top two teams from each group advanced to the single-elimination knockout stage.
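The match count follows from the format described above. As a quick check in base R (the variable names are illustrative and do not come from the datasets used later):

```r
# Group stage: 8 groups of 4 teams, every pair in a group meets once
groups <- 8
teams_per_group <- 4
group_matches <- groups * choose(teams_per_group, 2)

# Knockout stage: round of 16, quarter-finals, semi-finals,
# third place match, and final
knockout_matches <- 8 + 4 + 2 + 1 + 1

total_matches <- group_matches + knockout_matches
print(total_matches)
```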
This project explores which factors, if any, helped some teams go further in the tournament than others. Most of the data used in this notebook comes from a dataset on Kaggle. The FIFA rankings and group for each team also come from Kaggle, from a source that includes two graphics showing how the tournament played out.
2. Libraries and Datasets
Here we import the packages used for the analysis and load the data. Unnecessary columns are removed from the tables, and two of the datasets are merged into one dataframe called world_cup_data to make them easier to work with throughout the analysis.
# loading packages for analysis
library("readr")
library("dplyr")
library("forcats")
library("ggplot2")
library("stringr")
options(repr.plot.width =9, repr.plot.height =9)
Data importing
# importing dataset containing stats for each team at the world cup
team_stats <- read_csv("datasets/kaggle/Squad Standard Stats.csv")
# importing dataset containing an overall view of how each team performed
final_standings_table <- read_csv("datasets/kaggle/Final League Table.csv")
# importing dataset that contains the world ranking of each team and the group they were in for the tournament
team_rankings <- read_csv("datasets/kaggle/2022_world_cup_groups.csv")
Data cleaning and joining
# removing 21 columns from team_stats, keeping the first four columns and Assists
team_stats <- team_stats %>%
  select(c(1:4, Assists))
# removing 2 columns from final_standings_table and sorting by team name
final_standings_table <- final_standings_table %>%
  select(-Points, -`xG Difference per 90`) %>%
  arrange(Team)
#renaming USA to United States and Korea Republic to South Korea for proper joining
final_standings_table$Team <- str_replace(final_standings_table$Team, "USA", "United States")
final_standings_table$Team <- str_replace(final_standings_table$Team, "Korea Republic", "South Korea")
team_stats$Team <- str_replace(team_stats$Team, "Korea Republic", "South Korea")
# reordering the levels of `Depth of the Campaign` from the final down to the group stage
final_standings_table <- final_standings_table %>%
  mutate(`Depth of the Campaign` = fct_relevel(`Depth of the Campaign`, c("F", "3P", "QF", "R16", "GR")))
#joining the dataframes together
world_cup_data <- final_standings_table %>% inner_join(team_stats, by = "Team")
str(world_cup_data)
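Because inner_join drops any row whose key has no match in the other table, a spelling mismatch in Team would silently remove a country. One way to check for this before joining is anti_join. The sketch below uses two made-up miniature tables (standings_demo and stats_demo are hypothetical, not the Kaggle data):

```r
library(dplyr)

# Hypothetical miniature tables; the spellings are invented for illustration
standings_demo <- tibble(Team = c("Argentina", "USA", "Korea Republic"))
stats_demo     <- tibble(Team = c("Argentina", "United States", "Korea Republic"))

# Rows of the first table that have no matching Team in the second
unmatched <- anti_join(standings_demo, stats_demo, by = "Team")
print(unmatched$Team)
```

If a check like this returns no rows in either direction, the joined table should keep all 32 teams.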
3. Goals Scored
Graph 3-1 shows the number of goals scored by each team in the tournament, colored by how far the team advanced. 'F' stands for the final (the first- and second-place teams), '3P' for the third-place match (the third- and fourth-place teams, who were knocked out in the semi-finals), 'QF' for the quarter-finals, 'R16' for the round of sixteen, and 'GR' for the group stage.
ggplot(world_cup_data, aes(x = `Goals For`, y = reorder(Team, `Goals For`),
                           color = `Depth of the Campaign`)) +
  geom_point(size = 5) +
  geom_segment(aes(xend = 0, yend = Team), linewidth = 3) +
  labs(x = "Goals Scored", y = "Country", title = "Graph 3-1 Goals by Team") +
  geom_text(aes(label = `Goals For`), color = "white", size = 3) +
  scale_color_discrete("Length of Campaign") +
  theme(axis.text.y = element_text(size = 14))
In general, teams that scored more goals went further in the tournament, but there are some interesting exceptions. Spain didn't get past the first knockout round, yet scored more goals than both teams in the third-place match and one of the quarter-finalists. Germany scored the most goals of any team that didn't get past the group stage. Belgium, one of the top-rated teams in the tournament, managed only a single goal in their campaign.
4. Knockout stage teams
Here we find the teams that made it past the group stage. This can be done in two ways: by filtering for teams whose campaign wasn't "GR", or by filtering for teams that played more than three matches. Two new dataframes, teams_qualified and teams_not_qualified, are created for use later in the analysis.
# finding qualified teams by standings table, returned as a data frame for use later
teams_qualified <- world_cup_data %>%
  filter(`Depth of the Campaign` != "GR") %>%
  select(Team, `Depth of the Campaign`, `Goal Difference`) %>%
  arrange(`Depth of the Campaign`)
# finding teams who didn't make the knockout stage, returned as a dataframe for use later
teams_not_qualified <- world_cup_data %>%
  filter(`Depth of the Campaign` == "GR") %>%
  select(Team, `Depth of the Campaign`, `Goal Difference`) %>%
  arrange(`Depth of the Campaign`)
# finding qualified teams by matches played, returned as a vector instead of a dataframe
teams_qualified_by_matches <- world_cup_data %>%
  filter(`Matches Played` > 3) %>%
  pull(Team)
print(teams_qualified_by_matches)
print(teams_qualified, n = 16)
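The two filters should pick out the same set of teams, which can be confirmed with setequal rather than by reading the two printouts. A minimal sketch on a made-up four-team table (demo and its column names Depth and Matches are hypothetical stand-ins for world_cup_data and its columns):

```r
# Hypothetical miniature version of the joined table
demo <- data.frame(
  Team    = c("Argentina", "France", "Germany", "Qatar"),
  Depth   = c("F", "F", "GR", "GR"),
  Matches = c(7, 7, 3, 3)
)

# Method 1: campaign went beyond the group stage
by_depth <- demo$Team[demo$Depth != "GR"]

# Method 2: more than the three group matches were played
by_matches <- demo$Team[demo$Matches > 3]

# TRUE when both methods select the same teams, regardless of order
print(setequal(by_depth, by_matches))
```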
5. Team rankings
Here we compare the FIFA rankings of the teams that did and did not qualify for the knockout stages: the four highest-ranked teams among those that didn't qualify, and the four lowest-ranked teams among those that did. A lower number means a better rank. For example, a team with a FIFA ranking of 5 is rated as better than a team with a FIFA ranking of 12.
Lowest ranked teams that qualified
# lowest ranked 4 who qualified (the four largest ranking numbers)
print("Lowest ranked teams of those in knockout stages")
teams_qualified %>%
  inner_join(team_rankings, by = "Team") %>%
  top_n(n = 4, wt = `FIFA Ranking`)
Highest ranked teams that did not qualify
# highest ranked 4 who didn't qualify (the four smallest ranking numbers)
print("Highest ranked teams of those knocked out in group stages")
teams_not_qualified %>%
  inner_join(team_rankings, by = "Team") %>%
  top_n(n = -4, wt = `FIFA Ranking`)
The four lowest-rated teams to make the knockout stages were all ranked worse than 20, and all lost in the first knockout round. In fact, seven of the eight lowest-rated qualifiers (the bottom half) were knocked out in the first knockout round. The exception was Morocco (FIFA ranking 22), who went on to become the first African team to reach the semi-finals of a World Cup.
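The selections above use top_n, which recent dplyr releases mark as superseded in favor of slice_min and slice_max. Since a low ranking number means a better team, those two functions may read more directly. A minimal sketch on invented rankings (demo_rankings and its values are hypothetical):

```r
library(dplyr)

# Hypothetical rankings, invented for illustration
demo_rankings <- tibble(
  Team = c("A", "B", "C", "D", "E"),
  `FIFA Ranking` = c(3, 41, 12, 27, 8)
)

# Best-ranked two teams: the smallest ranking numbers
best_two <- demo_rankings %>% slice_min(`FIFA Ranking`, n = 2)

# Worst-ranked two teams: the largest ranking numbers
worst_two <- demo_rankings %>% slice_max(`FIFA Ranking`, n = 2)

print(best_two$Team)
print(worst_two$Team)
```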
6. Expected goals (xG)
Expected goals (xG) is a metric that measures the probability of a shot resulting in a goal, on a scale from 0 to 1. For example, a shot with an xG value of 0.2 is expected to result in a goal 20% of the time. The value is based on historical data for shots with similar characteristics, such as distance to the goal, angle to the goal, type of assist, and more. Expected goals can give an indication of team quality and show whether a team is over- or underperforming.
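Team-level xG is usually obtained by adding up the xG of the individual shots. The sketch below illustrates this with invented shot values (shot_xg is hypothetical), and also shows how likely it is that none of the shots is scored, under the simplifying assumption that shots are independent:

```r
# Hypothetical xG values for five shots by one team in one match
shot_xg <- c(0.20, 0.05, 0.76, 0.10, 0.33)

# Match xG: the sum of the individual shot probabilities
match_xg <- sum(shot_xg)

# Probability that no shot is scored, treating shots as independent
# (a simplifying assumption, not part of the xG definition)
p_no_goal <- prod(1 - shot_xg)

print(match_xg)
print(round(p_no_goal, 3))
```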
A higher xG doesn't necessarily mean a team should have won the game, as xG measures only the quality of the chances created. However, xG does put numbers on age-old sayings such as "he scores that 9 times out of 10". Graph 6-1 below plots the number of goals scored by each team against their xG for the tournament. This lets us see which teams may have over- or underperformed, as well as how accurate xG was for the 2022 World Cup.
Visualizing xG
A dashed line with a slope of 1 was added to the plot to show how closely goals tracked xG. Points close to the dashed line mean a team scored about as many goals as expected. Points above and to the left of the line indicate a team that "overperformed", scoring more than expected, while points below and to the right indicate a team that "underperformed", scoring fewer than expected. The size of each point shows the same thing as a ratio: the larger the point, the more goals the team scored per expected goal.
# plotting goals scored vs xG for each team, colored by `Depth of the Campaign`
ggplot(world_cup_data, aes(xG, `Goals For`, color = `Depth of the Campaign`,
                           size = `Goals For` / xG)) +
  geom_point(alpha = 0.5) +
  scale_y_continuous(limits = c(0, 20)) +
  scale_x_continuous(limits = c(0, 20)) +
  labs(x = "Expected Goals", y = "Goals Scored", size = "Goals per xG",
       color = "Length of Campaign", title = "Graph 6-1 Goals vs xG per Team") +
  geom_abline(intercept = 0, slope = 1, color = "black", linetype = "dashed")
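The position of a point relative to the dashed line can also be computed directly. A minimal sketch with invented numbers (the demo table and its columns goals and xg are hypothetical, not values from the dataset):

```r
# Hypothetical goals and xG for three teams
demo <- data.frame(
  Team  = c("A", "B", "C"),
  goals = c(9, 4, 1),
  xg    = c(6.5, 4.1, 3.2)
)

# Positive difference: above the line; negative: below the line
demo$diff  <- demo$goals - demo$xg

# Ratio used for point size in the plot: goals per expected goal
demo$ratio <- demo$goals / demo$xg

demo$side <- ifelse(demo$diff > 0, "overperformed", "underperformed")
print(demo)
```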
Accuracy of xG
Overall, xG appears to have been quite accurate in aggregate: across all teams there was approximately one goal for every expected goal.
# Seeing accuracy of xG
goal_to_XG_ratio <- world_cup_data %>%
  # total goals divided by total xG across all teams
  summarize(Goal_to_xG_Ratio = sum(`Goals For`) / sum(xG)) %>%
  pull(Goal_to_xG_Ratio)
print("Ratio of Goals to xG")
print(goal_to_XG_ratio, digits = 5)
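An aggregate ratio near 1 can hide large errors for individual teams, because overperformance by one team cancels underperformance by another. As a hedged illustration on invented numbers (demo is hypothetical), the mean absolute difference between goals and xG gives a complementary per-team view:

```r
# Hypothetical goals and xG for three teams
demo <- data.frame(goals = c(9, 4, 1), xg = c(6.5, 4.1, 3.2))

# Aggregate ratio: total goals per total expected goal
aggregate_ratio <- sum(demo$goals) / sum(demo$xg)

# Mean absolute error: average per-team gap between goals and xG
mean_abs_error <- mean(abs(demo$goals - demo$xg))

print(round(aggregate_ratio, 3))
print(mean_abs_error)
```

In this made-up example the aggregate ratio is close to 1 even though the teams are off by more than a goal and a half on average.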