Workspace
Sean Stayte/

Soccer Through the Ages

0
Beta
Spinner

Soccer Through the Ages

This dataset contains information on international soccer games throughout the years. It includes results of soccer games and information about the players who scored the goals. The dataset contains data from 1872 up to 2023.

💾 The data

  • data/results.csv - CSV with results of soccer games between 1872 and 2023
    • home_score - The score of the home team, excluding penalty shootouts
    • away_score - The score of the away team, excluding penalty shootouts
    • tournament - The name of the tournament
    • city - The name of the city where the game was played
    • country - The name of the country where the game was played
    • neutral - Whether the game was played at a neutral venue or not
  • data/shootouts.csv - CSV with results of penalty shootouts in the soccer games
    • winner - The team that won the penalty shootout
  • data/goalscorers.csv - CSV with information on goal scorers of some of the soccer games in the results CSV
    • team - The team that scored the goal
    • scorer - The player who scored the goal
    • minute - The minute in the game when the goal was scored
    • own_goal - Whether it was an own goal or not
    • penalty - Whether the goal was scored as a penalty or not

The following columns can be found in all datasets:

  • date - The date of the soccer game
  • home_team - The team that played at home
  • away_team - The team that played away

These shared columns fully identify the game that was played and can be used to join data between the different CSV files.

Source: GitHub

📊 Some guiding questions and visualization to help you explore this data:

  1. Which are the 15 countries that have won the most games since 1960? Show them in a horizontal bar plot.
  2. How many goals are scored in total in each minute of the game? Show this in a bar plot, with the minutes on the x-axis. If you're up for the challenge, you could even create an animated Plotly plot that shows how the distribution has changed over the years.
  3. Which 10 players have scored the most hat-tricks?
  4. What is the proportion of games won by each team at home and away? What is the difference between the proportions?
  5. How many games have been won by the home team? And by the away team?

Question 1: Which are the 15 countries that have won the most games since 1960? Show them in a horizontal bar plot

Unknown integration
DataFrameavailable as
df1
variable
SELECT home_team, COUNT(*) as games_won
FROM
	(SELECT home_team
	FROM 'data/results.csv'
	WHERE home_score > away_score AND date >= '1960%'
	UNION ALL
	SELECT away_team
	FROM 'data/results.csv' 
	WHERE away_score > home_score AND date >= '1960%')
GROUP BY home_team
ORDER BY games_won DESC
LIMIT 15;

This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

Question 2: How many goals are scored in total in each minute of the game? Show this in a bar plot, with the minutes on the x-axis

Why are so many goals scored in the 45' and 90' minutes?

Unknown integration
DataFrameavailable as
df
variable
SELECT minute, COUNT(*) as total_goals
FROM 'data/goalscorers.csv'
GROUP BY minute
ORDER BY total_goals DESC
LIMIT 20;
This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

Question 2b: Why are so many goals scored in the 90th minute?

First thought: Added time isn't included. But it is - highest goal minute is 122

Theory: If the match goes to extra time, then goals are counted as being scored in e.g. 95', but if there are 5 mins of added time, then it's counted as 90'

Ans: At Qatar 2022, France scored in added time at 90+1, Poland in added time at 90+9. In the dataset both are listed as 90' In another match, Croatia-Brazil went to extra time, these goals are listed as 117 and 105. (Actually Neymar scored at 105+1) So hypothesis correct.

Unknown integration
DataFrameavailable as
df2
variable
SELECT r.date, r.home_team, r.away_team, r.home_score, r.away_score, g.minute
FROM 'data/results.csv' as r
FULL JOIN 'data/goalscorers.csv' as g
ON r.date = g.date AND r.home_team = g.home_team AND r.away_team = g.away_team
WHERE minute = '122'
This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

Question 3: Which 10 players have scored the most hat-tricks?

Success! (The data on goalscorers is incomplete - hence why Messi isn't in the top 10)

Unknown integration
DataFrameavailable as
df5
variable
SELECT scorer, COUNT(scorer) AS hattrick_count
FROM (
	SELECT 
		date,
		scorer,
		COUNT(scorer)/3 AS count
	FROM 'data/goalscorers.csv'
	WHERE scorer NOT LIKE 'NA'
	GROUP BY date, scorer
	HAVING count >= 1) AS sub
GROUP BY scorer
ORDER BY hattrick_count DESC
LIMIT 10;
This query is taking long to finish...Consider adding a LIMIT clause or switching to Query mode to preview the result.

💼 Develop a case study for your portfolio

After exploring the data, you can create a comprehensive case study using this dataset. We have provided an example objective below, but feel free to come up with your own - the world is your oyster!

Example objective: The UEFA Euro 2024 tournament is approaching. Utilize the historical data to construct a predictive model that forecasts potential outcomes of the tournament based on the team draws. Since the draws are not known yet, you should be able to configure them as variables in your notebook.

You can query the pre-loaded CSV files using SQL directly. Here’s a sample query:

You can also use SQL cells to join the tables:

Alternatively, you can import the data using pandas, for example:

import pandas as pd

results = pd.read_csv("data/results.csv")



  • AI Chat
  • Code