Jagrit Singh/

Soccer Data Analysis


2018-19 English Premier League: An Exploratory Data Analysis

  • This dataset contains data on every game of the 2018-2019 English Premier League season.
  • In this project, I aim to explore the data and communicate some interesting findings.
  • The last section of this project shows the correlation between various columns of the data.

Source of dataset.

Data Dictionary

Column      Description
Div         Division the game was played in
Date        The date the game was played
HomeTeam    The home team
AwayTeam    The away team
FTHG        Full time home goals
FTAG        Full time away goals
FTR         Full time result
HTHG        Half time home goals
HTAG        Half time away goals
HTR         Half time result
Referee     The referee of the game
HS          Number of shots taken by home team
AS          Number of shots taken by away team
HST         Number of shots taken by home team on target
AST         Number of shots taken by away team on target
HF          Number of fouls made by home team
AF          Number of fouls made by away team
HC          Number of corners taken by home team
AC          Number of corners taken by away team
HY          Number of yellow cards received by home team
AY          Number of yellow cards received by away team
HR          Number of red cards received by home team
AR          Number of red cards received by away team

#Importing necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Loading the dataset into a dataframe
df = pd.read_csv("soccer18-19.csv")

#Printing the number of rows and columns
print('Number of rows and columns:', df.shape)

# Printing out the first five rows
df.head()

Understanding Columns & Values

  • The info() function is a useful tool to summarize the data.
  • Here, I'm going to look at each column's name, datatype and number of non-null rows.
  • This is important to see if there are any missing values and to get familiar with the overall dataset.
  • Next, I use the isna() function and aggregate the result with sum() to get the count of missing values in each column.
  • The data is complete as there are no null values.
  • This means that I don't have to alter the dataframe in any way.
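
To make the two checks above concrete, here is a minimal sketch. It runs on a tiny made-up DataFrame (the name `sample` and its values are my own stand-ins, not rows from the real dataset), so it works without the CSV file. On the real data, the same calls would be made on `df`.

```python
import pandas as pd

# Toy stand-in for the real DataFrame; the values are made up for illustration.
sample = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Everton"],
    "AwayTeam": ["Chelsea", "Everton", "Arsenal"],
    "FTHG": [2, 0, 1],
    "FTAG": [1, 0, 3],
})

# info() prints each column's name, dtype and non-null count.
sample.info()

# isna() marks missing cells as True; sum() then counts them per column.
missing_per_column = sample.isna().sum()
print(missing_per_column)

# Summing once more gives a single number for the whole table.
total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)
```

If the total is zero, as it is for this toy frame, no cleaning step is needed before moving on.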

Useful Statistics

  • Here, we'll be using the describe() function.
  • This gives us helpful descriptive stats (count, mean, standard deviation, min, max and quartiles) for the numeric columns.
  • Null values are excluded here. In our case, however, there aren't any.
  • Using the unique() function to print the distinct values of the 'HomeTeam' column.
  • This will show us all the teams that participated in the season.
  • Using the value_counts() function to print out the number of rows for each unique team.
  • This shows how many matches each team played as the home team.
  • Note: Every team plays 19 matches at home and the other 19 away.
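
The three calls above can be sketched on a small example. The fixture list below is hypothetical: three placeholder teams ("A", "B", "C") playing each other home and away, with made-up goal counts. It only illustrates why every team ends up with the same number of home matches (teams minus one, which is 19 in a 20-team league).

```python
import pandas as pd

# Hypothetical double round-robin between three placeholder teams.
teams = ["A", "B", "C"]
fixtures = pd.DataFrame(
    [(home, away) for home in teams for away in teams if home != away],
    columns=["HomeTeam", "AwayTeam"],
)
fixtures["FTHG"] = [1, 2, 0, 3, 1, 1]  # made-up home goals, one per fixture

# describe() summarizes numeric columns only (here: FTHG).
summary = fixtures.describe()
print(summary)

# unique() lists each team once; value_counts() counts rows per team.
distinct_teams = fixtures["HomeTeam"].unique()
home_counts = fixtures["HomeTeam"].value_counts()
print(distinct_teams)
print(home_counts)
```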

Data Visualizations

Speculating Outliers

# Plotting a boxplot of the numeric columns to spot outliers.
df.plot(kind='box', figsize=(12, 8))
  • The widest spread of values is in 'HS' (shots taken by the home team).
  • Every category has outliers. However, the largest number of outliers is in 'AC' (corners taken by the away team).
  • Also, there are significant outliers for 'HY'. This means that there were matches with an unusually large number of yellow cards for the home team.
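
Reading outliers off a boxplot by eye is imprecise, so a count per column can help back up the observations above. The sketch below assumes the usual boxplot convention that points further than 1.5 × IQR from the quartiles are drawn as outliers (matplotlib's default whisker setting, as far as I know). The helper `count_iqr_outliers` and the toy values are my own illustration, not part of the original notebook.

```python
import pandas as pd

def count_iqr_outliers(frame: pd.DataFrame) -> pd.Series:
    """Count values beyond 1.5 * IQR from the quartiles, per numeric column."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    # Comparisons align the per-column bounds with the DataFrame's columns.
    is_outlier = (numeric < lower) | (numeric > upper)
    return is_outlier.sum().sort_values(ascending=False)

# Made-up values: one match with an unusually high yellow-card count.
toy = pd.DataFrame({
    "HY": [1, 2, 1, 2, 1, 2, 1, 6],
    "AC": [4, 5, 4, 5, 4, 5, 4, 5],
})
outlier_counts = count_iqr_outliers(toy)
print(outlier_counts)
```

Calling `count_iqr_outliers(df)` on the real DataFrame would rank the columns by number of outliers under that assumption.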

Checking the Distribution

Let's use histograms to take a closer look at the goal columns in the DataFrame.

  • Plotting the distribution of full-time home and away goals.
# Home goals
df.plot(
    kind="hist",
    y='FTHG',
    bins=5,
    # figsize=(12, 8)
)

# Away goals
df.plot(
    kind="hist",
    y='FTAG',
    bins=5,
    alpha=0.3,
    # figsize=(12, 8)
)
  • The distributions of home and away goals are similar.
  • However, low scores (0 to 1 goals) appear to be more frequent for away teams than for home teams.
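
A histogram with only five bins lumps several goal counts together, so the comparison can also be checked with exact counts. This is a minimal sketch on made-up scores; the helper name `share_at_most` is my own. On the real data, the same steps would use `df['FTHG']` and `df['FTAG']`.

```python
import pandas as pd

# Made-up full-time scores for six matches, for illustration only.
toy = pd.DataFrame({
    "FTHG": [0, 1, 2, 3, 1, 2],
    "FTAG": [0, 1, 1, 0, 2, 1],
})

def share_at_most(goals: pd.Series, limit: int = 1) -> float:
    """Fraction of matches in which a side scored `limit` goals or fewer."""
    return float((goals <= limit).mean())

home_share = share_at_most(toy["FTHG"])
away_share = share_at_most(toy["FTAG"])
print(f"Home sides with 0-1 goals: {home_share:.0%}")
print(f"Away sides with 0-1 goals: {away_share:.0%}")

# Side-by-side count of matches for each exact number of goals.
goal_table = (
    pd.DataFrame({
        "home": toy["FTHG"].value_counts(),
        "away": toy["FTAG"].value_counts(),
    })
    .fillna(0)
    .astype(int)
    .sort_index()
)
print(goal_table)
```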