2018-19 English Premier League: An Exploratory Data Analysis
- This dataset contains data on every match of the 2018-19 English Premier League season.
- In this project, I aim to explore the data and communicate some interesting findings.
- The last section of this project shows the correlation between various columns of the data.
Source of dataset.
Data Dictionary
Column | Explanation |
---|---|
Div | Division the game was played in |
Date | The date the game was played |
HomeTeam | The home team |
AwayTeam | The away team |
FTHG | Full time home goals |
FTAG | Full time away goals |
FTR | Full time result |
HTHG | Half time home goals |
HTAG | Half time away goals |
HTR | Half time result |
Referee | The referee of the game |
HS | Number of shots taken by home team |
AS | Number of shots taken by away team |
HST | Number of shots on target by home team |
AST | Number of shots on target by away team |
HF | Number of fouls made by home team |
AF | Number of fouls made by away team |
HC | Number of corners taken by home team |
AC | Number of corners taken by away team |
HY | Number of yellow cards received by home team |
AY | Number of yellow cards received by away team |
HR | Number of red cards received by home team |
AR | Number of red cards received by away team |
#Importing necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
# Loading the dataset into a dataframe
df = pd.read_csv("soccer18-19.csv")
#Printing the number of rows and columns
print('Number of rows and columns:', df.shape)
#Printing out the first five rows
df.head()
Understanding Columns & Values
- The info() function is a useful tool for summarizing the data.
- Here, I'm going to look at each column's name, its datatype and the number of non-null values it holds.
- This shows whether there are any missing values and helps me get familiar with the overall dataset.
df.info()
- Now, let's use the isna() function and aggregate its result with sum() to get the count of missing values in each column.
df.isna().sum()
- The data is complete as there are no null values.
- This means that I don't have to alter the dataframe in any way.
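For comparison, here is a minimal sketch of what the same check reports when a value is missing. The three-row frame and its values are hypothetical and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical three-match frame with one missing referee (illustration only).
toy = pd.DataFrame({
    "HomeTeam": ["A", "B", "C"],
    "FTHG": [2, 0, 1],
    "Referee": ["R1", None, "R2"],
})

# isna() marks each missing cell as True; sum() then counts them per column.
missing_per_column = toy.isna().sum()

# Summing once more gives a single number for the whole frame.
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero, as in this project, means no rows need to be dropped or filled.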
Useful Statistics
- Here, we'll be using the describe() function.
- This gives us helpful descriptive stats (count, mean, standard deviation, minimum, quartiles and maximum) for each numeric column.
- Null values are excluded here. In our case, however, there aren't any.
df.describe()
- Using the unique() function to print the distinct values of the 'HomeTeam' column.
- This will show us all the teams that participated in the season.
df['HomeTeam'].unique()
- Using the value_counts() function to print out the number of rows for each unique team.
- This shows how many matches each team played as Home Team.
- Note: Every team plays 19 matches at home and another 19 away.
df['HomeTeam'].value_counts(dropna=True)
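The 19 home and 19 away figure follows from the double round-robin format: each of the 20 teams hosts every other team once. The sketch below rebuilds such a fixture list with placeholder team names (hypothetical, not the real clubs) and runs the same value_counts() check on it.

```python
from itertools import permutations

import pandas as pd

# Placeholder names; the real file holds the 20 club names.
teams = [f"Team{i:02d}" for i in range(1, 21)]

# In a double round-robin, every ordered (home, away) pair occurs exactly once.
fixtures = pd.DataFrame(
    list(permutations(teams, 2)), columns=["HomeTeam", "AwayTeam"]
)

home_counts = fixtures["HomeTeam"].value_counts()
away_counts = fixtures["AwayTeam"].value_counts()

print("Matches in the season:", len(fixtures))  # 20 * 19
print("Every team has 19 home matches:", bool(home_counts.eq(19).all()))
print("Every team has 19 away matches:", bool(away_counts.eq(19).all()))
```

Running the two `.eq(19).all()` checks on the real `df` would be a quick way to confirm that no fixture is missing or duplicated.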
Data Visualizations
Spotting Outliers
# Plotting a box plot of every numeric column to spot outliers.
df.plot(
    kind='box', figsize=(12, 8)
)
- The widest spread of values is in 'HS' (shots taken by the home team).
- Every category has outliers. However, 'AC' (corners taken by the away team) has the largest number of outliers.
- There are also notable outliers for 'HY'. This means that there were matches in which the home team received an unusually large number of yellow cards.
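Reading outliers off a box plot by eye is approximate, so it can help to count them. By default a box plot draws as outliers the points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. The sketch below applies that rule to each numeric column; the helper name `count_iqr_outliers` and the toy values are hypothetical.

```python
import pandas as pd


def count_iqr_outliers(frame: pd.DataFrame) -> pd.Series:
    """Count values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] in each numeric column."""
    numeric = frame.select_dtypes("number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column fences with the frame's columns.
    outside = (numeric < q1 - 1.5 * iqr) | (numeric > q3 + 1.5 * iqr)
    return outside.sum()


# Hypothetical card and corner counts, not values from the dataset.
toy = pd.DataFrame({
    "HY": [1, 2, 1, 2, 1, 2, 1, 6],
    "AC": [4, 5, 4, 5, 4, 5, 4, 5],
})

outlier_counts = count_iqr_outliers(toy)
print(outlier_counts.sort_values(ascending=False))
```

Calling `count_iqr_outliers(df)` on the real dataframe would give a per-column count to set against the visual impression from the plot.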
Checking the Distribution
Let's use histograms to look more closely at individual columns of the dataframe.
- Plotting the distributions of full-time home goals and full-time away goals.
# Home goals
df.plot(
    kind="hist",
    y='FTHG',
    bins=5,
    # figsize=(12, 8)
)
# Away goals
df.plot(
    kind="hist",
    y='FTAG',
    bins=5,
    alpha=0.3,
    # figsize=(12, 8)
)
- The distributions of home and away goals are similar.
- However, away goals appear to be more concentrated in the 0 - 1 range.
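Goals are whole numbers, so equal-width bins can blur the comparison. If the goal counts run from 0 to 6 (an assumption about the data), bins=5 puts the bin edges at 1.2, 2.4 and so on, and matches with 0 and 1 goals land in the same bar. A tabular alternative is to compute the share of matches for each exact goal count. The sketch below does this on hypothetical scores.

```python
import numpy as np
import pandas as pd

# Hypothetical full-time scores, not values from the dataset.
toy = pd.DataFrame({
    "FTHG": [0, 1, 1, 2, 2, 3, 0, 1, 4, 6],
    "FTAG": [0, 0, 1, 1, 0, 2, 1, 1, 3, 0],
})

# Edges of five equal-width bins over 0..6: none of the inner edges is an integer.
print(np.linspace(0, 6, 6))

# Share of matches for each exact goal count, home and away side by side.
max_goals = int(toy[["FTHG", "FTAG"]].max().max())
goal_share = pd.DataFrame({
    col: toy[col]
    .value_counts(normalize=True)
    .reindex(range(max_goals + 1), fill_value=0.0)
    for col in ["FTHG", "FTAG"]
})
print(goal_share)

# Combined share of matches in which the side scored 0 or 1 goals.
low_scoring = goal_share.loc[[0, 1]].sum()
print(low_scoring)
```

Applied to the real `df`, `low_scoring` would put a number on the observation that away goals cluster in the 0 - 1 range.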