Soccer Data Analysis

2018-19 English Premier League: An Exploratory Data Analysis

This dataset contains match-level data for every game of the 2018-2019 English Premier League season.
In this project, I aim to explore the data and communicate some interesting findings.
The last section of this project shows the correlation between various columns of the data.

Source of dataset.

Data Dictionary

Column	Explanation
Div	Division the game was played in
Date	The date the game was played
HomeTeam	The home team
AwayTeam	The away team
FTHG	Full time home goals
FTAG	Full time away goals
FTR	Full time result
HTHG	Half time home goals
HTAG	Half time away goals
HTR	Half time result
Referee	The referee of the game
HS	Number of shots taken by home team
AS	Number of shots taken by away team
HST	Number of shots taken by home team on target
AST	Number of shots taken by away team on target
HF	Number of fouls made by home team
AF	Number of fouls made by away team
HC	Number of corners taken by home team
AC	Number of corners taken by away team
HY	Number of yellow cards received by home team
AY	Number of yellow cards received by away team
HR	Number of red cards received by home team
AR	Number of red cards received by away team
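
The result columns can be cross-checked against the goal columns. The following is a minimal sketch, assuming FTR is coded as H (home win), D (draw) and A (away win). That coding is common for this kind of file but is not stated in the dictionary above, and the three rows are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up rows for illustration only; not taken from the real dataset.
sample = pd.DataFrame({
    "FTHG": [2, 1, 0],
    "FTAG": [1, 1, 3],
    "FTR": ["H", "D", "A"],
})

# Derive the expected result from the sign of the goal difference.
# Assumption: H = home win, D = draw, A = away win.
goal_diff = sample["FTHG"] - sample["FTAG"]
expected = np.sign(goal_diff).map({1: "H", 0: "D", -1: "A"})

# True when every recorded result agrees with the goals scored.
results_consistent = bool((sample["FTR"] == expected).all())
print(results_consistent)
```

The same comparison could be run on HTHG, HTAG and HTR for the half-time result, under the same coding assumption.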

# Importing necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Loading the dataset into a dataframe
df = pd.read_csv("soccer18-19.csv")

# Printing the number of rows and columns
print('Number of rows and columns:', df.shape)

# Printing out the first five rows
df.head()

Understanding Columns & Values

The info() function is a useful tool for summarizing the data.
Here, I'm going to check each column's name, data type, and number of non-null rows.
This is important for spotting missing values and getting familiar with the overall dataset.

df.info()

Now, let's use the isna() function and aggregate its result with sum() to get the count of missing values in each column.

df.isna().sum()

The data is complete as there are no null values.
This means that I don't have to alter the dataframe in any way.
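
To show what this check looks like when a value is missing, here is a small sketch on a made-up frame. The rows are illustrative and not taken from the dataset.

```python
import pandas as pd

# Made-up frame with one missing value, for illustration only.
sample = pd.DataFrame({
    "HomeTeam": ["Arsenal", "Chelsea", "Everton"],
    "FTHG": [2, None, 1],
})

# Per-column count of missing values, the same call used on df above.
missing_per_column = sample.isna().sum()

# Total across all columns; 0 would mean the data is complete.
total_missing = int(missing_per_column.sum())
print(missing_per_column)
print("Total missing:", total_missing)
```

On the real dataframe, a total of 0 confirms that no cleaning step is needed before the analysis.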

Useful Statistics

Here, we'll use the describe() function.
By default it returns descriptive statistics (count, mean, standard deviation, minimum, quartiles, and maximum) for the numeric columns.
Null values are excluded from these calculations. In our case, however, there aren't any.

df.describe()
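
The way describe() skips nulls can be seen on a tiny example. This is a sketch with made-up shot counts, not values from the dataset.

```python
import pandas as pd

# Made-up shot counts with one missing entry, for illustration only.
shots = pd.Series([10, 14, None, 12], name="HS")

summary = shots.describe()

# 'count' reports only the non-null values, and the mean is computed over those.
print(summary["count"])
print(summary["mean"])
```

Here the count is 3 rather than 4, and the mean is taken over the three available values.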

Using the unique() function to print the distinct values of the 'HomeTeam' column.
This will show us all the teams that participated in the season.

df['HomeTeam'].unique()

Using the value_counts() function to print the number of rows for each unique team.
This shows how many matches each team played as the home team.
Note: Every team plays 19 matches at home and another 19 away.

df['HomeTeam'].value_counts(dropna=True)
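
The home and away balance noted above can be checked programmatically. Below is a minimal sketch on a made-up double round-robin with four hypothetical teams, where each team should appear 3 times at home and 3 times away. On the real dataframe the expected count would be 19.

```python
from itertools import permutations

import pandas as pd

# Hypothetical team names; every ordered pair is one fixture (home, away).
teams = ["Team A", "Team B", "Team C", "Team D"]
fixtures = pd.DataFrame(list(permutations(teams, 2)),
                        columns=["HomeTeam", "AwayTeam"])

home_counts = fixtures["HomeTeam"].value_counts()
away_counts = fixtures["AwayTeam"].value_counts()

# In a double round-robin each team hosts every other team exactly once.
expected = len(teams) - 1
balanced = bool((home_counts == expected).all()
                and (away_counts == expected).all())
print(len(fixtures), balanced)
```

Note that dropna=True is already the default for value_counts(), so passing it explicitly does not change the output.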
