2018-19 English Premier League: An Exploratory Data Analysis
- This dataset contains data for every game of the 2018-2019 English Premier League season.
- In this project, I aim to explore the data and communicate some interesting findings.
- The last section of this project shows the correlation between various columns of the data.
|Column|Description|
|---|---|
|Div|Division the game was played in|
|Date|The date the game was played|
|HomeTeam|The home team|
|AwayTeam|The away team|
|FTHG|Full time home goals|
|FTAG|Full time away goals|
|FTR|Full time result|
|HTHG|Half time home goals|
|HTAG|Half time away goals|
|HTR|Half time result|
|Referee|The referee of the game|
|HS|Number of shots taken by home team|
|AS|Number of shots taken by away team|
|HST|Number of shots on target taken by home team|
|AST|Number of shots on target taken by away team|
|HF|Number of fouls made by home team|
|AF|Number of fouls made by away team|
|HC|Number of corners taken by home team|
|AC|Number of corners taken by away team|
|HY|Number of yellow cards received by home team|
|AY|Number of yellow cards received by away team|
|HR|Number of red cards received by home team|
|AR|Number of red cards received by away team|
```python
# Import the necessary libraries
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset into a DataFrame
df = pd.read_csv("soccer18-19.csv")

# Print the number of rows and columns
print('Number of rows and columns:', df.shape)

# Show the first five rows
df.head()
```
Understanding Columns & Values
- The info() function is a useful tool for summarizing the data.
- Here, I'm going to analyze each column's name, datatype and number of non-null rows they carry.
- This is important to see if there are any missing values and to get familiar with the overall dataset.
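As a minimal sketch of this step, the snippet below calls info() on a small made-up DataFrame that stands in for soccer18-19.csv. The team names and scores are placeholders, not values from the real dataset.

```python
import pandas as pd

# Toy stand-in for the real match data; all values are made up.
df = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team C"],
    "AwayTeam": ["Team B", "Team C", "Team A"],
    "FTHG": [2, 0, 1],
    "FTAG": [1, 0, 3],
})

# info() prints each column's name, non-null count and dtype.
df.info()

# The same facts are also available as objects, which is handy for checks.
column_dtypes = df.dtypes
non_null_counts = df.notna().sum()
print(column_dtypes)
print(non_null_counts)
```

A non-null count that is lower than the number of rows would point to missing values in that column.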
- Now, let's use the isna() function and aggregate the result with sum() to count the missing values in each column.
- The data is complete as there are no null values.
- This means that I don't have to alter the dataframe in any way.
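The missing-value check can be sketched as follows. The toy DataFrame is an assumption for illustration and, unlike the real dataset, deliberately contains one missing value so that the count is visible.

```python
import pandas as pd

# Toy data with one deliberately missing value (None) in FTHG.
df = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team C"],
    "FTHG": [2, None, 1],
    "FTAG": [1, 0, 3],
})

# isna() marks missing cells as True; sum() counts them per column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Summing once more gives a single total for the whole DataFrame.
total_missing = int(missing_per_column.sum())
print("Total missing values:", total_missing)
```

On a complete dataset every per-column count, and therefore the total, is zero.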
- Here, we'll use the describe() function.
- This gives us helpful descriptive statistics for the numeric columns.
- Null values are excluded here. In our case, however, there aren't any.
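A minimal sketch of describe(), again on made-up values rather than the real match data:

```python
import pandas as pd

# Toy stand-in for the match data; all values are made up.
df = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team C", "Team D"],
    "FTHG": [2, 0, 1, 3],
    "FTAG": [1, 0, 3, 2],
})

# By default describe() summarizes only the numeric columns:
# count, mean, std, min, quartiles and max.
summary = df.describe()
print(summary)
```

The text column HomeTeam is left out of the default summary; passing include="all" would add it.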
- Using the unique() function to print the distinct values of the 'HomeTeam' column.
- This will show us all the teams that participated in the season.
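The lookup of distinct teams might look like the sketch below. The team names are placeholders, not the actual clubs from the season.

```python
import pandas as pd

# Toy fixtures; the team names are placeholders.
df = pd.DataFrame({
    "HomeTeam": ["Team A", "Team B", "Team C", "Team A", "Team B"],
    "AwayTeam": ["Team B", "Team C", "Team A", "Team C", "Team A"],
})

# unique() returns the distinct values in order of first appearance.
teams = df["HomeTeam"].unique()
print(teams)

# nunique() gives the number of distinct teams directly.
team_count = df["HomeTeam"].nunique()
print("Number of teams:", team_count)
```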
- Using the value_counts() function to print the number of rows for each unique team.
- This shows how many matches each team played at home.
- Note: Every team plays 19 matches at home and another 19 away.
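To illustrate the count, the sketch below builds a hypothetical three-team double round-robin, in which every team hosts every other team once, and checks that each team appears the same number of times at home and away. With n teams that number is n - 1, which gives 19 for a 20-team league.

```python
from itertools import permutations

import pandas as pd

# Hypothetical league of three placeholder teams.
teams = ["Team A", "Team B", "Team C"]

# Every ordered pair (home, away) occurs exactly once in a double round-robin.
fixtures = pd.DataFrame(
    list(permutations(teams, 2)),
    columns=["HomeTeam", "AwayTeam"],
)

# value_counts() counts how often each team appears in a column.
home_counts = fixtures["HomeTeam"].value_counts()
away_counts = fixtures["AwayTeam"].value_counts()
print(home_counts)
print(away_counts)
```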
```python
# Plot a boxplot of every numeric column to spot outliers
df.plot(
    kind='box',
    figsize=(12, 8)
)
```
- 'HS' (shots taken by the home team) has the widest spread of values.
- Every category has outliers. However, 'AC' (corners taken by the away team) has the largest number of outliers.
- Also, there are significant outliers for 'HY'. This means that there were matches with an unusually large number of yellow cards.
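The outlier observations above are read off the plot by eye. One way to put numbers on them is sketched below. It assumes the boxplot uses the default whiskers at 1.5 times the interquartile range, and count_outliers is a hypothetical helper name. The values are made up, so the resulting counts say nothing about the real data.

```python
import pandas as pd

# Toy stand-in for two of the numeric columns; all values are made up.
df = pd.DataFrame({
    "HY": [1, 2, 1, 0, 2, 1, 3, 2, 1, 6],
    "AC": [4, 5, 3, 6, 4, 5, 4, 5, 3, 4],
})


def count_outliers(column: pd.Series) -> int:
    """Count values outside the 1.5 * IQR whisker range (hypothetical helper)."""
    q1 = column.quantile(0.25)
    q3 = column.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return int(((column < lower) | (column > upper)).sum())


# Apply the helper to every column and list the counts, largest first.
outlier_counts = df.apply(count_outliers).sort_values(ascending=False)
print(outlier_counts)
```

Sorting the counts makes it easy to see which column has the most points beyond the whiskers.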
Checking the Distribution
Let's use histograms to further analyze the goal columns in the df.
- Plotting the distribution of full-time home and away goals.
```python
# Home goals
df.plot(
    kind="hist",
    y='FTHG',
    bins=5,
    # figsize=(12, 8)
)

# Away goals
df.plot(
    kind="hist",
    y='FTAG',
    bins=5,
    alpha=0.3,
    # figsize=(12, 8)
)
```
- Distribution of home and away goals is similar.
- However, the frequency of away goals seems to be higher in the 0 - 1 range.
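Because goals are whole numbers, the comparison can also be made with exact counts instead of histogram bins. The sketch below tabulates how often each goal total occurs for home and away sides and computes the share of matches with 0 or 1 goals. The data is made up, so the printed shares only demonstrate the calculation.

```python
import pandas as pd

# Toy stand-in for the goal columns; all values are made up.
df = pd.DataFrame({
    "FTHG": [2, 0, 1, 3, 1, 4],
    "FTAG": [1, 0, 0, 2, 1, 1],
})

# Count matches per goal total, aligned on the number of goals.
goal_counts = (
    pd.DataFrame({
        "home": df["FTHG"].value_counts(),
        "away": df["FTAG"].value_counts(),
    })
    .fillna(0)       # goal totals that never occur for one side
    .astype(int)
    .sort_index()
)
print(goal_counts)

# Share of matches in which each side scored 0 or 1 goals.
low_home_share = (df["FTHG"] <= 1).mean()
low_away_share = (df["FTAG"] <= 1).mean()
print(f"Home 0-1 goals: {low_home_share:.2f}, away 0-1 goals: {low_away_share:.2f}")
```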