FIFA World Cup Analysis

    Overview

    This dataset contains information about the matches played at the World Cups held from 1930 to 2014. Among others, the following fields can be noted:

    • Year: The year in which the match was played
    • Datetime: The exact date the match was played, in the format DD MM YYYY - HH:MM
    • Stage: The stage in which the match was played: {Group stage, 1/8 final, 1/4 final, 1/2 final, final}.
    • Stadium: The name of the stadium where the match was played
    • City: The name of the city where the match took place
    • Home Team Name: Name of the home team
    • Home Team Goals: Number of goals the home team has scored
    • Away Team Name: Name of the away team
    • Away Team Goals: Number of goals scored by the away team
    • Win conditions: Specifies if the match was won under special conditions or not
    • Attendance: Number of people watching the game

    Loading the dataset

    import pandas as pd
    matches = pd.read_csv("WorldCupMatches.csv", encoding="utf-8")
    matches.info()

    We can see that only 852 records are actually non-null, so let's drop the rows containing null values

    matches = matches.dropna()
    matches.shape

    Thus, we are left with 850 non-null records, which is satisfactory
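As a quick reminder of what dropna does, here is a toy example (illustrative data, not the World Cup dataset): any row containing at least one missing value is removed.

```python
import numpy as np
import pandas as pd

# Toy frame: only the first row is fully populated
df = pd.DataFrame({
    "Year": [1930, np.nan, 1934],
    "City": ["Montevideo", "Rome", None],
})
print(df.dropna().shape)  # (1, 2): rows with any missing value are dropped
```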

    Research questions

    • Do teams score more or less during games taking place at higher altitudes?
    • Are teams more evenly matched in later stages?
    • Do Home and Away designations affect team performance?

    Do teams score more or less during games taking place at higher altitudes?

    Before answering this question, note that altitude is the height of an object (its center of gravity) relative to a reference point, which is usually taken to be mean sea level (MSL). The value obtained is called the elevation, and it is often used interchangeably with altitude. When the elevation is high, the air pressure and the amount of air are lower, which can cause difficulties for some people; this leads us to ask whether more or fewer goals are scored at high altitude. To answer this question we will proceed in two steps:

    • Find the coordinates (latitude, longitude) of each stadium. To do this, we will scrape data from Google or Wikipedia to maximize the chances of finding results
    • Use the Google Maps API to find the elevation of each stadium
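The second step can be sketched against the Google Maps Elevation API (a minimal sketch: the endpoint and JSON shape follow Google's documentation, but a valid API key is required and the commented call uses placeholder values):

```python
import requests

ELEVATION_URL = "https://maps.googleapis.com/maps/api/elevation/json"

def format_locations(coords: list) -> str:
    # The API accepts several points separated by "|"
    return "|".join(f"{lat},{lng}" for lat, lng in coords)

def find_elevations(coords: list, api_key: str) -> list:
    """Return the elevation in meters of each (latitude, longitude) pair."""
    params = {"locations": format_locations(coords), "key": api_key}
    response = requests.get(ELEVATION_URL, params=params)
    return [result["elevation"] for result in response.json().get("results", [])]

# find_elevations([(41.38, 2.12)], "YOUR_API_KEY")  # placeholder API key
```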

    Collection of coordinates

    Data Collection Process

    To scrape data from Wikipedia, we will:

    • Perform a Google search for the Wikipedia page of each stadium
    • Retrieve from each page the coordinates of the stadium. We will need the googlesearch and beautifulsoup libraries

    On the other hand, a Google search for a given stadium can also return its coordinates directly. In this analysis we will prefer the data coming from Google; if no coordinates are found there, we will look them up on Wikipedia.

    Finally, since different cities can have stadiums with the same name, we will search by both city and stadium. Moreover, we will add the keyword stadium to the query to better focus the search on stadium-related results.
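The query construction described above can be sketched as a small helper (build_query is a hypothetical name; the format matches the search performed below):

```python
def build_query(stadium: str, city: str) -> str:
    # The "stadium" keyword plus the city disambiguates
    # venues with the same name in different cities
    return f"stadium of {stadium} in {city}"

print(build_query("Estadio Centenario", "Montevideo"))
# stadium of Estadio Centenario in Montevideo
```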

    !pip install -r requirements.txt
    Search for Wikipedia links

    As agreed, we will collect the Wikipedia links of all the stadiums, to be used whenever the Google search does not return any coordinates

    from googlesearch import search
    import re
    links: dict = {}
    stadiums: set = set(zip(matches.Stadium, matches.City))
    
    for stadium, city in stadiums:
        for link in search(f'stadium of {stadium} in {city}', tld="co.in", num=1, stop=None, pause=2):
            if re.match(".*wikipedia.*", link): # Until we find a wikipedia link
                links[(stadium, city)] = link
                break
    Data scraping

    The functions find_coords_wikipedia and find_coords_google search for coordinates on Wikipedia and Google respectively. The first uses the Wikipedia links found above, while the second issues a Google query of the form stadium {stadium} in {city} latitude and longitude.

    import requests
    import bs4 as bs4
    
    def coord_parser(coords: str) -> tuple:
        """
        Retrieve the latitude and longitude from a string of the form
        "{Latitude}° (N|S), {Longitude}° (E|W)"
        """
        latitude, longitude = [tuple(x.replace("°", "").split()) for x in coords.split(",")]
        latitude = float(f"{'-' if latitude[1] == 'S' else ''}{latitude[0]}")
        longitude = float(f"{'-' if longitude[1] == 'W' else ''}{longitude[0]}")
        return latitude, longitude
    
    def find_coords_wikipedia(link: str) -> tuple:
        wikipedia_page: requests.models.Response = requests.get(link)
        soup = bs4.BeautifulSoup(wikipedia_page.text, "html.parser")
        geo_span = soup.find("span", {"class": "geo"})
        if geo_span:
            latitude, longitude = [float(x) for x in geo_span.text.split(";")]
        else:
            latitude = longitude = None
        return latitude, longitude
    
    def find_coords_google(query: str) -> tuple:
        google_page: requests.models.Response = requests.get(f"https://www.google.com/search?q={query}")
        soup = bs4.BeautifulSoup(google_page.text, "html.parser")
        geo_span = soup.find(string=re.compile(".+° (N|S), .*° (W|E)"))
        latitude, longitude = coord_parser(geo_span) if geo_span else (None, None)
        return latitude, longitude
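With the two lookup functions in place, the Google-first strategy agreed on earlier can be sketched as follows (resolve_coords is a hypothetical helper; the lookup functions are passed as parameters so the fallback logic can be tested in isolation):

```python
def resolve_coords(stadium: str, city: str, wiki_link: str,
                   google_lookup, wikipedia_lookup) -> tuple:
    """Prefer coordinates found via Google; fall back to the stadium's
    Wikipedia page when the Google search returns nothing."""
    latitude, longitude = google_lookup(
        f"stadium {stadium} in {city} latitude and longitude")
    if latitude is None and wiki_link is not None:
        latitude, longitude = wikipedia_lookup(wiki_link)
    return latitude, longitude
```

In the notebook, google_lookup and wikipedia_lookup would be find_coords_google and find_coords_wikipedia, and wiki_link the entry stored in links for the given stadium and city.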