Finalized_CODE_Project_Hamburg

    The plan of our project was to connect measured air pollution observations with local climate zone (LCZ) maps of the surrounding areas. The study area was Hamburg.

    The question to answer was whether air quality decreases with increasing temperature, or whether the land use type and surrounding area have a higher impact on air pollution than temperature does.

    The data we used were air pollution data from the European Environment Agency, weather data from daswetter.com, and an LCZ map from the standard files provided by WUDAPT.

    For the timeframe, we restricted our data to the last 20 years (2003 to 2023) and used the most current LCZ map of Hamburg.

    Our code is organized in three main phases. In the first phase we process the observation data and organize the station metadata. In the second phase, the code retrieves the LCZ class for every station point. In the third phase we merge the three dataframes to conduct our analyses.
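    To illustrate the second phase, here is a minimal sketch of how the LCZ class could be sampled at the station coordinates with rasterio. The raster file name 'hamburg_lcz.tif' is an assumption, and the stationsMetadata dataframe (with its Lon/Lat columns) is only built further below:

    ### Minimal sketch of phase two: sample the LCZ raster at every station point.
    ### 'hamburg_lcz.tif' is an assumed file name; stationsMetadata is created
    ### later in this notebook. The raster CRS must match the coordinate system
    ### of the station points (reproject one of them first if it does not).
    import rasterio

    with rasterio.open('hamburg_lcz.tif') as lczRaster:
        coords = list(zip(stationsMetadata['Lon'], stationsMetadata['Lat']))
        stationsMetadata['LCZ'] = [value[0] for value in lczRaster.sample(coords)]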

    We expected to find a correlation between the pollutants and temperature, which could hint at how air quality may develop under climate change. If no such correlation was found, we still expected to gain an idea of how the surrounding areas impact air quality.

    Our results showed a moderate to good correlation between O3 and temperature. However, O3 observations were only available at five stations. Still, we might expect an increase of O3 in future years due to climate change.

    NO2 did not show a correlation with temperature at most stations, but it was measured at almost all stations. Regarding the LCZ classification, NO2 was higher in areas with industry or trees and lower in open areas.
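    As a hedged illustration of these two analyses, assuming a merged dataframe named merged with columns code, o3, no2, Mean_Temp, and LCZ (which is only produced in the third phase):

    ### Illustrative sketch only: 'merged' and its column names are assumptions
    ### standing in for the dataframe produced by the merge in the third phase.
    o3CorrPerStation = (merged.dropna(subset=['o3', 'Mean_Temp'])
                              .groupby('code')
                              .apply(lambda df: df['o3'].corr(df['Mean_Temp'])))
    no2PerLcz = merged.groupby('LCZ')['no2'].mean()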

    People working on this script: Marcelo Soeira and Lea Fink

    import pandas as pd
    import matplotlib.pyplot as plt
    import matplotlib.dates as mdates
    import datetime as dt
    import seaborn as sns 
    import numpy as np
    import rasterio 
    import rasterio.plot
    from rasterio.mask import mask
    import matplotlib as mpl
    import geopandas as gpd
    from fiona.crs import from_epsg
    from shapely.geometry import box
    from geopy.distance import lonlat, distance, geodesic
    import random #module to generate random numbers for testing the program

    Read both necessary files into the workspace, specifying the separator (sep=',' or sep=';').

    ### Load and store the air temperature and air pollution data contained in CSV files as pandas dataframes
    ### stations -> air pollution observations
    ### weather  -> air temperature observations
    
    # na_values=0 treats zero entries as missing values in both files
    stations = pd.read_csv('hamburgstation.csv', sep=",", na_values=0)
    weather = pd.read_csv('weatherdata_hamburg.csv', sep=";", na_values=0)

    Check what information is contained in the CSV files and how the data can be used.

    stations.info()
    weather.info()

    Findings

    stations: a huge file with an enormous amount of data. To accommodate the timeframe of our project, we restrict the number of analyzed air pollutants as well as the time span of the observations (see the sketch after this list). In the examples that follow, we selected only O3 and NO2 for photooxidants and PM2.5 for particles. The time span considered is 2002 onwards, as this matches the available temperature data.
    weather: contains only the mean temperature (°C) measured on the respective day. Is it necessary to download max/min temperatures?
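    A minimal sketch of the time-span restriction described above; the exact date format in the CSV is an assumption, hence the defensive parse:

    ### Sketch of the time-span restriction. The format of the date column is an
    ### assumption; errors='coerce' turns unparseable entries into NaT instead of raising.
    stations['date'] = pd.to_datetime(stations['date'], errors='coerce')
    stations = stations[stations['date'] >= '2002-01-01']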

    For our analyses, it will be useful to have the metadata of the air pollution observation points in a separate, smaller dataframe.

    Therefore, we created a third dataframe named stationsMetadata to store this information, based on values from the stations dataframe.

    ### Brief preprocessing of station metadata
    ### Extract the desired fields from the stations dataframe and store them in stationsMetadata, eliminating duplicate rows according to the station code
    
    stationsMetadata=stations[['code', 'site', 'site_type', 'longitude', 'latitude']].drop_duplicates(subset=['code'])
    
    #Uncomment the following line to write the dataframe to csv and save it to disk
    #stationsMetadata.to_csv("stationsMetadata.csv", sep=';')
    ### Function created to process weather/air pollution observation station metadata into a standard dataframe useful for further processing
    
    def importStationData(sourceDf, agencyIdHeader, siteNameHeader, siteDescripHeader, lonHeader, latHeader):
        # Map the standard destination headers onto the source column names
        sourceListHeader=["Source_ID", agencyIdHeader, siteNameHeader, siteDescripHeader, lonHeader, latHeader]
        sourceIndexDf=pd.DataFrame(sourceDf.index, columns=["Source_ID"])
        sourceStationCount=len(sourceDf)
    
        destinyListHeader=["Source_ID","Agency_ID","Site_Name","Site_Description", "Lon", "Lat"]
        # Sentinel values (-99999) set both the dtype and an easy-to-spot default for each column
        destinyListTypes=[int(-99999),str(-99999),str(-99999),str(-99999),float(-99999),float(-99999)]
    
        # Pre-fill the destination dataframe with one sentinel-valued column per header
        dataMatrix={}
        for i in range(len(destinyListHeader)):
            dataMatrix[destinyListHeader[i]]=[destinyListTypes[i]]*sourceStationCount
        destinyDf=pd.DataFrame(data=dataMatrix)
    
        # Copy the values row by row: column 0 takes the original index (Source_ID),
        # the remaining columns take the corresponding source columns
        for i in range(len(destinyDf)):
            for g in range(len(destinyListHeader)):
                if g == 0:
                    destinyDf.at[i,destinyListHeader[g]]=sourceIndexDf.iloc[i][sourceListHeader[g]]
                else:
                    destinyDf.at[i,destinyListHeader[g]]=sourceDf.iloc[i][sourceListHeader[g]]
    
        return destinyDf
    ### Main processing of station metadata
    stationsMetadata=importStationData(stationsMetadata,"code", "site", "site_type", "longitude","latitude")
    #print(stationsMetadata)
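    For comparison, the same standardized metadata table could likely be built without the explicit loops, using reset_index and rename; this sketch is kept for reference only, and the loop version above remains the one in use:

    ### Vectorized sketch equivalent to importStationData, for comparison only.
    ### It starts again from the stations dataframe, so the Source_ID values match
    ### the original row indices just as in the loop version.
    altMetadata = (stations[['code', 'site', 'site_type', 'longitude', 'latitude']]
                   .drop_duplicates(subset=['code'])
                   .reset_index()
                   .rename(columns={'index': 'Source_ID', 'code': 'Agency_ID',
                                    'site': 'Site_Name', 'site_type': 'Site_Description',
                                    'longitude': 'Lon', 'latitude': 'Lat'}))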
    Now that the station metadata is ready, we can go back to preprocessing the observation data itself: reformatting, removing undesired data to reduce processing requirements, removing invalid data, etc.
    First, we removed unnecessary information from the stations dataframe and stored what was actually required in a new dataframe (stationsSubset).
    We also included a new column (Mean_Temp) to receive data from the weather dataframe.
    ### Here we extract the required information from the stations dataframe and store it in a new one called stationsSubset.
    ### We want a new column in this dataframe to store the temperature values, so we append a column filled with -99999.0. This value was chosen to easily identify wrong values later on.
    
    #Data extraction into the new dataframe
    stationsSubset = stations[['date','code', 'no2', 'o3', 'pm2.5']]
    #Create a list with matching length to stationsSubset and constant value of -99999.0
    meanTemp=[float(-99999.0)]*len(stationsSubset)
    #Add this list as column
    stationsSubset=stationsSubset.assign(Mean_Temp=meanTemp)
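    The Mean_Temp column will later be filled from the weather dataframe; a minimal sketch of how this could look when matching by date (the weather column names 'date' and 'temp_mean' are assumptions, not verified against the CSV):

    ### Sketch of filling Mean_Temp by date; 'date'/'temp_mean' in weather are
    ### assumed column names. A left merge keeps every pollution observation and
    ### attaches the daily mean temperature where a matching date exists.
    weather['date'] = pd.to_datetime(weather['date'], errors='coerce')
    stationsSubset['date'] = pd.to_datetime(stationsSubset['date'], errors='coerce')
    stationsSubset = (stationsSubset.drop(columns='Mean_Temp')
                      .merge(weather[['date', 'temp_mean']]
                             .rename(columns={'temp_mean': 'Mean_Temp'}),
                             on='date', how='left'))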