Analyzing a Time Series of the Thames River in Python
Time series data is everywhere, from watching your stock portfolio to monitoring climate change, and even live-tracking as local cases of a virus become a global pandemic. In this live code-along, you'll work with a time series that tracks the tide levels of the Thames River. You'll first load the data and inspect it visually, and then perform calculations on the dataset to generate some summary statistics. Next, you'll decompose the time series into its component attributes. You'll end with a taster of autocorrelation: a first step in time series forecasting.
Here's a map of the locations of the tidal gauges along the River Thames in London.
```python
# Package imports
import pandas as pd               # for data manipulation
import seaborn as sns             # for data visualization
import matplotlib.pyplot as plt   # for data visualization
```
Task 1: Read one file to explore the data format and prepare the data for analysis.
The dataset consists of 13 `.txt` files containing comma-separated data. We'll begin by exploring one of them and preparing it for analysis. We can then create a helper function in case you are interested in analyzing other data later.
The dataset comes with a data description file that documents each field:

| Field | Description | Format / Units |
|---|---|---|
| Date and time | Date and time of measurement, in GMT. Note the tide gauge is accurate to one minute. | dd/mm/yyyy hh:mm:ss |
| Water level | High or low water level measured by the tide gauge. Tide gauges are accurate to 1 centimetre. | metres (Admiralty Chart Datum (CD), Ordnance Datum Newlyn (ODN), or Trinity High Water (THW)) |
| Flag | High water flag = 1, low water flag = 0 | Categorical (0 or 1) |
Let's begin by loading the London Bridge data. When loading time series data, always check which time zone the data is provided in. Sometimes your data might be provided in UTC and will need to be converted to local time if you want to do local analysis. Fortunately, the description above tells us the data is in GMT, which is the same as Coordinated Universal Time (UTC).
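As an aside, if your timestamps ever do arrive in UTC, pandas can localize and convert them for you. A minimal sketch with a couple of invented timestamps (not taken from this dataset):

```python
import pandas as pd

# Two example naive timestamps, invented for illustration
ts = pd.Series(pd.to_datetime(['2011-06-01 00:00:00', '2011-06-01 12:00:00']))

# Mark the naive timestamps as UTC, then convert to London local time;
# this accounts for British Summer Time (UTC+1 in June) automatically
ts_utc = ts.dt.tz_localize('UTC')
ts_london = ts_utc.dt.tz_convert('Europe/London')
```

Because our Thames data is already in GMT/UTC, we can skip this step here.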
- Use pandas to read the London Bridge dataset from the comma-separated file named `Data/10-11_London_Bridge.txt` and assign it to the variable `lb`.

```python
lb = pd.read_csv('Data/10-11_London_Bridge.txt')  # comma-separated .txt file
lb
```
Since one of the column headings in the file contains a comma (`"flag, HW=1 or LW=0"`), `pd.read_csv` has created an extra, empty column. We'll need to drop this extra column and rename our column headings. Shorter, more memorable column names will also facilitate our analysis later on.
- Use `lb.describe()` to confirm that the last column is empty and contains no data.
- Create a new DataFrame `df` which takes only the first three columns of `lb`, and rename them `datetime`, `water_level`, and `is_high_tide`.

```python
# Take only the first three columns (copy to avoid modifying a view of lb)
df = lb[lb.columns[0:3]].copy()

# Rename columns
df.columns = ['datetime', 'water_level', 'is_high_tide']
```
`lb.info()` above showed us that both the `datetime` and `water_level` columns are of type `object`. We'll convert these to `datetime` and `float`, respectively. We'll also add two columns, `month` and `year`, which we'll need to access later on.
- Use `pd.to_datetime()` to convert the `datetime` column to the `datetime` format. Since the dataset is large, this step can take a few seconds.
- Use `.astype(float)` to convert the `water_level` column to the `float` data type.

```python
# Convert to datetime (the dates are in dd/mm/yyyy format, so set dayfirst=True)
df['datetime'] = pd.to_datetime(df['datetime'], dayfirst=True)

# Convert to float
df['water_level'] = df['water_level'].astype(float)

# Create extra month and year columns for easy access
df['month'] = df['datetime'].dt.month
df['year'] = df['datetime'].dt.year

df
```
Before moving on to conduct analysis, let's define a helper function for this data cleaning so we don't have to repeat it each time. The function takes a `DataFrame` (which we'll read from our `.txt` files), keeps only the first three columns, renames them, formats the `datetime` column, and converts `water_level` to a `float` data type.
```python
def clean_data(data):
    # Take only the first three columns (copy to avoid modifying a view)
    data = data[data.columns[0:3]].copy()

    # Rename columns
    data.columns = ['datetime', 'water_level', 'is_high_tide']

    # Convert `datetime` to `datetime` format (dates are dd/mm/yyyy)
    data['datetime'] = pd.to_datetime(data['datetime'], dayfirst=True)

    # Convert `water_level` to float format
    data['water_level'] = data['water_level'].astype(float)

    # Create extra month and year columns for easy access
    data['month'] = data['datetime'].dt.month
    data['year'] = data['datetime'].dt.year

    return data
```
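We can sanity-check the helper on a tiny synthetic frame that mimics the file layout, including the spurious empty column. The values below are invented, not real Thames readings, and the helper is restated inside the snippet so it runs standalone:

```python
import pandas as pd

def clean_data(data):
    # Same helper as above, repeated so this snippet is self-contained
    data = data[data.columns[0:3]].copy()
    data.columns = ['datetime', 'water_level', 'is_high_tide']
    data['datetime'] = pd.to_datetime(data['datetime'], dayfirst=True)
    data['water_level'] = data['water_level'].astype(float)
    data['month'] = data['datetime'].dt.month
    data['year'] = data['datetime'].dt.year
    return data

# Invented sample rows in the same shape as the raw files
raw = pd.DataFrame({
    'Date and time': ['01/05/1911 15:40:00', '02/05/1911 11:25:00'],
    'water level (m ODN)': [' 3.7130', ' -2.2600'],
    'flag, HW=1 or LW=0': [1, 0],
    'extra': [None, None],  # stands in for the spurious empty column
})

sample = clean_data(raw)
```

The extra column is dropped, the strings become proper `datetime` and `float` values, and the `month`/`year` columns appear as expected.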
Task 2: Analyze the London Bridge data to get a sense of the water level.
Let's begin analyzing the data with a histogram of `water_level`. This plot shows that the data is bimodal, meaning it has two separate peaks. When we plot the data with `is_high_tide=0` and `is_high_tide=1` separately, we get two approximately normal distributions with different means and variances. Moving forward, we'll analyze the low tide and high tide data separately.
- Create a histogram of `water_level` for low tide readings (`is_high_tide==0`), using 100 bins.
- Create a histogram of `water_level` for high tide readings (`is_high_tide==1`), using 100 bins.

```python
plt.hist(df.query('is_high_tide==0')['water_level'], bins=100)
plt.hist(df.query('is_high_tide==1')['water_level'], bins=100)
plt.show()
```
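Since we're treating the two tides separately, grouped summary statistics make a useful numerical companion to the histograms. A sketch of the pattern on a small invented frame with the same column names (the numbers below are made up, not Thames readings):

```python
import pandas as pd

# Invented readings using the tutorial's column layout
df = pd.DataFrame({
    'water_level': [6.8, 7.1, 6.9, -0.4, -0.2, -0.5],
    'is_high_tide': [1, 1, 1, 0, 0, 0],
})

# Mean and spread of water level, split by low tide (0) and high tide (1)
stats = df.groupby('is_high_tide')['water_level'].agg(['mean', 'std', 'min', 'max'])
```

Running the same `groupby` on the real `df` gives the separate means and variances that the two histogram peaks suggest.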
Boxplots will give us a sense of the min, max, range, and outliers of the data. We'll create these separately for high tide and low tide. By default, the whiskers of the boxplot extend to 1.5 × the interquartile range (IQR) beyond the quartiles.
- Create a boxplot of `water_level` for low tide readings (`is_high_tide==0`).
- Create a boxplot of `water_level` for high tide readings (`is_high_tide==1`).

```python
plt.figure(figsize=(8, 2))
sns.boxplot(data=df.query('is_high_tide==0'), x='water_level', color='SkyBlue')
plt.show()

plt.figure(figsize=(8, 2))
sns.boxplot(data=df.query('is_high_tide==1'), x='water_level', color='Tomato')
plt.show()
```
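The 1.5 × IQR whisker rule the boxplots use can also be computed directly, which is handy if you want to count or inspect the outlier points rather than just see them. A sketch using a few invented values (not real Thames data):

```python
import pandas as pd

# Invented low-tide-style readings, with one obvious outlier at the end
levels = pd.Series([-0.5, -0.3, -0.4, -0.2, -0.6, -0.1, 1.8])

# Quartiles and interquartile range
q1, q3 = levels.quantile(0.25), levels.quantile(0.75)
iqr = q3 - q1

# Whisker bounds at 1.5 * IQR beyond the quartiles
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Points outside the whiskers are the ones seaborn draws as outlier dots
outliers = levels[(levels < lower) | (levels > upper)]
```

Applying the same calculation to `df.query('is_high_tide==1')['water_level']` would list the extreme high-tide events directly.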