Code-along 2024-02-27 Analyzing Google Sheets Data in Python
In this code-along, we will be analyzing Google Analytics data. The sample dataset contains obfuscated GA360 data for August 1, 2017 from the Google Merchandise Store, a real ecommerce store selling Google branded merchandise (Source).
This data is typical for what you'd see for an ecommerce website. It contains session data with traffic source, location and transcation info. Other data has been anonymized. The data is available for in a publicly available Google Sheet. We will be using a combination of SQL and Python to analyze this dataset, getting more insights into what is driving sales on this particular day.
Task 0: Setup
- Make sure you're signed into Google.
- Click this link to copy a Google Sheet with the data to your own Google account.
Task 1: Import the session data
In the SQL cell below, click "Connect a database", click "Google Sheets", follow the instructions, and select your copy of the Google Sheets file. Use a SQL statement to pull in all the data from the Google Sheet.
SELECT * FROM main.ga_sessions
Task 2: Import packages
We'll use pandas
and plotly
, so let's pull these in.
# Import packages
import pandas as pd
import plotly.express as px
Task 3: Clean the data
Let's inspect the types of the columns, and make adjustments where needed.
df.dtypes
df_clean = df.copy()
df_clean
Task 4: Explor the data
Use the predefined plotting function to explore different grouped session counts as either a bar chart or a pie chart
def plot_sessions_per_group(df, group, viz_type='bar'):
sessions_per_group = df.groupby(group).size().reset_index(name='sessions').sort_values(by='sessions')
if viz_type == 'bar':
return px.bar(sessions_per_group,
x=group,
y='sessions',
title=f'Number of sessions per {group}',
text='sessions')
elif viz_type == 'pie':
return px.pie(sessions_per_group,
names=group,
values='sessions',
title=f'Distribution of sessions per {group}')
else:
raise ValueError("viz_type can only be 'bar' or 'pie'")
Task 5: Show hourly visits per channelGrouping
Adjust the code below to show the hourly visits throughout the day per channelGrouping
.
import pandas as pd
import plotly.express as px
# Assuming df_clean is a DataFrame that has been defined earlier in the code
df_clean['visitStartTime'] = pd.to_datetime(df_clean['visitStartTime'], unit='s').dt.tz_localize('UTC').dt.tz_convert('America/Los_Angeles')
df_clean['hour'] = df_clean['visitStartTime'].dt.to_period('H')
visits_per_month = df_clean.groupby(['hour']).size().reset_index(name='visits')
visits_per_month['hour'] = visits_per_month['hour'].astype(str)
px.line(visits_per_month, x='hour', y='visits', title='Visits per hour throughout the day', markers=True)