Window Functions in SQL
Window functions are used to perform calculations on a group of rows without combining them into a single result. This can be useful when you are calculating metrics such as an average over a period of time or a running total. Unlike a GROUP BY
, it does not group rows into a single row. In this workspace, you will learn:
- How to define a window function that includes a set of rows
- How to divide the set of rows included within a window function into partitions or frames
- How to calculate moving averages and running totals.
This workspace uses the Olympics data from the course PostgreSQL Summary Stats and Window Functions. You are free to create an integration to your own data or use one of the existing integrations. You can learn more about integrations here.
Calculating row numbers
The example below uses ROW_NUMBER()
to assign a row number to each row. The row numbers are assigned based on the results of the query. The OVER()
clause defines the range of rows upon which the calculations would be performed. By default, the OVER()
clause will use the entire result of the query.
Note: ROW_NUMBER()
assigns a unique number to each row. If you want rows with the same value to receive identical numbers, you can use RANK()
or DENSE_RANK()
. Refer to this video to learn more about the differences between these two functions.
👇 To run a SQL cell like the one below, click inside the cell to select it and click "Run" or the ► icon. You can also use Shift-Enter to run a selected cell.
SELECT
year,
athlete,
-- Assign a row number to athletes
ROW_NUMBER() OVER() AS athlete_number
FROM medals.summer_medals
LIMIT 5
Ordering the rows within a window
To define an order, you can use ORDER BY
within the OVER()
clause. This order is applied to the range of rows defined within the window function. In the example below, row numbers are assigned to each event based on the descending number of tickets sold.
If you want to know more about ordering in window functions, you can check out this video.
SELECT
athlete,
medal_count,
-- Assign row numbers to athletes based on the descending number of medals won
ROW_NUMBER() OVER(ORDER BY medal_count DESC) as medal_ranking
FROM
(
-- Fetch athlete and the number of medals they won
SELECT athlete, COUNT(*) AS medal_count
FROM medals.summer_medals
GROUP BY athlete
) AS medal_counts
Calculating window functions over partitions
OVER()
has a clause called PARTITION BY
, which divides the range of rows defined within the window function into partitions. If it is not used, the entire result of the query is treated as one partition. The window is divided into partitions based on the country
in the example below.
You can learn more about partitions in this video.
SELECT
athlete,
country,
medal_count,
--Assign row numbers to each country based on the descending number of medals won while partitioning by country
ROW_NUMBER() OVER(PARTITION BY country ORDER BY medal_count DESC) AS medal_ranking
FROM
(
-- Fetch athlete, country, and the number of medals athlete won
SELECT
athlete,
country,
COUNT(*) AS medal_count
FROM medals.summer_medals
WHERE country IN ('QAT', 'PHI')
GROUP BY athlete, country
) AS medal_counts
ORDER BY country, medal_count DESC
Calculating a moving average
Up until now, ROW_NUMBER()
was used to assign numbers to rows. Window functions can also be used to perform aggregations over rows. Let's begin by calculating a moving average. Moving averages take the average value over a previous number of rows and can help eliminate noise from data and identify trends.
To calculate the moving average, we again use OVER()
, but this time in combination with the AVG()
function. By default, this will calculate the average over all previous rows.
You can find more information about moving averages here.
SELECT
year,
medal_count,
-- Calculate the moving average
AVG(medal_count) OVER(ORDER BY year) as average_medal_count
FROM
(
-- Fetch the year and number of medals won
SELECT
year, COUNT(*) AS medal_count
FROM medals.summer_medals
-- Narrow the results to Canada
WHERE country = 'CAN'
GROUP BY year
) AS canadian_medals
Adjusting the range of window functions
Frames
are used to define the range of rows for a windows function. They are specified within OVER()
. The range can be defined by using ROWS BETWEEN [start] and [finish]
. The start and finish can be one of the three clauses - PRECEDING
, CURRENT ROW
, and FOLLOWING
.
For example:
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW
will define a frame that starts two rows before the current row and ends at the current row.ROWS BETWEEN 1 PRECEDING AND 3 FOLLOWING
will define a frame that starts one row before the current row and ends three rows after the current row.ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
will define a frame that starts from the first row and ends at the current row.
The following query calculates the moving average for the Canadian team's medal count for the past three summer Olympics. You can find more information about frames here.
SELECT
year,
medal_count,
-- Calculate moving average over 3 rows
AVG(medal_count) OVER(ORDER BY year ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) as last_three_average_medal_count
FROM
(
-- Fetch the year and number of medals won
SELECT
year,
COUNT(*) AS medal_count
FROM medals.summer_medals
-- Narrow the results to Canada
WHERE country = 'CAN'
GROUP BY year
) AS canadian_medals
Plotting the moving average
The plot below shows how a moving average removes noise by plotting both the original column and the moving average.
# Import libraries and set plotting style
import seaborn as sns
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
plt.rcParams["figure.figsize"]=(12, 6)
# Create the line plots
sns.lineplot(x='year', y='medal_count', data=moving_average)
sns.lineplot(x='year', y='last_three_average_medal_count', data=moving_average)
# Add labels and show the plot
plt.title('Medals Won by Canada', fontsize=20)
plt.xlabel('Date', fontsize=15)
plt.ylabel('Number of Medals', fontsize=15)
plt.xticks(rotation ='45')
plt.legend(labels=["Average Medal Count","Medal Count"])
plt.show()
Calculating a running total
A running total calculates the total sum of a column over the preceding values and the current value. It is performed in much the same way as a moving average, except that SUM()
is used in place of AVG()
.
The following query calculates the total number of medals won by Canada over time.
SELECT
year,
medal_count,
-- Calculate a running total
SUM(medal_count) OVER(ORDER BY year) as total_medal_count
FROM
(
-- Fetch the year and number of medals won
SELECT
year,
COUNT(*) AS medal_count
FROM medals.summer_medals
-- Narrow the results to Canada
WHERE country = 'CAN'
GROUP BY year
) AS canadian_medals