Customizing Aggregations With Window Frames

SQL
5 min read

Window Frames

Dolt is a database that speaks the MySQL wire protocol but uses a custom SQL Engine to implement relational logic. Our engine, go-mysql-server, is increasingly inching towards 100% drop-in MySQL compatibility, and today we are happy introduce the latest feature: window aggregations and frames.

Window frames are a more recent SQL syntax addition used heavily in analytical processing (OLAP) queries. MySQL and Postgres added full support in 2018, but some databases like SQLite and SQL Server still lack support for RANGE frames (described below).

Zach added initial windowing support almost one year ago, including window functions like first_value and row_number. The most recent release expands our ability to execute aggregation functions, like sum, avg, and count, inside windows. The new release also includes window frames, a way to customize the input set for aggregation functions. Window frames include ROWS and RANGE units, and UNBOUNDED, CURRENT ROW, and expression PRECEDING and FOLLOWING bounds. Refer to the MySQL docs for more info on window frame syntax.

In general, you might be interested in advanced windowing features if you run data analytics workflows. This blog will focus on practical examples of how to use these new features. We encourage you to install dolt yourself, take windows for a test ride, and let us know what you think!

Tutorial: Using Window Frames in Dolt

Group By

First, let’s create a parts table with id, lbs, and cnt fields:

mysql> create table parts (id varchar(8) primary key, lbs int, cnt int);
mysql> insert into parts values ('a', 10, 3), ('b', 10, 6), ('c', 10, 2),
                                ('d', 10, 0), ('e', 10, 12), ('f', 10, 7),
                                ('g', 20, 1), ('h', 20, 2), ('i', 20, 3);
Query OK, 8 rows affected

We will look at the simplest aggregation, GROUP BY, before building towards more complicated window functions. A GROUP BY partitions a dataset before rolling up a rows with an aggregation function, like SUM:

mysql> SELECT lbs, SUM(cnt) FROM parts GROUP BY lbs;
+-----+----------+
| lbs | SUM(cnt) |
+-----+----------+
| 10  | 30       |
| 20  | 6        |
+-----+----------+

The first thing we notice is that a GROUP BY aggregation outputs a a single row for each partition. Two parts.lbs partitions yields two output rows. The diagram below demonstrates this split and collapse:

Group By Partitions

Partitioning on the lbs column divides our table into two sets of rows, lbs = 10 and lbs = 20. A GROUP BY passes each of those partitions to the aggregation function, SUM, outputting two result rows. We have included purple "frames" that overlap with partition boundaries.

Window Over Partitions

Window syntax is similar to GROUP BY, but introduces two key features:

  1. Windows yield an output row for every input row.

  2. Windows use frames to vary their scope of computation, whereas GROUP BY is always scoped to an entire partition.

To make the differences concrete, we will first look at a simplified window that uses the same frame as GROUP BY:

mysql> SELECT SUM(cnt)
   OVER (
       PARTITION BY lbs
       ROWS BETWEEN
           UNBOUNDED PRECEDING AND
           UNBOUNDED FOLLOWING
    ) as `sum(cnt)` FROM parts;
+----------+
| sum(cnt) |
+----------+
|       30 |
|       30 |
|       30 |
|       30 |
|       30 |
|       30 |
|        6 |
|        6 |
|        6 |
+----------+

The output values are similar to GROUP BY, but the sums are duplicated for every input row in a partition. Again, we specified this using window frames, a way to customize an aggregation function's input set. Here we used a frame definition that mimics GROUP BY: the beginning of the partition (Unbounded Preceding) to the end of the partition (Unbounded Following). The diagram below uses the purple frames to highlight the duplicated frames.

Window Frame Unbounded

Window Rows Framing

Next, we will execute a query with a more interesting window frame: the sum of every two trailing rows:

mysql> SELECT SUM(cnt)
    OVER (
        PARTITION BY lbs
        ROWS BETWEEN
            2 PRECEDING AND
            1 PRECEDING
    ) as `sum(cnt)` FROM parts;
+----------+
| sum(cnt) |
+----------+
|     NULL |
|        3 |
|        9 |
|        8 |
|        2 |
|       12 |
|     NULL |
|        1 |
|        3 |
+----------+

The first row in each partition lacks trailing rows, so the first frame is empty, and the first sum is NULL. There are two partitions in our table, so the beginning of the second partition is NULL for the same reason. The second frames are also special, overrunning the partition boundary and yielding 1 row frames. The second row's output is the first row's cnt. The remaining rows in each partition yield the sum of the two previous rows.

Window Frame Rows

The diagram above details a frame for every input row, as in the previous diagrams. The frames now use our more complicated frame boundary: two rows back (2 Preceding) to one row back (1 Preceding).

Window Range Framing

So far we have focused on ROWS framed windows. A row frame uses index counts to select and move frame bounds. In this final example, we will consider RANGE frames.

RANGE frames use values to select window bounds. For example, a RANGE frame could match all rows within 5% of of the current row, or all rows within 5 units of the current row. A window's ORDER BY expression serves as a reference point for ranking frame peer values.

To make this explicit, we will make a second table, grades, with a date column to RANGE over:

mysql> create table grades (student varchar(8), subject varchar(8), grade float, date date, primary key (student, subject, date));
mysql> insert into grades values ('max', 'English', 2, '2021-01-21'), ('max', 'English', 6, '2021-02-16'),
                                 ('max', 'English', 5, '2021-03-17'), ('max', 'English', 4, '2021-04-20'),
                                 ('max', 'English', 8, '2021-05-17'), ('max', 'English', 8, '2021-06-14'),
                                 ('max', 'English', 9, '2021-06-23'), ('max', 'English', 10, '2021-06-30');

Now we take a 3 month moving average of grades:

> SELECT avg(grade)
    OVER (
        PARTITION BY student, subject
        ORDER BY date
        RANGE BETWEEN
            INTERVAL '1' MONTH PRECEDING and
            INTERVAL '1' MONTH FOLLOWING
    ) as `avg(grade)` FROM  grades;
+-------------------+
| avg(grade)        |
+-------------------+
|                 4 |
|                 4 |
|                 5 |
|                 6 |
| 6.666666666666667 |
|              8.75 |
|                 9 |
|                 9 |
+-------------------4

Window Frame Range

The diagram above is similar to the others, considering a frame represents a row mapping between an input row and its output aggregation set. The difference is that the width of RANGE frames vary. A ROWS frame will always be a fixed length (unless overrunning the start or end of a partition).

The date column in the ORDER BY expression is chosen as the reference point for computing frames. The INTERVAL '1' MONTH bounds mean that a RANGE frame here matches one or more rows within one month of the current row's date. For example, the exam on 2021-03-17 only matches itself, because there are no other exams between 2021-02-17 and 2021-04-17. The exam on 2021-06-14 matches 4 rows, on the other hand, every exam in June and one exam in May following the interval start, 2021-05-14.

Summary

The latest Dolt release supports window aggregations and customized window framing, a step towards greater MySQL compatibility and stronger OLAP support.

Applying window frames in practice is a bit trickier than in theory, but if you are new to SQL windows hopefully we have provided some starters for how to apply them in your own work.

There are also several improvements we look forward to delivering. We would like to add more functions, like quantiles. There are parser features, like named windows, that we would like to add. And there are performance optimizations that can improve query latency, like sliding window aggregations to reduce memory footprint, partition level sorting to minimize unnecessary compute, and inter/intra partition parallelism to take advantage of multithreading.

Give window frames a try! Let us know if you find bugs or are interested in new features! And if you would like to contribute to our SQL Engine yourself, many of the window extensions mentioned above are good starter tasks.

Feel free to reach out on Twitter, Discord, and GitHub if you have any questions.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.