Filter-Branch in Dolt

FEATURE RELEASE
6 min read

Dolt is a tool built for collaboration and data distribution, a SQL database you can branch, merge, diff, clone, fork, push and pull. Today, we're announcing support for filter-branch in Dolt.

"Customer focus" is a mantra for our company. In August, we launched a Discord server to be more available to our users. It allows us to be more responsive to the Dolt community and get feedback on what their needs are. Recently, a user requested a method to delete sensitive data from a repository. Deleting rows or dropping tables isn't sufficient, because the sensitive data will still be reachable from previous commits. What we need is a way to rewrite history and excise the data from all the commits. The canonical way to do this in Git is filter-branch. It provides a way to replay the commit history, applying alterations or "filters" to each revision.

"Git for Data"

Open data projects are few and far between; we're hoping that Dolt can provide the infrastructure to change that. Git helped spark an era of Open Source Software collaboration, and we modeled Dolt after it for that reason. It's a concrete interface for us to replicate, and it allows our users to transfer their knowledge from Git to Dolt. Even for those familiar with Git, filter-branch is a more obscure command. Git's own implementation of filter-branch comes with a huge disclaimer.

Create commit message

They even recommend a third-party tool filter-repo over their own implementation. Much of the difficulty in git filter-branch comes from the complexity of manipulating an entire repo at once. Where Git struggles, Dolt shines. The power of a relational database is in its strongly-structured schema; it's tailor-made for these manipulations. Queries can tractably be applied to each commit, and Dolt's branching model can be used to safely backup your data during the filter-branch operation.

Rewriting History

Let's look at an example of removing a table of sensitive data from the repository. In our repository we have one branch and two tables:

dolt> SELECT * FROM dolt_branches;
+--------+----------------------------------+------------------+------------------------+-----------------------------------+-----------------------+
| name   | hash                             | latest_committer | latest_committer_email | latest_commit_date                | latest_commit_message |
+--------+----------------------------------+------------------+------------------------+-----------------------------------+-----------------------+
| master | amk8rdipcrst8lb8qfkt9o3se3jdf9lm | Andy Arthur      | andy@dolthub.com       | 2020-11-25 17:35:30.392 -0800 PST | f                     |
+--------+----------------------------------+------------------+------------------------+-----------------------------------+-----------------------+
dolt> show tables;
+------------------+
| Table            |
+------------------+
| credit_card_info |
| users            |
+------------------+

We've decided we want to remove this financial information from our repository. We'll need to edit the history, so let's inspect what it looks like with the dolt_commits system table:

dolt> SELECT cm.message, cm.commit_hash, anc.parent_hash
    FROM dolt_commits AS cm
    JOIN dolt_commit_ancestors AS anc
    ON cm.commit_hash = anc.commit_hash
    ORDER BY cm.date;
+----------------------------+----------------------------------+----------------------------------+
| message                    | commit_hash                      | parent_hash                      |
+----------------------------+----------------------------------+----------------------------------+
| Initialize data repository | jqgd1gd04tmi9jei4nb10esorktlpo99 | NULL                             |
| a                          | 4t4qukc2i5aa7b0vv6hnl2p1c8u5b9fp | jqgd1gd04tmi9jei4nb10esorktlpo99 |
| b                          | omlbloin2jhu8kc5a0ua4ej72n8oimjl | 4t4qukc2i5aa7b0vv6hnl2p1c8u5b9fp |
| c                          | mmcqrpl27hg4ht0rj4nv5f838ni0jvet | 4t4qukc2i5aa7b0vv6hnl2p1c8u5b9fp |
| d                          | 9k3b4i3i5jfhqvneuvccfde6hu7eku36 | mmcqrpl27hg4ht0rj4nv5f838ni0jvet |
| e                          | cn264o484uvd8ssj34c6ha8cnhgumao4 | omlbloin2jhu8kc5a0ua4ej72n8oimjl |
| e                          | cn264o484uvd8ssj34c6ha8cnhgumao4 | 9k3b4i3i5jfhqvneuvccfde6hu7eku36 |
| f                          | amk8rdipcrst8lb8qfkt9o3se3jdf9lm | cn264o484uvd8ssj34c6ha8cnhgumao4 |
+----------------------------+----------------------------------+----------------------------------+

Let's first create a backup branch, and then we can remove the credit_card_info table.

dolt branch backup HEAD
dolt filter-branch "DROP TABLE credit_card_info;"

filter-branch starts at a tip of the current branch, follows the commits back to the beginning of the history, and then replays each commit, modifying it with the provided query. The replay process forks the history of the current branch, so our backup branch still holds a reference to the unmodified version of the data until we delete it:

Before:
               c --- d         -- backup
              /       \      /
        0 -- a -- b -- e -- f -- master

After:
               c' --- d'
              /         \
        0 -- a' -- b' -- e' -- f' -- master'
         \
           -- a -- b -- e -- f -- backup
               \       /
                c --- d

Once we're satisfied with the results, we can complete the removal by deleting our backup branch and running dolt gc. This will garbage collect all unreferenced data at the storage layer and delete it permanently.

SQL and Filter-Branch

While deleting data is a central use-case for filter-branch, it's capable of much more. The power and simplicity of SQL means filter-branch is much more powerful in Dolt than it can be in Git. As an example we're going to modify the NOAA repository, a database of weather data sourced from the National Oceanic and Atmospheric Administration. This repo has a unique structure: it uses the commit history to model time series data. Each commit contains a day's worth of hourly measurements; its parent commit is the previous day and its child commit is the subsequent day. As a result, each commit's tables are relatively small and fast to query. The entirety of the data is still accessible at once through the dolt_history tables.

Using filter-branch we can modify and add to this data while maintaining its commit-per-day structure. If we look closer at the measurement tables, we can see that each row contains summary statistics about the measurements and metadata about the number of measurements taken at that station on that day.

noaa> select station, data_points from air_temp limit 10;
select station, `avg`, `min`, `max`, median, data_points from air_temp limit 10 offset 100;
+-------------+--------------------+------+------+--------------------+-------------+
| station     | avg                | min  | max  | median             | data_points |
+-------------+--------------------+------+------+--------------------+-------------+
| 95288099999 | 27.549999999999997 | 24.9 | 28.4 | 27.799999999999997 | 8           |
| 95292099999 | 23.857142857142858 | 22.7 | 26.9 | 23.5               | 7           |
| 95293099999 | 22.014285714285712 | 19.1 | 26.4 | 20.4               | 7           |
| 95295099999 | 24.814285714285713 | 19.7 | 29.2 | 25.2               | 7           |
| 95296099999 | 25.82857142857143  | 21.3 | 29   | 26.3               | 7           |
| 95303099999 | 30.428571428571427 | 28.8 | 33.1 | 30.5               | 7           |
| 95304099999 | 30.373333333333335 | 27.9 | 34.2 | 29.7               | 15          |
| 95305099999 | 32.333333333333336 | 28   | 38.9 | 30.4               | 15          |
| 95307099999 | 33.199999999999996 | 29.3 | 38   | 32.4               | 15          |
| 95317099999 | 36.17333333333333  | 28   | 42   | 35.7               | 15          |
+-------------+--------------------+------+------+--------------------+-------------+

Let's create some summary statistics about the number of measurements taken at each station to get a better idea of how reliably each station is reporting data. First we'll create a table to hold the data, then for each data table we're interested in, we'll insert a row. filter-branch currently runs one query at a time, so we'll write a little bash loop for this.

dolt filter-branch "CREATE TABLE stability (measurement varchar(20) primary key, mean float, total bigint);"
for tbl in  air_temp dew_point_temp sea_level_pressure sky_ceiling_height visibility wind_speed
do
	dolt filter-branch "
        INSERT INTO stability (measurement, mean, total)
        SELECT '$tbl', avg(data_points), sum(data_points)
        FROM $tbl;"
done

When filter-branch is done rewriting the history we can inspect the results:

noaa> select * from dolt_history_stability where measurement = 'air_temp' limit 10;
+-------------+-----------+--------+----------------------------------+-----------+-------------------------------+
| measurement | mean      | total  | commit_hash                      | committer | commit_date                   |
+-------------+-----------+--------+----------------------------------+-----------+-------------------------------+
| air_temp    | 16.118803 | 195376 | vj54m78qqj4th9lob37frkudv6sfvqs0 | bheni     | 2020-03-18 00:00:00 +0000 UTC |
| air_temp    | 16.83494  | 206127 | g0c1p251k8rs70filueqh9iggu473m1s | bheni     | 2020-03-17 00:00:00 +0000 UTC |
| air_temp    | 17.125208 | 206530 | o5ng2cdsn01tr4vp1lro4r95sktgoibe | bheni     | 2020-03-15 00:00:00 +0000 UTC |
| air_temp    | 17.01406  | 208150 | 40ert8l7k5abl08dvalb8089suacfecs | bheni     | 2020-03-16 00:00:00 +0000 UTC |
| air_temp    | 14.678022 | 175784 | nt5o0ess55kt3k31sqgid8vntr5usngt | bheni     | 2020-03-14 00:00:00 +0000 UTC |
| air_temp    | 15.271547 | 185870 | rmdt1t8p0v692o56n3l9qfsv4am0j1e9 | bheni     | 2020-03-13 00:00:00 +0000 UTC |
| air_temp    | 16.954575 | 206778 | hmrbqgi0pqee2vs3i5fcomami2b7ob0v | bheni     | 2020-03-12 00:00:00 +0000 UTC |
| air_temp    | 15.727369 | 192503 | k0hu7nrann50c1uu6kc473h6qm0p01s9 | bheni     | 2020-03-11 00:00:00 +0000 UTC |
| air_temp    | 16.983017 | 207991 | 858s2fh102upoi92r3plpu9npfv20bkm | bheni     | 2020-03-10 00:00:00 +0000 UTC |
| air_temp    | 16.225971 | 196172 | c5smoofb4398c8j5ks2gdatr5k1b0v9g | bheni     | 2020-03-09 00:00:00 +0000 UTC |
+-------------+-----------+--------+----------------------------------+-----------+-------------------------------+

Conclusion

filter-branch is just the latest in a line of cool features that we're adding to Dolt. We envisioned Dolt as the cross-product of Git and MySQL. The result might end up being something more powerful than either one. Our goal is to put the data you're searching for at your fingertips and make it effortless to work with. If you've got an idea for what you'd like to see in Dolt, let us know!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.