Tracking Data Changes with Dolt Blame

8 min read

Ever look at some data and wonder where a particular value came from, how long it's been there, or what the reason for changing it was? This is important information, but current data storage formats don't track or expose it—certainly not in a convenient, programmable manner.

In Dolt, as in Git, changes to the data are tracked as commits. Each commit has a timestamp, an author, and a description of the changes. Because every value in the data originated in some commit or another, all of the big questions above are easily answerable.

Enter dolt blame. It is named after, and works similarly to, git blame, but works on table rows instead of text lines (a cellwise version is on the roadmap). Give it a revision and a table name and it will output a pretty-printed table showing the timestamp, author, description, and hash of the commit that most recently modified the row identified by each primary key.

Given a table lunch-places with a primary key name and some commits:

$ dolt blame HEAD lunch-places
+--------------------+----------------------------------------------------+-----------------+------------------------------+----------------------------------+
| NAME               | COMMIT MSG                                         | AUTHOR          | TIME                         | COMMIT                           |
+--------------------+----------------------------------------------------+-----------------+------------------------------+----------------------------------+
| Wendy's            | added visitors table and removed it from lunch-pl… | Tim Sehn        | Fri Mar 22 12:09:16 PDT 2019 | gjio4j5hk58ce1mqpavg65cot43j3bmu |
| McDonald's         | added visitors table and removed it from lunch-pl… | Tim Sehn        | Fri Mar 22 12:09:16 PDT 2019 | gjio4j5hk58ce1mqpavg65cot43j3bmu |
| Sake House         | fixed ratings                                      | bheni           | Thu Apr  4 14:07:36 PDT 2019 | rqpd7ga1nic3jmc54h44qa05i8124vsp |
| Seasalt Fish Grill | fixed ratings                                      | bheni           | Thu Apr  4 14:07:36 PDT 2019 | rqpd7ga1nic3jmc54h44qa05i8124vsp |
| Sidecar            | added visitors table and removed it from lunch-pl… | Tim Sehn        | Fri Mar 22 12:09:16 PDT 2019 | gjio4j5hk58ce1mqpavg65cot43j3bmu |
| Chipotle           | lunch-places: Added Chipotle                       | katie mcculloch | Thu Aug 29 11:38:00 PDT 2019 | m2jbro89ou8g6rv71rs7q9f3jsmjuk1d |
| Espresso Cielo     | added Espresso Cielo                               | Matt Jesuele    | Wed Jul 10 12:20:39 PDT 2019 | 314hls5ncucpol2qfdphf923s21luk16 |
| Art's Table        | added visitors table and removed it from lunch-pl… | Tim Sehn        | Fri Mar 22 12:09:16 PDT 2019 | gjio4j5hk58ce1mqpavg65cot43j3bmu |
| Boa                | added visitors table and removed it from lunch-pl… | Tim Sehn        | Fri Mar 22 12:09:16 PDT 2019 | gjio4j5hk58ce1mqpavg65cot43j3bmu |
| Tocaya             | update tocaya rating                               | bheni           | Thu Jun  6 17:22:24 PDT 2019 | qi331vjgoavqpi5am334cji1gmhlkdv5 |
| Sunnin             | change rating                                      | bheni           | Thu Apr  4 15:43:00 PDT 2019 | 137qgvrsve1u458briekqar5f7iiqq2j |
| Swingers           | added visitors table and removed it from lunch-pl… | Tim Sehn        | Fri Mar 22 12:09:16 PDT 2019 | gjio4j5hk58ce1mqpavg65cot43j3bmu |
+--------------------+----------------------------------------------------+-----------------+------------------------------+----------------------------------+

A very classy assortment of Santa Monica lunch spots (RIP Swingers Diner)

The rest of this article will discuss some of the choices and technical challenges faced in implementing the initial version of dolt blame.

What counts as a change?

In designing dolt blame, one central question we must answer is what constitutes a change to a row. Clearly, if a value in the row differs between a commit and its parent, we'd say the commit modified the row.

But what if the table schema changed between commits? If a commit consisted of adding a nullable column without a default value, should that count as a change to every row even though no non-null values originated from or were modified by that commit?

Similarly, if there was a commit that simply dropped a column, do we want to count that as a change to every row, even though none of the current data originated from or was modified by that commit?

We decided to shelve these considerations for now and simply count schema changes as changes to every row, but we plan to support more blame strategies in the future.

How should we order the commit history?

Another question in designing blame is how to traverse the commit history. Dolt, like Git, stores its commit history as a directed acyclic graph of commit nodes.

It seems obvious enough that we want to walk the graph backwards from the commit passed to blame—after all, we're looking for the most recent change—but what do we do when we encounter branches in the graph, i.e. commits with more than one parent? We need to decide on a rule for ordering commits that takes branches into account.

What we need is a topological ordering. According to Wikipedia, a topological ordering of a directed graph is:

a linear ordering of its vertices such that for every directed edge uv from vertex u to vertex v, u comes before v in the ordering

For a Dolt commit graph, this means an ordering of commits such that if commit abcdef is a parent of commit 12345, abcdef must come before 12345 in the ordering. But since we're walking the commit graph backwards for blame, we actually want a reverse topological ordering: commit 12345 must come before its parent abcdef.

As it turns out, we had some code for producing double-dot logs which does exactly this, with tiebreaking done based on the height of the commit in the graph (higher nodes appearing first). This was almost what we needed, but the ordering of nodes at the same height was non-deterministic as written. We addressed this by adding a second tiebreaker based on timestamp, with more recent nodes appearing first.

Now, with our commit ordering established and a function for computing it implemented, we have the data we need to compute blame.

How do we deal with merges?

The blamed commit for a row is defined as the first commit in the ordered list of commits where the row in question differs from in that commit's first parent.

Why only the first parent? Imagine you have a long-running branch for some unit of work—in Git they are often called feature branches—and over time it gets out of sync with the main branch, usually called master.

Imagine that master now has a whole bunch of new data that doesn't exist in the feature branch. Eventually, the feature branch gets merged into master, creating a merge commit. The merge commit has two parents: the first from master and the second from the feature branch.

Let's make this concrete with an example:

                    C  E
some-feature     /--O--O
master ----O----O
           A    B

In the diagram above, some-feature branches off of master at commit B. Some work is done and committed to some-feature as commits C and E.

While this is happening, a bunch of completely independent data is added to master in commit D:

                    C  E
some-feature     /--O--O
master ----O----O-----O
           A    B     D

Finally, some-feature is merged back into master, creating F, a merge commit with parents D and E:

                    C  E
some-feature     /--O--O--\
master ----O----O-----O----O
           A    B     D    F

If we consider non-first-parents when computing blame, F will be blamed for the data added to master by D, since that data did not exist in its second parent E. This is not helpful behavior.

If instead we restrict our checks to first parents, D will correctly be blamed for the data it added, since the data does not change from D to F.

Tying it together

With this decided, the logic pretty much falls out. In Go, with the low-level details abstracted away:

for _, rowPK := range rowPrimaryKeys {
  for _, commit := range orderedCommits {
    parent := firstParentOf(commit)
    if (rowChanged(commit, parent, rowPK)) {
      blameGraph.AssignBlame(commit, rowPK)
      break  // stop iterating commits
    }
  }
}

In reality, a lot more goes on under the hood, mostly dealing with internal details of how commits and tables are represented. There's also plenty of room for optimization, as a naive approach based on the above code does lots of duplicate work. You can take a look at what we came up with if you're curious.

(As an aside, I looked at git blame's source hoping for inspiration but was instead greeted with over 2500 lines of arcane C. Then I found go-git, which implements a simple version of blame in Go, the language Dolt is written in. That was very helpful, and I highly recommend taking a peek if you're interested.)

The final step before we have a releasable feature is some kind of output. In the future there will likely be other consumers of blame information, but for now we just use go-pretty to pretty-print some tables to the console. The end result looks like this:

$ dolt blame HEAD employees
+-------+------------------------------+--------------+------------------------------+----------------------------------+
| NAME  | COMMIT MSG                   | AUTHOR       | TIME                         | COMMIT                           |
+-------+------------------------------+--------------+------------------------------+----------------------------------+
| Alex  | employees: add alex and barb | Matt Jesuele | Wed Oct  9 11:29:09 PDT 2019 | 91osuaj28kpbl8fc2jahq90nv0chpps5 |
| Barb  | employees: add alex and barb | Matt Jesuele | Wed Oct  9 11:29:09 PDT 2019 | 91osuaj28kpbl8fc2jahq90nv0chpps5 |
| Fran  | employees: add fran          | Tim Sehn     | Tue Oct 15 19:12:02 PDT 2019 | t82jp9r42sump19jhibvko54muh6dqm0 |
+-------+------------------------------+--------------+------------------------------+----------------------------------+

Conclusion

We have lots of ideas about where to take blame next. Our two highest priorities are:

  1. Where-clause filtering for row blames, so you can get blame for a subset of the table's rows or even a single row
  2. Cellwise blame, so you can see which commit last modified a single cell

Beyond that, our vision is to expose blame information via plain SQL, perhaps by creating special system tables which hold blame information, or by intercepting certain queries and constructing virtual blame tables to return results from. We're still working out the details.

If you use or are considering using Dolt and you have another feature you'd like to see (for blame or in general), let us know by filing an issue on our GitHub repository. And if you haven't used Dolt yet, grab a copy and then head over to DoltHub to find some interesting public data repositories to clone. We're looking forward to hearing your feedback.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt