Diffing Queries in Dolt

8 min read

Dolt is a SQL database built to wrangle datasets. Its tables are versioned, queryable, and shareable. We've recreated Git's functionality in a relational database so you can collaborate on data in the same ways you collaborate on code. One of Dolt's most powerful features is diff; dolt diff works on tables in the same way that git diff works on files. To make diff even more useful, we extended the concept to diffing queries.

#The Basics

To see how it works, we're going to use the National Vulnerability Database (NVD) repo, which is updated hourly. The most central table in this dataset is CVE which lists the security bugs, their severity and their descriptions. We'll need a commit to diff against, so let's query the dolt_log table to find something from about a week ago.

dolt sql -q 'select commit_hash from dolt_log where date < (now() - interval 1 week) limit 1'
+----------------------------------+
| commit_hash                      |
+----------------------------------+
| p30hoseurm9qfl7jhb9t4l7jfnshk6v9 |
+----------------------------------+

A standard diff looks like dolt diff p30hoseurm9qfl7jhb9t4l7jfnshk6v9 CVE and returns all of the rows from the table CVE that differ between HEAD commit and p30hoseurm9qfl7jhb9t4l7jfnshk6v9:

diff --dolt a/CVE b/CVE
--- a/CVE @ sg1tqntkv6n87akbksnlvalk8efvoot4
+++ b/CVE @ r41c452cegprgetpeht4oksgq0amcj31
+-----+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+----------------------+--------------------+---------------------+---------------------+
|     | cve_id           | description                                                                                                                                                                                                                                              | problem_type     | exploitability_score | impact_score       | date_published      | date_last_modified  |
+-----+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+----------------------+--------------------+---------------------+---------------------+
|  <  | CVE-2005-1513    | Integer overflow in the stralloc_readyplus function in qmail, when running on 64 bit platforms with a large amount of virtual memory, allows remote attackers to cause a denial of service and possibly execute arbitrary code via a large SMTP request. | NVD-CWE-Other    | 10                   | 2.9000000953674316 | 2005-05-11 04:00:00 | 2020-06-17 00:15:00 |
|  >  | CVE-2005-1513    | Integer overflow in the stralloc_readyplus function in qmail, when running on 64 bit platforms with a large amount of virtual memory, allows remote attackers to cause a denial of service and possibly execute arbitrary code via a large SMTP request. | NVD-CWE-Other    | 10                   | 2.9000000953674316 | 2005-05-11 04:00:00 | 2020-06-23 23:15:00 |
...
|  <  | CVE-2020-9844    | A double free issue was addressed with improved memory management. This issue is fixed in iOS 13.5 and iPadOS 13.5, macOS Catalina 10.15.5. A remote attacker may be able to cause unexpected system termination or corrupt kernel memory.               | CWE-415          | 10                   | 6.900000095367432  | 2020-06-09 17:15:00 | 2020-06-09 23:41:00 |
|  >  | CVE-2020-9844    | A double free issue was addressed with improved memory management. This issue is fixed in iOS 13.5 and iPadOS 13.5, macOS Catalina 10.15.5. A remote attacker may be able to cause unexpected system termination or corrupt kernel memory.               | CWE-415          | 10                   | 6.900000095367432  | 2020-06-09 17:15:00 | 2020-06-25 23:15:00 |
+-----+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+----------------------+--------------------+---------------------+---------------------+

The equivalent query diff would be dolt diff -q 'select * from cve', and gives the same set of results.

+-----+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+----------------------+--------------------+---------------------+---------------------+
|     | cve_id           | description                                                                                                                                                                                                                                              | problem_type     | exploitability_score | impact_score       | date_published      | date_last_modified  |
+-----+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+----------------------+--------------------+---------------------+---------------------+
|  <  | CVE-2005-1513    | Integer overflow in the stralloc_readyplus function in qmail, when running on 64 bit platforms with a large amount of virtual memory, allows remote attackers to cause a denial of service and possibly execute arbitrary code via a large SMTP request. | NVD-CWE-Other    | 10                   | 2.9000000953674316 | 2005-05-11 04:00:00 | 2020-06-17 00:15:00 |
|  >  | CVE-2005-1513    | Integer overflow in the stralloc_readyplus function in qmail, when running on 64 bit platforms with a large amount of virtual memory, allows remote attackers to cause a denial of service and possibly execute arbitrary code via a large SMTP request. | NVD-CWE-Other    | 10                   | 2.9000000953674316 | 2005-05-11 04:00:00 | 2020-06-23 23:15:00 |
...
|  <  | CVE-2020-9844    | A double free issue was addressed with improved memory management. This issue is fixed in iOS 13.5 and iPadOS 13.5, macOS Catalina 10.15.5. A remote attacker may be able to cause unexpected system termination or corrupt kernel memory.               | CWE-415          | 10                   | 6.900000095367432  | 2020-06-09 17:15:00 | 2020-06-09 23:41:00 |
|  >  | CVE-2020-9844    | A double free issue was addressed with improved memory management. This issue is fixed in iOS 13.5 and iPadOS 13.5, macOS Catalina 10.15.5. A remote attacker may be able to cause unexpected system termination or corrupt kernel memory.               | CWE-415          | 10                   | 6.900000095367432  | 2020-06-09 17:15:00 | 2020-06-25 23:15:00 |
+-----+------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------------+----------------------+--------------------+---------------------+---------------------+

Digging In

The advantage of diffing queries is that we can use all of the functionality of SQL to inspect how our data has changed. First let's see how the size of the dataset has changed:

% dolt diff p30hoseurm9qfl7jhb9t4l7jfnshk6v9 -q 'select count(*) from cve'
+-----+----------+
|     | COUNT(*) |
+-----+----------+
|  <  | 145913   |
|  >  | 146256   |
+-----+----------+

We can break this down by impact_score to get a sense of what portions of the data are changing:

%dolt diff p30hoseurm9qfl7jhb9t4l7jfnshk6v9 -q 'select floor(impact_score), count(*) from cve group by floor(impact_score)'
+-----+-------------------------+----------+
|     | FLOOR(CVE.impact_score) | COUNT(*) |
+-----+-------------------------+----------+
|  <  | <NULL>                  | 8468     |
|  >  | <NULL>                  | 8428     |
|  <  | 1                       | 3125     |
|  >  | 1                       | 3124     |
|  <  | 2                       | 41465    |
|  >  | 2                       | 41668    |
|  <  | 3                       | 14324    |
|  >  | 3                       | 14318    |
|  <  | 4                       | 5385     |
|  >  | 4                       | 5418     |
|  <  | 5                       | 22039    |
|  >  | 5                       | 22034    |
|  <  | 6                       | 33798    |
|  >  | 6                       | 33890    |
|  <  | 9                       | 282      |
|  >  | 9                       | 283      |
|  <  | 10                      | 16798    |
|  >  | 10                      | 16864    |
+-----+-------------------------+----------+

The biggest set of changes came from vulnerabilities with scores of 2.0 - 3.0. We can focus on this set of changes by using query diff to filter out the rest.

% dolt diff p30hoseurm9qfl7jhb9t4l7jfnshk6v9 -q 'select cve_id, impact_score from cve where impact_score >= 2.0 and impact_score < 3.0;'
+-----+----------------+--------------------+
|     | cve_id         | impact_score       |
+-----+----------------+--------------------+
|  +  | CVE-2013-6648  | 2.9000000953674316 |
|  +  | CVE-2015-7916  | 2.9000000953674316 |
...
|  +  | CVE-2020-9600  | 2.9000000953674316 |
|  +  | CVE-2020-9601  | 2.9000000953674316 |
+-----+----------------+--------------------+

Diffing Views

SQL views in Dolt are implemented as subqueries. The ability to diff queries means we can diff views just like we would diff a regular table. The NVD repo contains a view Chrome_CVEs that contains all CVEs that mention Google Chrome. Using a commit created on May 29th, we can see Chrome CVEs that were created or modified in the last month:

% dolt diff true4a15mifapnqoui86p0tac65mtott -q 'select * from chrome_cves'
+-----+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------------+--------------------+---------------------+---------------------+
|     | cve_id        | description                                                                                                                                                                                                                                                                                                                                  | problem_type   | exploitability_score | impact_score       | date_published      | date_last_modified  |
+-----+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------------+--------------------+---------------------+---------------------+
|  <  | CVE-2010-4008 | libxml2 before 2.7.8, as used in Google Chrome before 7.0.517.44, Apple Safari 5.0.2 and earlier, and other products, reads from invalid memory locations during processing of malformed XPath expressions, which allows context-dependent attackers to cause a denial of service (application crash) via a crafted XML document.            | CWE-119        | 8.600000381469727    | 2.9000000953674316 | 2010-11-17 01:00:00 | 2017-09-19 01:31:00 |
|  >  | CVE-2010-4008 | libxml2 before 2.7.8, as used in Google Chrome before 7.0.517.44, Apple Safari 5.0.2 and earlier, and other products, reads from invalid memory locations during processing of malformed XPath expressions, which allows context-dependent attackers to cause a denial of service (application crash) via a crafted XML document.            | CWE-119        | 8.600000381469727    | 2.9000000953674316 | 2010-11-17 01:00:00 | 2020-06-04 20:31:00 |
|  <  | CVE-2011-0776 | The sandbox implementation in Google Chrome before 9.0.597.84 on Mac OS X might allow remote attackers to obtain potentially sensitive information about local files via vectors related to the stat system call.                                                                                                                            | CWE-200        | 10                   | 2.9000000953674316 | 2011-02-04 18:00:00 | 2017-09-19 01:32:00 |
|  >  | CVE-2011-0776 | The sandbox implementation in Google Chrome before 9.0.597.84 on Mac OS X might allow remote attackers to obtain potentially sensitive information about local files via vectors related to the stat system call.                                                                                                                            | CWE-200        | 10                   | 2.9000000953674316 | 2011-02-04 18:00:00 | 2020-06-04 20:33:00 |
|  <  | CVE-2011-0777 | Use-after-free vulnerability in Google Chrome before 9.0.597.84 allows remote attackers to cause a denial of service or possibly have unspecified other impact via vectors related to image loading.                                                                                                                                         | CWE-399        | 10                   | 10                 | 2011-02-04 18:00:00 | 2017-09-19 01:32:00 |
|  >  | CVE-2011-0777 | Use-after-free vulnerability in Google Chrome before 9.0.597.84 allows remote attackers to cause a denial of service or possibly have unspecified other impact via vectors related to image loading.                                                                                                                                         | CWE-416        | 10                   | 6.400000095367432  | 2011-02-04 18:00:00 | 2020-06-04 20:41:00 |
|  <  | CVE-2011-0779 | Google Chrome before 9.0.597.84 does not properly handle a missing key in an extension, which allows remote attackers to cause a denial of service (application crash) via a crafted extension.                                                                                                                                              | CWE-20         | 10                   | 2.9000000953674316 | 2011-02-04 18:00:00 | 2017-09-19 01:32:00 |
|  >  | CVE-2011-0779 | Google Chrome before 9.0.597.84 does not properly handle a missing key in an extension, which allows remote attackers to cause a denial of service (application crash) via a crafted extension.                                                                                                                                              | CWE-20         | 10                   | 2.9000000953674316 | 2011-02-04 18:00:00 | 2020-06-04 20:43:00 |
...
|  +  | CVE-2020-6498 | Incorrect implementation in user interface in Google Chrome on iOS prior to 83.0.4103.88 allowed a remote attacker to perform domain spoofing via a crafted HTML page.                                                                                                                                                                       | CWE-276        | 8.600000381469727    | 2.9000000953674316 | 2020-06-03 23:15:00 | 2020-06-04 17:57:00 |
|  +  | CVE-2020-6499 | Inappropriate implementation in AppCache in Google Chrome prior to 80.0.3987.87 allowed a remote attacker to bypass AppCache security restrictions via a crafted HTML page.                                                                                                                                                                  | NVD-CWE-noinfo | 8.600000381469727    | 2.9000000953674316 | 2020-06-03 23:15:00 | 2020-06-04 16:30:00 |
|  +  | CVE-2020-6500 | Inappropriate implementation in interstitials in Google Chrome prior to 80.0.3987.87 allowed a remote attacker to spoof the contents of the Omnibox (URL bar) via a crafted HTML page.                                                                                                                                                       | NVD-CWE-noinfo | 8.600000381469727    | 2.9000000953674316 | 2020-06-03 23:15:00 | 2020-06-04 16:27:00 |
|  +  | CVE-2020-6501 | Insufficient policy enforcement in CSP in Google Chrome prior to 80.0.3987.87 allowed a remote attacker to bypass content security policy via a crafted HTML page.                                                                                                                                                                           | CWE-276        | 8.600000381469727    | 2.9000000953674316 | 2020-06-03 23:15:00 | 2020-06-04 16:33:00 |
|  +  | CVE-2020-6502 | Incorrect implementation in permissions in Google Chrome prior to 80.0.3987.87 allowed a remote attacker to spoof security UI via a crafted HTML page.                                                                                                                                                                                       | CWE-276        | 8.600000381469727    | 2.9000000953674316 | 2020-06-03 23:15:00 | 2020-06-04 16:35:00 |
|  +  | CVE-2020-6503 | Inappropriate implementation in accessibility in Google Chrome prior to 74.0.3729.108 allowed a remote attacker to obtain potentially sensitive information from process memory via a crafted HTML page.                                                                                                                                     | CWE-200        | 8.600000381469727    | 2.9000000953674316 | 2020-06-03 23:15:00 | 2020-06-09 12:08:00 |
|  +  | CVE-2020-6504 | Insufficient policy enforcement in notifications in Google Chrome prior to 74.0.3729.108 allowed a remote attacker to bypass notification restrictions via a crafted HTML page.                                                                                                                                                              | CWE-276        | 8.600000381469727    | 2.9000000953674316 | 2020-06-03 23:15:00 | 2020-06-04 18:19:00 |
+-----+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+----------------------+--------------------+---------------------+---------------------+

Conclusion

Dolt query diff is a powerful way to inspect datasets and understand your data. Extending diff logic beyond tables brings us closer to merging the functionality of Git and MySql. We're always looking for ways to improve Dolt and make it a better tool for managing data.
If you have an idea you'd like us to create, let us know!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt