Common Vulnerabilities and Exposures in Dolt

DATASET
3 min read

TLDR: The NVD is a lot more useful when you can simply clone it and query it.

The National Vulnerability Database (NVD) is the authoritative source for the publication of Common Vulnerabilities and Exposures (CVE). The vulnerabilities cataloged in the NVD represent the most severe and most impactful cyber-security events. The database exists as a collaboration between the MITRE corporation and The National Institute of Standards and Technology (NIST). Within the NVD you'll find entries for famous security bugs such as DROWN, Shellshock, and Heartbleed. But the NVD also contains information about more routine security issues, like the 2120 CVEs affecting Chrome 😬.

The trouble with the NVD is that it's just not that easy to use. Surely somewhere within NIST there is a database that contains all of this data, but for everyone else there's a JSON api where you can download around twenty zip files containing portions of the dataset. Cat all these files together and you still wouldn't have anything that would answer a question like 'how many Redhat Linux bugs allowed root access?'. A lot of code has been written trying to make this dataset available and useful, but having to add more software to your toolchain to run a query seems kind of broken right?

Using this data through Dolt is an order of magnitude improvement over writing your own parsing code. Simply dolt pull to download the latest data. Then you can run a SQL query against it with dolt sql -q '...'. If you come across an interesting query result, you can save the query and distribute it along with the dataset. The NVD is constantly adding new data and updating information for old entries, you can track all of these changes with dolt diff. Within the time it took to write this blog, another CVE was added! It's a pretty cool dataset, let's take a look.

Getting the Data

As I mentioned above, when it comes to doing anything useful with the data, you're on your own.

NVD data feeds require knowledge of JSON and/or XML parsing

Sweet. Luckily our friends at Facebook maintain a Github Repo with tools to download and parse the feed. The CVE feed API follows this JSON schema, which is not all that useful without specific tools that know how to query it. Thankfully your friends at dolthub mapped it to a SQL schema so you didn't have to. After doing a little digging, it seemed like there were some not-so-useful fields in the data.

% dolt sql
NVD> SELECT DISTINCT data_format FROM CVE;
+-------------+
| data_format |
+-------------+
| MITRE       |
+-------------+
NVD> SELECT DISTINCT data_type FROM CVE;
+-----------+
| data_type |
+-----------+
| CVE       |
+-----------+
NVD> SELECT DISTINCT data_version FROM CVE;
+--------------+
| data_version |
+--------------+
| 4.0          |
+--------------+

These fields don't add any information to the CVE entries, so I deleted these columns and added this info to the docs. You can view the diff of the schema change on DoltHub. Once we're all done normalizing our data, we get the following:

% dolt sql -q 'describe CVE;'
  +---------------------+-------------+------+-----+---------+-------+
  | Field               | Type        | Null | Key | Default | Extra |
  +---------------------+-------------+------+-----+---------+-------+
  | cve_id              | VARCHAR(30) | NO   | PRI |         |       |
  | description         | LONGTEXT    | YES  |     |         |       |
  | problem_type        | LONGTEXT    | YES  |     |         |       |
  | expoitability_score | FLOAT       | YES  |     |         |       |
  | impact_score        | FLOAT       | YES  |     |         |       |
  | date_published      | DATETIME    | YES  |     |         |       |
  | date_last_modified  | DATETIME    | YES  |     |         |       |
  +---------------------+-------------+------+-----+---------+-------+

The CVE table contains the date we're really interested in. Each row lists a single CVE along with the data about the nature and severity of the problem. The rest of the tables contain links to references, products affected, and data for scoring the CVE.

So what about Redhat Linux?

nvd> select count(distinct cvss2.cve_id) from cvss2
  -> join affected_products on cvss2.cve_id = affected_products.cve_id
  -> where cvss2.obtain_all_privilege = 1
  -> and affected_products.vendor = 'redhat'
  -> and affected_products.product = 'linux';
+------------------------------+
| COUNT(DISTINCT CVSS2.cve_id) |
+------------------------------+
| 92                           |
+------------------------------+

Keeping up to date

The NVD is a constantly evolving dataset. The API feeds are updated every hour and import into this Dolt repository automatically. Security vulnerabilities are extremely time sensitive, so it's vital to have up-to-date information. I've created some saved queries in the Query Catalog to quickly access the most recently published vulnerabilities. If you clone a local copy of this repo, you can quickly get the latest updates in your shell.

% dolt pull
Everything up-to-date
% dolt sql -x "New CVEs Today"
Executing saved query 'New CVEs Today':
select * from cve where date_published > (now() - interval 24 hour) order by date_published desc;
+----------------+------------------------------------------------------------------------------+
| cve_id         | description                                                            | ... |
+----------------+------------------------------------------------------------------------------+
| CVE-2019-20102 | The attachment-uploading feature in Atlassian Confluence Server from ...     |
| CVE-2018-21145 | Certain NETGEAR devices are affected by a stack-based buffer overflow ...    |
...                                                                                             |
| CVE-2020-1757  | A flaw was found in all undertow-2.x.x SP1 versions prior to ...             |
| CVE-2020-5268  | In Saml2 Authentication Services for ASP.NET before versions 2.7.0 ...       |
+----------------+------------------------------------------------------------------------------+

Conclusion

There's a lot more to explore with the NVD repo, we hope you'll clone your own copy and see what you can find. We believe that Dolt adds a lot of value in distributing and consuming dataset. The ability to push, pull, diff, branch and merge your data brings a new level of fluidity to data analysis. This form of collaboration is the norm in software development, why shouldn't data collaboration be just as simple? Check out these dataset we're publishing on DoltHub or create your own!

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.