LICENSE.md and README.md in Dolt

February 13, 2020


Dolt and DoltHub strive to be the best data distribution platform on the internet. Versioning documentation alongside the content it describes, and having a standard, easy way to read that documentation online, are features we admire in Git and GitHub.

Following in Git’s footsteps, we released the documents feature in Dolt (release 0.13.0 & patch 0.13.1) and DoltHub. These are repository-level documents that DoltHub renders as Markdown. The contents of these files are stored in a system table called dolt_docs.

You can check out documents in action on DoltHub in our Corona Virus dataset.

How to add a LICENSE.md and README.md in Dolt

LICENSE.md and README.md files are created when setting up a Dolt repository with dolt init:

shell$ dolt init
shell$ ls
LICENSE.md README.md
shell$ dolt status
On branch master
Untracked files:
  (use "dolt add <table|doc>" to include in what will be committed)
        new doc:        LICENSE.md
        new doc:        README.md

Or, when a user creates the file(s):

shell$ echo "This is the new README" > README.md
shell$ dolt status
On branch master
Untracked files:
  (use "dolt add <table|doc>" to include in what will be committed)
        new doc:        README.md

Just like tables, these docs can be added, removed, modified, diffed, committed, checked out and reset.

Design Decisions: Storage and User Workflow

shell$ dolt schema show dolt_docs;
dolt_docs @ working
CREATE TABLE `dolt_docs` (
  `doc_name` LONGTEXT NOT NULL COMMENT 'tag:0',
  `doc_text` LONGTEXT COMMENT 'tag:1',
  PRIMARY KEY (`doc_name`)
);

From an engineering standpoint, the simplest approach was to set up a dolt_docs system table and have users run a SQL insert through dolt sql, along the lines of INSERT INTO dolt_docs VALUES ('README.md', 'README content...'). Storing the documents in a system table let us avoid a separate storage map with its own getters and setters. It also let us reuse the existing Dolt command code structure, so docs can be modified or removed through existing commands.

So we settled on the storage being a dolt_docs system table, but it didn’t feel right to expect users to know any SQL or understand the role of system tables in Dolt just to check in a LICENSE.md or README.md. For this reason, we decided to make changes to dolt_docs under the hood, and provide the file-based document workflow that we know from Git.
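To make the storage model concrete, here is a minimal sketch of a table with the same shape as dolt_docs, with one row per document keyed by doc_name. It uses Python's built-in sqlite3 module as a stand-in for Dolt's SQL engine, so TEXT replaces LONGTEXT, and the save_doc and read_doc helpers are hypothetical names for illustration, not Dolt functions.

```python
import sqlite3

# SQLite stands in for Dolt's SQL engine here; TEXT replaces LONGTEXT.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dolt_docs ("
    " doc_name TEXT NOT NULL,"
    " doc_text TEXT,"
    " PRIMARY KEY (doc_name))"
)


def save_doc(name, text):
    """Insert or overwrite the single row that holds one document."""
    conn.execute(
        "INSERT OR REPLACE INTO dolt_docs (doc_name, doc_text) VALUES (?, ?)",
        (name, text),
    )


def read_doc(name):
    """Return the stored text for a document, or None if it has no row."""
    row = conn.execute(
        "SELECT doc_text FROM dolt_docs WHERE doc_name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


save_doc("README.md", "README content...")
save_doc("README.md", "Updated README")  # same primary key, so the row is replaced
print(read_doc("README.md"))
```

Because doc_name is the primary key, each document occupies exactly one row, which is what makes it possible to operate on a single document by key.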

What the Filesystem Changes

Theory

Similar to Git, Dolt uses three root values to represent repository state. Each root value stores a commit history and plays a different role in the user workflow. The working root stores local changes in the repository. The staged root stores the changes that have been added and are ready for commit. The head root stores the commit history and represents the tip of the current branch. Tables move through all three roots. Documents start on the filesystem instead, and the working and staged roots are updated together:

# Tables root structure:
working root --> staged root --> head root


# Docs root structure:
filesystem --> working/staged root --> head root
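The table flow above can be sketched with three dictionaries. This is a simplified model, not Dolt's implementation: the Repo class and its add and commit methods are hypothetical, each root is reduced to a mapping from table name to contents, and commit history is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    # Each root maps a table name to its contents (plain values in this sketch).
    working: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)
    head: dict = field(default_factory=dict)

    def add(self, table):
        """Stage a table: copy it from the working root to the staged root."""
        self.staged[table] = self.working[table]

    def commit(self):
        """Commit: the staged root becomes the new head root."""
        self.head = dict(self.staged)


repo = Repo()
repo.working["people"] = ["alice"]  # local change, visible only in working
repo.add("people")                  # now in the staged root as well
repo.commit()                       # now at the tip of the branch
print(repo.head)
```

A change that is never added stays in the working root and is not picked up by commit, which mirrors the working --> staged --> head direction of the diagram.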

Practice

dolt status

Local table changes are computed by diffing the working root with the staged root, whereas local document changes are computed by diffing the filesystem with the working/staged roots.
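The document half of that comparison might look roughly like the sketch below. It assumes a root's docs can be read as a plain dict of name to text; doc_status is a hypothetical helper, and the "deleted" and "modified" labels are illustrative (only "new doc" appears in the status output shown earlier).

```python
import tempfile
from pathlib import Path

DOC_NAMES = ("LICENSE.md", "README.md")  # the only docs that can be checked in


def doc_status(repo_dir, root_docs):
    """Diff the doc files on disk against a root's docs (dict of name -> text)."""
    changes = {}
    for name in DOC_NAMES:
        path = Path(repo_dir) / name
        on_disk = path.read_text() if path.exists() else None
        in_root = root_docs.get(name)
        if on_disk == in_root:
            continue  # unchanged, or absent in both places
        if in_root is None:
            changes[name] = "new doc"
        elif on_disk is None:
            changes[name] = "deleted"   # illustrative label
        else:
            changes[name] = "modified"  # illustrative label
    return changes


with tempfile.TemporaryDirectory() as repo_dir:
    (Path(repo_dir) / "README.md").write_text("This is the new README\n")
    # The root knows about a LICENSE.md that is no longer on disk.
    status = doc_status(repo_dir, {"LICENSE.md": "MIT"})

print(status)
```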

dolt add <table|doc>

Staging a table with dolt add <table> moves that table from the working root to the staged root, whereas dolt add <doc> requires creating a dolt_docs table at runtime from the filesystem values, and applying those changes to both the working and staged roots.
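As a rough sketch of the document case, the function below reads one file from disk and writes its row into a dolt_docs mapping in both roots. Roots are modelled as plain dicts and add_doc is a hypothetical name. Treating a missing file as a staged deletion is an assumption of this sketch, not something the post specifies.

```python
import tempfile
from pathlib import Path


def add_doc(repo_dir, name, working, staged):
    """Rebuild one doc row from the filesystem and apply it to both roots."""
    path = Path(repo_dir) / name
    for root in (working, staged):
        # The docs table is created at runtime if the root does not have one yet.
        docs = root.setdefault("dolt_docs", {})
        if path.exists():
            docs[name] = path.read_text()
        else:
            docs.pop(name, None)  # assumption: a missing file stages a deletion


working, staged = {}, {}
with tempfile.TemporaryDirectory() as repo_dir:
    (Path(repo_dir) / "README.md").write_text("This is the new README\n")
    add_doc(repo_dir, "README.md", working, staged)

print(staged["dolt_docs"]["README.md"])
```

Only the named document's row is touched, so any other doc row already in the roots is left as it was.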

dolt checkout <table|doc>

Resetting a table with dolt checkout <table> takes that table from the staged root (if it is already staged), or the head root, and applies it to the working root. Doing the same for a document with dolt checkout README.md involves taking the dolt_docs table from the staged or head root, plucking out the README.md row via primary key, and saving the contents of that row to README.md on the filesystem.
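The document lookup can be sketched as a fall-through from the staged root to the head root. As before, roots are plain dicts, checkout_doc is a hypothetical helper, and the dict key plays the part of the doc_name primary key.

```python
import tempfile
from pathlib import Path


def checkout_doc(repo_dir, name, staged, head):
    """Restore one doc file from the staged root, falling back to the head root."""
    for root in (staged, head):
        text = root.get("dolt_docs", {}).get(name)  # row lookup by primary key
        if text is not None:
            (Path(repo_dir) / name).write_text(text)
            return True
    return False  # the doc is in neither root, so there is nothing to restore


head = {"dolt_docs": {"README.md": "committed text\n"}}
staged = {"dolt_docs": {}}
with tempfile.TemporaryDirectory() as repo_dir:
    readme = Path(repo_dir) / "README.md"
    readme.write_text("unwanted local edit\n")
    restored = checkout_doc(repo_dir, "README.md", staged, head)
    contents = readme.read_text()

print(restored, repr(contents))
```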

Rendering docs on DoltHub

No changes were necessary on the DoltHub service layer to render documents on the web. We filter system tables out of the main table list in the repository, and query the dolt_docs table wherever we want to render the documents or check for their existence. We used ReactDiffViewer to display text diffs and an external style sheet from github-markdown-css to style the Markdown content.

Conclusion and next steps

Row by row operations

To support LICENSEs and READMEs, we had to support staging individual rows. Any operation taken on a single document, like dolt checkout <doc>, dolt add <doc> and dolt reset <doc>, required this new functionality. While it currently only applies to documents and the dolt_docs table, it is a feature we would like to support on user tables. For instance, we think users would like to stage and commit only some of the rows they modified in a working set. This would be akin to committing only part of a file in Git using git add --patch. Because the Git patch is not exactly analogous to the table use case, we need to design appropriate semantics. Look forward to this feature in a future release.

Checking in other files

For now, the only documents that can be checked in are README.md and LICENSE.md. There has been some interest in checking in other documents with tables. Data import code, for example, coupled with tables and version controlled across branches, could unlock even more value in the data versioning space. This is likely something we will support someday.

Exposing system tables on DoltHub

We are excluding dolt_docs from the Tables section in the repository, and have plans to build out a separate interface for system tables. You can still examine these tables using the SQL interface on Dolt. More to come!

If you haven’t already, give Dolt a try, and look out for part II of this blog for more discussion on data licensing.
