LICENSE.md and README.md in Dolt

Dolt and DoltHub strive to be the best data distribution platform on the internet. Having documentation versioned alongside data, and a standard, easy way to read the documentation online are features we admire in Git and GitHub.

Following in Git's footsteps, we released the documents feature in Dolt (release 0.13.0 and patch 0.13.1) and on DoltHub. These are repository-level documents that are rendered as Markdown on DoltHub. The contents of these files are stored in a system table called dolt_docs.

You can check out documents in action on DoltHub in our Corona Virus dataset.

How to add a LICENSE.md and README.md in Dolt

LICENSE.md and README.md files are created when setting up a Dolt repository with dolt init:

shell$ dolt init
shell$ ls
LICENSE.md README.md
shell$ dolt status
On branch master
Untracked files:
  (use "dolt add <table|doc>" to include in what will be committed)
        new doc:        LICENSE.md
        new doc:        README.md

Or, when a user creates the file(s):

shell$ echo "This is the new README" > README.md
shell$ dolt status
On branch master
Untracked files:
  (use "dolt add <table|doc>" to include in what will be committed)
        new doc:        README.md

Just like tables, these docs can be added, removed, modified, diffed, committed, checked out and reset.

Design Decisions: Storage and User Workflow

shell$ dolt schema show dolt_docs
dolt_docs @ working
CREATE TABLE `dolt_docs` (
  `doc_name` LONGTEXT NOT NULL COMMENT 'tag:0',
  `doc_text` LONGTEXT COMMENT 'tag:1',
  PRIMARY KEY (`doc_name`)
);

From an engineering standpoint, the simplest approach was to set up a dolt_docs system table and have users write to it through dolt sql, with a statement such as INSERT INTO dolt_docs VALUES ('README.md', 'README content...'). Storing the documents in a system table let us avoid a separate storage map with its own getters and setters. It also let us reuse the existing Dolt command code structure, so docs can be modified or removed through existing commands.
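
To make the storage model concrete, here is a minimal Python sketch that uses an in-memory SQLite database as a stand-in for Dolt. It is an illustration only: SQLite's TEXT type replaces LONGTEXT, and the helper names put_doc and get_doc are hypothetical, not part of Dolt.

```python
import sqlite3

# In-memory SQLite stands in for Dolt here; TEXT replaces LONGTEXT.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dolt_docs ("
    " doc_name TEXT NOT NULL PRIMARY KEY,"
    " doc_text TEXT)"
)


def put_doc(name, text):
    """Insert or overwrite the single row that holds one document."""
    conn.execute(
        "INSERT OR REPLACE INTO dolt_docs (doc_name, doc_text) VALUES (?, ?)",
        (name, text),
    )


def get_doc(name):
    """Return the document text, or None if there is no such row."""
    row = conn.execute(
        "SELECT doc_text FROM dolt_docs WHERE doc_name = ?", (name,)
    ).fetchone()
    return row[0] if row else None


put_doc("README.md", "README content...")
put_doc("README.md", "Updated README content")  # same key, so the row is replaced
print(get_doc("README.md"))
```

Because doc_name is the primary key, each document is exactly one row, which is what makes it possible to operate on a single doc without touching the others.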

So we settled on the storage being a dolt_docs system table, but it didn't feel right to expect users to know any SQL or understand the role of system tables in Dolt just to check in a LICENSE.md or README.md. For this reason, we decided to make changes to dolt_docs under the hood, and provide the file-based document workflow that we know from Git.

What the Filesystem Changes

Theory

Similar to Git, Dolt uses three root values to represent repository state. Each root value is a snapshot of the repository and plays a different role in the user workflow. The working root stores local changes in the repository. The staged root stores the changes that have been added and are ready for commit. The head root represents the tip of the current branch, that is, the most recent commit in its history.

# Tables root structure:
working root --> staged root --> head root


# Docs root structure:
filesystem --> working/staged root --> head root
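
For tables, the flow through the three roots can be pictured with a small Python model. This is a simplified sketch, not Dolt's implementation: the Repo class, its methods, and the "people" table are hypothetical, and each root is reduced to a plain dictionary of table name to rows.

```python
from dataclasses import dataclass, field


@dataclass
class Repo:
    # Each root maps a table name to a snapshot of that table's rows.
    working: dict = field(default_factory=dict)
    staged: dict = field(default_factory=dict)
    head: dict = field(default_factory=dict)

    def add(self, table):
        """Stage a table: copy it from the working root to the staged root."""
        self.staged[table] = dict(self.working[table])

    def commit(self):
        """Commit: the staged root becomes the new head root."""
        self.head = dict(self.staged)


repo = Repo()
repo.working["people"] = {1: "Ada"}
repo.add("people")
repo.commit()
repo.working["people"] = {1: "Ada", 2: "Grace"}  # unstaged local change

print(repo.head["people"])     # the committed snapshot
print(repo.working["people"])  # includes the local change
```

For docs, the filesystem takes over the role that the working root plays here, and the working and staged roots are updated together.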

Practice

dolt status

Local table changes are computed by diffing the working root with the staged root, whereas local document changes are computed by diffing the filesystem with the working/staged roots.
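
The document side of that comparison can be sketched as a diff between two name-to-text mappings. The function doc_status is hypothetical, and only the "new doc" label comes from the status output shown earlier; "modified" and "deleted" are assumed labels for illustration.

```python
def doc_status(filesystem, root_docs):
    """Compare docs on disk with the dolt_docs rows stored in a root.

    Both arguments map a doc name to its text. Returns (name, change) pairs.
    """
    changes = []
    for name in sorted(set(filesystem) | set(root_docs)):
        if name not in root_docs:
            changes.append((name, "new doc"))
        elif name not in filesystem:
            changes.append((name, "deleted"))
        elif filesystem[name] != root_docs[name]:
            changes.append((name, "modified"))
    return changes


on_disk = {"README.md": "This is the new README", "LICENSE.md": "MIT"}
in_root = {"LICENSE.md": "Apache-2.0"}
status = doc_status(on_disk, in_root)
print(status)
```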

dolt add <table|doc>

Staging a table with dolt add <table> moves that table from the working root to the staged root, whereas dolt add <doc> requires creating a dolt_docs table at runtime from the filesystem values, and applying those changes to both the working and staged roots.
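
A rough Python sketch of the document case follows. It assumes each root is a dictionary of table name to rows, and that removing a file on disk should remove its row when staged; add_doc is a hypothetical helper, not a Dolt function.

```python
DOCS_TABLE = "dolt_docs"


def add_doc(name, filesystem, working, staged):
    """Stage one doc: copy its text from disk into both roots' dolt_docs."""
    for root in (working, staged):
        docs = dict(root.get(DOCS_TABLE, {}))  # copy so the roots stay independent
        if name in filesystem:
            docs[name] = filesystem[name]
        else:
            docs.pop(name, None)  # file removed on disk, so drop the row
        root[DOCS_TABLE] = docs


filesystem = {"README.md": "This is the new README", "LICENSE.md": "MIT"}
working, staged = {}, {}
add_doc("README.md", filesystem, working, staged)
print(staged[DOCS_TABLE])
```

Only the README.md row is written, so LICENSE.md remains an untracked file on disk until it is added separately.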

dolt checkout <table|doc>

Resetting a table with dolt checkout <table> takes that table from the staged root (if it is already staged) or the head root, and applies it to the working root. Doing the same for a document with dolt checkout README.md involves taking the dolt_docs table from the staged or head root, plucking out the README.md row via its primary key, and saving the contents of that row to README.md on the filesystem.
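
The document checkout can be sketched in Python as a keyed lookup followed by a file write. As before, the roots are modeled as plain dictionaries and checkout_doc is a hypothetical name; the sketch uses a temporary directory in place of a real repository.

```python
import os
import tempfile

DOCS_TABLE = "dolt_docs"


def checkout_doc(name, staged, head, directory):
    """Restore one doc on disk from the staged root, falling back to head."""
    for root in (staged, head):
        docs = root.get(DOCS_TABLE, {})
        if name in docs:  # lookup by primary key (the doc name)
            path = os.path.join(directory, name)
            with open(path, "w", encoding="utf-8") as f:
                f.write(docs[name])
            return path
    raise KeyError(f"{name} is not in the staged or head root")


head = {DOCS_TABLE: {"README.md": "Committed README"}}
staged = {}
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "README.md"), "w", encoding="utf-8") as f:
    f.write("Local edits to discard")

restored = checkout_doc("README.md", staged, head, workdir)
with open(restored, encoding="utf-8") as f:
    print(f.read())
```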

Rendering docs on DoltHub

There were no changes necessary on the DoltHub service layer to render documents on the web. We filter system tables out of the main table list in the repository, and make a call for the dolt_docs table wherever we want to render the documents or check for their existence. We used ReactDiffViewer to display text diffs and an external style sheet from github-markdown-css to style the Markdown content.
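
The filtering step might look something like the Python sketch below. It assumes system tables can be recognized by a dolt_ name prefix, which this post only shows for dolt_docs; split_tables and docs_to_render are hypothetical helpers and do not reflect DoltHub's actual code.

```python
SYSTEM_PREFIX = "dolt_"  # assumption: system tables share this name prefix


def split_tables(names):
    """Separate user tables from system tables by name."""
    user = [n for n in names if not n.startswith(SYSTEM_PREFIX)]
    system = [n for n in names if n.startswith(SYSTEM_PREFIX)]
    return user, system


def docs_to_render(tables):
    """Return the dolt_docs rows if the table exists, else an empty dict."""
    return tables.get("dolt_docs", {})


tables = {
    "cases": {},
    "places": {},
    "dolt_docs": {"README.md": "# Dataset overview"},
}
user_tables, system_tables = split_tables(sorted(tables))
print(user_tables)
print(docs_to_render(tables))
```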

Conclusion and next steps

Row by row operations

In order to support LICENSEs and READMEs, you'll notice we had to support staging individual rows. Any operation taken on a single document, like dolt checkout <doc>, dolt add <doc> and dolt reset <doc> required this new functionality. While it currently only applies to documents and the dolt_docs table, it is a feature we would like to support on user tables. For instance, we think users would like to stage and commit only some of the rows they modified in a working set. This would be akin to only committing part of a file in Git using git add --patch. Because the Git patch is not exactly analogous to the table use case, we need to design appropriate semantics. Look forward to this feature in a future release.
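
As a thought experiment only, row-level staging on a user table could be pictured like this in Python. The semantics here (stage by primary key, treat a missing working row as a deletion) are one possible reading, not a committed design, and stage_rows is a hypothetical name.

```python
def stage_rows(working_table, staged_table, keys):
    """Return a new staged table with only the chosen rows brought over."""
    result = dict(staged_table)
    for key in keys:
        if key in working_table:
            result[key] = working_table[key]
        else:
            result.pop(key, None)  # row was deleted in the working set
    return result


staged = {1: "Ada", 2: "Grace"}
working = {1: "Ada Lovelace", 2: "Grace Hopper", 3: "Alan"}
staged_after = stage_rows(working, staged, keys=[2, 3])
print(staged_after)
```

Row 1 keeps its staged value because its key was not selected, which is the table analogue of leaving one hunk out of a partial Git add.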

Checking in other files

For now, the only documents that can be checked in are README.md and LICENSE.md. There has been some interest in checking in other documents alongside tables. Data import code, for example, coupled with tables and version controlled across branches, could unlock even more value in the data versioning space. This is likely something we will support someday.

Exposing system tables on DoltHub

We are excluding dolt_docs from the Tables section in the repository, and have plans to build out a separate interface for system tables. You can still examine these tables using the SQL interface on Dolt. More to come!

If you haven't already, give Dolt a try, and look out for part II of this blog for more discussion on data licensing.
