So you want Spreadsheet Version Control?

REFERENCE
7 min read

Here at DoltHub, we've had a lot of success with our "So you want..." series of blog posts helping people find Dolt when they are looking for us. Dolt is a lot of things. Dolt is a version controlled database, a Git database, Git for data, data version control, an immutable database, and a decentralized database.

Dolt is not Spreadsheet Version Control. But a lot of our customers migrate to Dolt when their spreadsheet powered system of data management is crumbling under heavy edits from multiple users. Version controlling data currently managed in a spreadsheet becomes a powerful way to increase data quality while still encouraging collaboration.

In this blog, I will discuss the collaborative spreadsheet problem in detail, explain how it breaks down, and then explain some options to remedy the issue, one of which is migrating to Dolt.

Spreadsheet Icon

But there's a Version History in Excel and Google Sheets?

You're right. Here's the version history option for the Google Sheet we use to manage our blog writing schedule.

Google Sheets Version History

Similar functionality exists in Microsoft Excel, traditional and online.

But version history is not version control. Version control allows for branch and merge, the ability to asynchronously collaborate on multiple versions in an organized way. Instead of everyone editing the same Google Sheet, you can create you own copy, make edits, see the differences (ie. diffs) between your copy and another copy, and then merge two copies together. That workflow is true version control. No spreadsheet has those features.

Why is there no Version Controlled Spreadsheet?

Version control is hard, both the technology and the user experience.

Branch and merge is a famously hard technical problem. The industry struggled with it for source code until one of the greatest computer scientists who ever lived, Linus Torvalds, creator of Linux, invented Git. Adding version control that scales to spreadsheet's underlying data storage is a hard problem. We here at Dolt know, it's what we do.

Moreover, true version control involving branch and merge has some user experience challenges. A quick Twitter search for "git interface" always yields some gems like this one:

I've been in the software game a long time and Perforce received similar vitriol. People don't like or understand most version control interfaces so adding one to spreadsheets is a user interface design challenge.

Lastly, once you need spreadsheet version control, you may need to graduate out of a spreadsheet driven process altogether. "Let's version control this spreadsheet" usually means "We need something other than a spreadsheet to do this".

When do Spreadsheets need Version Control?

Collaborative Editing

If you get in a situation where greater than five people are editing the same sheet, either online like in Google Sheets, or offline like in a tradition Microsoft Excel, people start stepping on each other's toes. Edits are written over. Work is lost. In the offline case, a manual merge of multiple people's work is required. Version control was designed to solve this problem for source code so people naturally think it can solve the problem for spreadsheets.

Data Ingestion

Often, business critical data is created for automated systems by human editors. These processes often start as spreadsheets because of the convenient, powerful, tabular edit interface. The data is then usually saved in a text format, like comma separated variable (CSV), and ingested into some sort of database. This is a human driven process and human's make mistakes. As a system like this gets bigger or more business critical, a natural evolution is to institute a "peer review" by another human before the automated system ingests the data. Version control can produce differences between two copies of a file, making human review a lot easier. You only have to review what has changed. Version control also makes it easy to rollback to a previous version of the data if the review process fails adding an extra layer of protection.

Shared data

Often people want to distribute a spreadsheet for many people to use. What if everyone can't agree on what version of the data is correct or useful to them? You need to have multiple parallel copies for different people to refer to. What if you want to update some data common to all copies? You don't want to edit all copies. You want make one edit and have all copies get that edit if they choose. Version control solves this problem.

The Options

Researching for this blog, I found a limited number of options for version controlling spreadsheets. If I missed you please email me at tim@dolthub.com.

Spreadsheet Version Control

Git

Tagline
Fast, scalable, distributed revision control system
Initial Release
April 3, 2005
GitHub
https://github.com/git/git (mirror)

You can put an Excel file in Git. It's just a file. The problem is, Excel files are not text files, they are a proprietary binary format.

In the following example I start by adding an empty Excel spreadsheet to a Git repository.

$ git init
$ git add Book1.xlsx 
$ git commit -m "Empty Excel workbook"
[master (root-commit) c59b950] Empty Excel workbook
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 Book1.xlsx

I then populate cell A1 the value "Test".

Excel Git Example

Then I want to see what changed.

$ git diff
diff --git a/Book1.xlsx b/Book1.xlsx
index 3265e4b..ed05683 100644
Binary files a/Book1.xlsx and b/Book1.xlsx differ

I can see the file changed but I can't see what changed, severely limiting Git's utility. You can store Excel files in Git and it gives you a way to rollback to previous versions and annotate what's changed in the Commit message but you don't get the powerful version control capabilities like diff and merge.

You can also save the sheet you want to version control in a text format like CSV but in order to get proper diffs and merges the CSV must be sorted in the same way each time. You also lose formulas if you try to restore from a CSV.

Using Git to version control spreadsheets does not seem like the right tool for the job.

xltrail

Tagline
Version Control For Excel Workbooks
Initial Release
February 2018 (Used GitXL first release)
GitHub
https://github.com/xlwings/git-xlGit XL is open source, not full xltrail

xltrail and the corresponding open source package Git XL is really the only option to extend version control beyond the version history functionality of Excel or Google Sheets. xltrail only works with Microsoft Excel workbooks.

xltrail offers Git integration and a slick web interface for version management and diffs. xltrail breaks your Excel workfbooks up into sheets, VBA code, and Power queries and versions each piece separately. xltrail allows you to commit and annotate why you changed different pieces.

The downside is that xltrail does not seem to support merge and conflict detection on the spreadsheets themselves. This means you are pretty much restricted to linear history which should be fine for most use cases.

The Git repository is active so we can expect to see the product improve over time.

Dolt

Tagline
It's Git for Data
Initial Release
August 2019
GitHub
https://github.com/dolthub/dolt

Dolt is not a version controlled spreadsheet. Dolt is a version controlled database. DoltHub is a web interface to Dolt databases, similar to GitHub for Git.

Many of Dolt's customers have a spreadsheet driven workflow. These customers hit one of the problems listed above and went looking for a version controlled spreadsheet. They found Dolt and became convinced that their process had outgrown spreadsheets exclusively. Migrating the data to Dolt for the data version control functionality while keeping the spreadsheet interface for parts of the process becomes the solution. This is the diagram of a process one of our customers adopted to achieve data version control in a previously spreadsheet driven process.

Sheets as a Feeder for ML

Moreover, DoltHub comes with a spreadsheet editor for non-technical users to make modifications directly to Dolt. It does not support formulas or other spreadsheet features but for some users it fits the bill.

DoltHub Spreadsheet Editor

We're biased but we think Dolt and DoltHub can really help in situations where you need a version controlled spreadsheet. A version controlled spreadsheet doesn't exist. Maybe you should build one? Dolt would be the perfect backing store for something like that. Want to discuss? Come chat with us on our Discord.

SHARE

JOIN THE DATA EVOLUTION

Get started with Dolt

Or join our mailing list to get product updates.