The genesis for DoltHub the company was a persistent question starting in 2013. Why is there no place on the internet to get access to interesting, maintained data? DoltHub's thesis was that adding branch/merge to data, in a way similar to what we had in source code, would result in more data being shared. Data creators and consumers could collaborate on data in the same way that open source maintainers and users collaborate on source code.
Dolt was created as Git for tables in August 2019. Shortly after Dolt's open source release, DoltHub was launched as Dolt's GitHub, a place on the internet to share and collaborate on Dolt databases. Since then, DoltHub has become an interesting collaborative workspace for data, especially for data bounties.
Since 2019, Dolt's customers have been requesting Dolt become a database you can run in production because branch/merge on databases is really useful for a number of use cases. This blog attempts to capture and explain some of those use cases.
Dolt. The database for...
As I said in the introduction, Dolt was built for sharing. Dolt is the database you can copy/paste. A copy, or clone in Git parlance, creates a copy of data and schema for truly decentralized reads and writes. A paste, or merge, allows these decentralized writes to be shared amongst the various copies of the database. These features make Dolt the ideal database for sharing because each user gets to operate on their own copy and collaborate asynchronously, adopting another person's changes whenever they choose.
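To make the copy/paste idea concrete, here is a toy sketch of a three-way merge over two copies of a table that diverged from a common base. It works on whole rows held in plain Python dictionaries and is only an illustration of the concept, not Dolt's actual merge implementation. The function name, table layout, and sample rows are all made up for the example.

```python
from typing import Dict, List, Optional, Tuple

Row = Tuple[str, ...]      # a row's non-key column values
Table = Dict[str, Row]     # primary key -> row


def merge_tables(base: Table, ours: Table, theirs: Table) -> Tuple[Table, List[str]]:
    """Three-way merge of two copies that diverged from a common base."""
    merged: Table = {}
    conflicts: List[str] = []
    for key in sorted(set(base) | set(ours) | set(theirs)):
        b: Optional[Row] = base.get(key)
        o: Optional[Row] = ours.get(key)
        t: Optional[Row] = theirs.get(key)
        if o == t:
            result = o             # both copies agree (or both deleted the row)
        elif o == b:
            result = t             # only their copy touched this row
        elif t == b:
            result = o             # only our copy touched this row
        else:
            conflicts.append(key)  # both copies changed the row differently
            result = o             # keep ours until someone resolves it
        if result is not None:
            merged[key] = result
    return merged, conflicts


# Hypothetical data: we fix a state, they add a new business.
base = {"1": ("Acme", "NY"), "2": ("Globex", "CA")}
ours = {"1": ("Acme", "NJ"), "2": ("Globex", "CA")}
theirs = {"1": ("Acme", "NY"), "2": ("Globex", "CA"), "3": ("Initech", "TX")}

merged, conflicts = merge_tables(base, ours, theirs)
```

Both writers worked independently on their own copy, and the merge picks up each side's change without either of them coordinating up front. A conflict is only reported when both sides change the same row in different ways.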
The best way to see this sharing in action is data bounties, where a distributed team of data bounty hunters collaborates to produce a database, such as one covering all businesses in the United States.
We started out imagining Dolt would be the database you would share with hundreds or thousands of people on the internet. But we're starting to wonder if we skipped a few steps. What if Dolt is the database you share between your laptop, development, and production? Dolt also works for smaller scale data sharing use cases.
Dolt is the database for sharing.
Are customers or vendors sharing data with you? When new data is shared, do downstream systems break? In my career, receiving shared data from outside my organization has often caused problems.
Dolt allows you to assert greater control over the shared data you are ingesting into your team. If you ingest bad data, roll back to the previous version. Look at the problematic diff and often the problem and solution will be obvious. Were you only expecting new data from the past day, but the vendor updated last month's data as well? You can even automate tests like these with tools like Great Expectations, which integrates really well with Dolt.
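As a rough sketch of the kind of check you could run against a diff, the snippet below compares two versions of a table and flags any changed row that is older than the window you expected the vendor to touch. The two versions are plain dictionaries standing in for the before and after states of a table; in practice you would pull them from the database. The function name, row layout, and sample data are assumptions made for the example.

```python
from datetime import date, timedelta
from typing import Dict, List, Tuple

# primary key -> (record_date, value); a stand-in for one version of a table
Version = Dict[str, Tuple[date, float]]


def unexpected_changes(old: Version, new: Version, today: date,
                       window_days: int = 1) -> List[str]:
    """Return keys of changed rows dated before the allowed window."""
    cutoff = today - timedelta(days=window_days)
    flagged: List[str] = []
    for key in sorted(set(old) | set(new)):
        before, after = old.get(key), new.get(key)
        if before == after:
            continue  # row is identical in both versions
        # Judge a modified or deleted row by its original date,
        # and a newly added row by the date it arrived with.
        record_date = (before or after)[0]
        if record_date < cutoff:
            flagged.append(key)
    return flagged


# Hypothetical vendor drop: one new row for today, one edit to last month's data.
today = date(2021, 3, 10)
old = {"a": (date(2021, 2, 1), 10.0), "b": (date(2021, 3, 9), 5.0)}
new = {"a": (date(2021, 2, 1), 12.0), "b": (date(2021, 3, 9), 5.0),
       "c": (date(2021, 3, 10), 7.0)}

flagged = unexpected_changes(old, new, today)
```

If `flagged` is non-empty, the ingestion job can stop and leave the previous version in place while someone looks at the diff.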
Dolt is the database for data ingestion.
In Machine Learning use cases, data is code. Andrej Karpathy calls this Software 2.0. What if you treated data like code?
Using Dolt to manage and share your Machine Learning data amongst your data analysts, engineers, and scientists makes collaboration easy. Dolt gives you human and machine readable diffs. Diffs are useful for providing data oriented insights into your ML models. Why is this model performing better than that one? What in the data changed? Dolt provides data lineage as a first class entity in your Machine Learning pipelines. Dolt provides model reproducibility by storing each version of the data you use to train a model. Dolt is especially useful in Natural Language Processing (NLP), where the data is mostly text.
Dolt is fully MySQL compatible, so it integrates seamlessly with pandas DataFrames. Dolt integrates easily with other machine learning tools. We've partnered on integrations with Kedro, Flyte, and Metaflow.
Dolt is the database for machine learning.
Configuration has been taken over by YAML. So much YAML.
It is common practice to store this YAML configuration in Git. Unfortunately, a file based UI becomes unwieldy at scale because it's not queryable. For large scale configuration management, you often grow out of YAML.
By adopting Dolt for configuration management, you can have Git-style version control and the ability to query large scale sets of configuration. Nautobot, a network configuration management tool, leverages Dolt to allow network engineers to share network configuration data more easily. Nautobot was built to be backed by Postgres, but by moving it to Dolt, Nautobot was able to support Git-style versioning, including integrated code review via Pull Requests.
Dolt is the database for configuration management.
Do you have a database problem in production? Do you feel comfortable logging in with privileges to the production host? What if you need to make a change, like add an index? How can you be sure you're not going to break anything?
The above problem is not the least bit harrowing using Dolt. Share your production database with your laptop. Clone a copy of the production database and debug outside of production. Run pull to get the latest production data. Need to add an index as a fix or even change a problematic value? Make the change on your laptop, get it reviewed in a PR, and merge it up into production. No downtime. No risk of connecting to a production system and making writes.
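The workflow above could be scripted roughly as follows. This sketch assumes you have already run `dolt clone` and are working inside the cloned directory, that the remote is named `origin`, and that the Dolt CLI is on your path. The branch name, table, and index in the example are hypothetical.

```python
import subprocess
from typing import List


def build_fix_commands(branch: str, sql: str, message: str) -> List[List[str]]:
    """Build the Dolt CLI commands for making a reviewed fix from a local clone."""
    return [
        ["dolt", "pull"],                    # get the latest production data
        ["dolt", "checkout", "-b", branch],  # do the work on a new branch
        ["dolt", "sql", "-q", sql],          # apply the fix locally
        ["dolt", "add", "."],                # stage the changed tables
        ["dolt", "commit", "-m", message],   # record the change
        ["dolt", "push", "origin", branch],  # publish the branch for review
    ]


def run_commands(commands: List[List[str]], clone_dir: str) -> None:
    """Run each command inside the clone, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, cwd=clone_dir, check=True)


# Hypothetical fix: add an index to a users table.
commands = build_fix_commands(
    branch="add-users-email-index",
    sql="CREATE INDEX idx_users_email ON users (email)",
    message="Add index on users.email",
)
```

Calling `run_commands(commands, "path/to/clone")` would execute the steps in order. Once the branch is pushed, the change can be reviewed in a PR and merged, so nothing in this script writes to the production database directly.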
Dolt allows you to have all the benefits of the software development workflows you've used for the past 10 years in your database.
Dolt is the database for debugging.
The trend towards continuous deployment in software unlocked orders of magnitude productivity improvements in many software development environments. Commit your code and have it tested and deployed all the way to production, failing if it causes issues. This setup moves software failure closer to the time the code was written, making debugging much quicker. Continuous deployment was so revolutionary that most modern software development teams now adopt it from inception.
Dolt allows you to bring continuous deployment to your database development. Commit your schema and data and have it tested and deployed all the way to production, failing if it causes issues. Achieve the same order of magnitude productivity improvements in database development you achieved with software. Share the database on your laptop with production.
Dolt is the database for continuous deployment.
Dolt was built as the database for sharing. But because of this, Dolt is the database for so much more. Dolt is the database for data ingestion, machine learning, configuration management, debugging, and continuous deployment...to name a few of our favorite use cases.
Have a use case we missed? Come tell us about it on our Discord.