So you want an AI Database?

Here at DoltHub, we built the world’s first version-controlled SQL database: Dolt. What do version control and databases have to do with Artificial Intelligence (AI)? It turns out, a lot.

At first, we were skeptical about the AI revolution, but then we discovered Claude Code. All of a sudden, Dolt became the database for agents. A little later, we decided Dolt is the database for AI. It says so right at the top of our homepage.

In 2025, we talked a lot about AI in this blog. As this pivotal year for AI comes to a close, let’s survey how AI impacts databases.

A Brief History of AI#

As covered in yesterday’s article, A Brief History of AI, AI is capable of the following functions today:

  1. Universal Similarity Search
  2. Summarization
  3. Writing Code
  4. Self-Driving

In the near future, AI will start expanding into the following domains:

  1. Cursor for Everything
  2. Agents for Everything
  3. Advanced Robotics

Current and future AI use cases color how we must think about databases in an AI-driven world.

What Does AI Mean For Databases?#

What new database features are needed to support AI use cases? What existing database features become more important? What new types of data need to be persisted? What query interfaces are needed?

Training Data#

Storing training data in databases applies more to traditional machine learning than to generative AI, though there are interesting generative AI datasets. Traditional machine learning relies on human- or machine-generated labels to produce a model; generative AI does not. Traditional training data is usually somewhat structured. The raw data is usually stored in an Online Analytical Processing (OLAP) database, with specific training datasets pulled into feature stores. Traditional machine learning data can benefit from new database features like version control, as we've seen here at DoltHub via Flock Safety. Flock Safety stores all their vision and audio training data in Dolt and leverages Dolt's version control features for model reproducibility, explainability, and collaboration.
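
To make that concrete, here is a minimal sketch of versioning a labeled training set in Dolt. The schema, data, and tag names are invented for illustration; DOLT_COMMIT, DOLT_TAG, and AS OF queries are part of Dolt's documented version control interface.

```sql
-- Hypothetical labels table for a vision training set
CREATE TABLE labels (
  image_id varchar(64) PRIMARY KEY,
  label varchar(100) NOT NULL,
  labeler varchar(100),
  labeled_at datetime
);

INSERT INTO labels VALUES ('img-001', 'stop_sign', 'alice', NOW());

-- Commit the current state of the training data
CALL DOLT_COMMIT('-Am', 'First batch of stop sign labels');

-- Tag the exact dataset a model was trained on
CALL DOLT_TAG('model-v1-train', 'HEAD');

-- Later: reproduce the training run against the exact tagged data
SELECT * FROM labels AS OF 'model-v1-train';
```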

Modern generative AI pre-training requires a massive training corpus: basically the entire internet plus any other quality data you can get. For large language models (LLMs), the data is raw text. These models split the data into training and test sets and then optimize a loss function, continually tuning the model. The last few years have been dedicated to scaling out pre-training data and compute, producing ever more effective models. From a database perspective, this raw text must be stored, but the query requirements don't go far beyond "give me all the data". The data is also not highly curated; the more data the better. This training data is likely 99% appends and 1% updates and deletes for removing obvious garbage. Thus, I suspect training data for large language models sits comfortably in cloud storage like AWS S3 or Google Cloud Storage. When it's needed, it's pulled in bulk reads.

For generative AI, there are interesting curated post-training and evaluation datasets. These datasets are similar to traditional machine learning datasets with human curation and labels. I’m not very familiar with this space, but a few prospective Dolt users have been interested in versioning this data for model reproducibility, explainability, and collaboration. I’m eager to learn about this space so if you work with this data, please come by our Discord and educate me.

Vectors#

Vector databases were hot in 2024. Specialized vector databases like Pinecone and Milvus were gaining traction, and almost all database offerings were launching vector support. Even Dolt launched version-controlled vector support in early 2025. Vectors in databases enable universal similarity search and retrieval augmented generation (RAG). A vector database must both store vectors and calculate the distance between them quickly, usually using vector indexes. Vector indexes are a non-trivial technical problem, so the trade-offs each database offering made are interesting. Vector support seems like table stakes for a database looking to support AI use cases, though the popularity of RAG seems to be waning.
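
To make this concrete, here is what vector storage and similarity search look like in Postgres with the pgvector extension, one popular implementation. The table and embeddings are toy examples; real models produce hundreds or thousands of dimensions.

```sql
CREATE EXTENSION IF NOT EXISTS vector;

-- Toy table with 3-dimensional embeddings
CREATE TABLE documents (
  id bigserial PRIMARY KEY,
  body text,
  embedding vector(3)
);

INSERT INTO documents (body, embedding) VALUES
  ('cats',   '[0.9, 0.1, 0.0]'),
  ('dogs',   '[0.8, 0.2, 0.0]'),
  ('stocks', '[0.0, 0.1, 0.9]');

-- An approximate nearest neighbor (HNSW) index makes distance queries fast
CREATE INDEX ON documents USING hnsw (embedding vector_l2_ops);

-- Similarity search: the 2 documents closest to a query embedding
SELECT body FROM documents
ORDER BY embedding <-> '[0.85, 0.15, 0.0]'  -- <-> is L2 distance
LIMIT 2;
```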

More Writes#

This analysis is about to get a bit more meta as I think through the new data requirements in a world where generative AI and agents are commonplace. We're really at the beginning of a new AI technology cycle, so predictions rather than hard facts rule the day. There are a number of factors stemming from generative AI that will result in more writes to databases.

In general, data becomes more useful because AI can find patterns and anomalies in data that humans can’t. This will result in more data stored. This data will likely be mostly append only, semi-structured, and read in big batches. This makes OLAP solutions attractive. Expect your company’s Snowflake and Databricks bills to go up.

Context is the term for all the information you send to an LLM to generate a response. Context is important to understanding why the LLM produced a certain response. Agents make multiple round trips to an LLM, each producing context. Context is like an agent's stack trace. All this context should be stored, resulting in much more data being written. It's odd to me that better storage for, and interfaces to, context don't already exist. This data is mostly text but could benefit from features like rewind, forking, and merging, both as user-facing features and for agents themselves.
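
There's no standard yet, but a first cut at context storage could be as simple as an append-only event log per agent session. This schema is entirely hypothetical:

```sql
CREATE TABLE agent_sessions (
  session_id varchar(36) PRIMARY KEY,
  agent_name varchar(100) NOT NULL,
  started_at datetime NOT NULL
);

-- Append-only log of everything sent to and received from the LLM
CREATE TABLE context_events (
  session_id varchar(36) NOT NULL,
  seq int NOT NULL,             -- ordering within the session
  role varchar(20) NOT NULL,    -- 'system', 'user', 'assistant', 'tool'
  content json NOT NULL,        -- prompt, response, or tool output
  PRIMARY KEY (session_id, seq),
  FOREIGN KEY (session_id) REFERENCES agent_sessions (session_id)
);
```

Rewind is just a delete of trailing seq values; forking and merging a session are much harder, which is where version-controlled storage could help.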

With agents becoming more commonplace, currently human-scale write workloads will become machine-scale write workloads. This may pose some challenges for multi-version concurrency control (MVCC), the current technology for managing concurrent writes to databases. Agentic writes will be machine-scale, longer in duration, larger in batch size, and may require review. Potentially, MVCC isn't enough. Conflicting writes in MVCC are rolled back, making frequent, long-duration, large-batch writes problematic. More advanced concurrency control techniques that minimize or manage conflicting writes will be necessary.
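
To see the failure mode, consider two sessions updating the same row. This is a generic sketch with an invented table; exact behavior varies by engine and isolation level.

```sql
-- Session A: a long-running agent batch
START TRANSACTION;
UPDATE inventory SET qty = qty - 10 WHERE sku = 'A1';
-- ... many more statements over several minutes ...

-- Session B, concurrently:
START TRANSACTION;
UPDATE inventory SET qty = qty - 5 WHERE sku = 'A1';
COMMIT;

-- Session A:
COMMIT;
-- Under optimistic or serializable MVCC schemes, one of the two writers
-- is rolled back and must retry; the longer and larger the transaction,
-- the more work is thrown away on each retry.
```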

App Builders#

As mentioned above, App Builders need a database for the app the AI is building for you. Neon is the early leader in this market, thanks to its scale-to-zero and versioning capabilities, and is the default database for both Replit and Vercel v0. Neon deploys a standard Postgres database on top of a copy-on-write filesystem. The compute can be detached when not in use, so the user pays only for storage. The copy-on-write filesystem allows for convenient forking of database data, letting the App Builder build on a separate copy while the main application continues to operate as normal.

App builders seem like an obvious use case for a fully version-controlled database. Full version control would enable multiple users to build features on many branches. Rollback to any point in the past would be convenient. The AI could inspect database diffs to debug its mistakes. Database changes could be reviewed along with code changes. If the changes are approved, code and database changes could be merged into production together.
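
In Dolt, that workflow is expressible directly in SQL. A minimal sketch, with invented table and branch names, using Dolt's documented version control procedures:

```sql
-- Build a feature on a branch while production serves from main
CALL DOLT_CHECKOUT('-b', 'add-invoices');
CREATE TABLE invoices (id int PRIMARY KEY, amount decimal(10,2));
CALL DOLT_COMMIT('-Am', 'Add invoices table');

-- Review: what would this branch change?
SELECT * FROM dolt_diff_summary('main', 'add-invoices');

-- Approved: merge schema and data into production alongside the code
CALL DOLT_CHECKOUT('main');
CALL DOLT_MERGE('add-invoices');
```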

Structured Data#

For the agents-for-everything use case, agents could be making writes not only through APIs but also directly to databases. In this case, the more verifiable the result, the more reliable agentic writes will be. Structured data with a schema will be preferred over unstructured, document-style data because data shape constraints can be enforced at write time. Free-form documents lack the constraints necessary to serve agentic writes; any constraints would have to be enforced in APIs. Thus, I see traditional SQL databases regaining some popularity in the AI database space.
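
For example, a schema with constraints rejects a malformed agentic write at the database itself, no API layer required. The tables are invented for illustration:

```sql
CREATE TABLE customers (id int PRIMARY KEY);

CREATE TABLE orders (
  id int PRIMARY KEY,
  customer_id int NOT NULL,
  status enum('pending','shipped','cancelled') NOT NULL,
  quantity int NOT NULL CHECK (quantity > 0),
  FOREIGN KEY (customer_id) REFERENCES customers (id)
);

-- An agent hallucinating a status, a negative quantity, or a
-- nonexistent customer fails immediately at write time:
INSERT INTO orders VALUES (1, 42, 'teleported', -3);  -- rejected
```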

Data Testing#

Agents need tests. Agents perform better with multiple layers of verification, as evidenced by the coding use case. Data testing and quality control become more important for AI databases. Data testing has traditionally lived outside of the database itself; rules on the shape of the data were enforced by the schema and constraints. Tooling in this space could shift as agents require more data testing tools.
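
One simple pattern is to express a data test as a query that must return zero rows, run after each agentic write. This sketch reuses the hypothetical orders table above and checks a business rule no schema constraint can express:

```sql
-- Test: no customer should ever have more than 100 open orders.
-- Any rows returned indicate a failure to investigate.
SELECT customer_id, COUNT(*) AS open_orders
FROM orders
WHERE status = 'pending'
GROUP BY customer_id
HAVING COUNT(*) > 100;
```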

New Interfaces#

Will AI require different interfaces? Model Context Protocol (MCP) exposes well-defined tools to an LLM in order to perform specific tasks. Databases could directly support MCP interfaces alongside traditional interfaces like SQL. As we learn more about which interfaces AI operates on most effectively, I expect innovation in this space.

Permissions#

AI reads and writes may require different permissions compared to traditional database access. If an AI is sending context to an external LLM provider, as is common today, should the AI be restricted from reading personally identifiable information to prevent data exfiltration? Perhaps each agent or agent type will get its own identity, and this identity will cascade down to permissions at the database layer. Are current database permission models expressive enough? At least in SQL, the grants and privileges system is extremely expressive. I think, given the proper identity, that system will suffice.
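
As a sketch of that expressiveness in MySQL-flavored SQL, give an agent its own identity with column-level read access that excludes PII. The database, tables, columns, and password are invented for illustration:

```sql
-- Each agent gets its own identity
CREATE USER 'support_agent'@'%' IDENTIFIED BY 'change-me';

-- The agent can read orders in full...
GRANT SELECT ON shop.orders TO 'support_agent'@'%';

-- ...but only non-PII columns of customers (no name, email, address)
GRANT SELECT (id, created_at) ON shop.customers TO 'support_agent'@'%';

-- Writes only where this agent is supposed to write
GRANT INSERT, UPDATE ON shop.tickets TO 'support_agent'@'%';
```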

Isolation#

Related to permissions is isolation. You may not want an agent anywhere near the production copy of your data. You may want agents operating on copies, physically siloed from your production data, to prevent any chance of a mistake. For read-heavy workflows, agents may generate suboptimal query patterns, so sharing a physical resource with agentic workflows may be problematic. Write isolation is a desirable property for agentic writes, but with traditional database systems, merging isolated writes back into production is difficult. Decentralized version control, like Git provides for files, can facilitate merges of isolated writes.
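
Here is a sketch of what that could look like with Dolt's Git-style procedures: the agent works on a physically separate clone, and a human merges the reviewed branch back into production. Branch and table names are invented:

```sql
-- On the agent's isolated clone of production:
CALL DOLT_CHECKOUT('-b', 'agent/email-cleanup');
UPDATE customers SET email = LOWER(email);
CALL DOLT_COMMIT('-am', 'Normalize email casing');
CALL DOLT_PUSH('origin', 'agent/email-cleanup');

-- On production, after human review:
CALL DOLT_FETCH('origin');
CALL DOLT_MERGE('origin/agent/email-cleanup');
```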

Rollback#

With more frequent writes through potentially error-prone agentic processes, the ability to roll back changes becomes necessary. It already feels anachronistic that operator error on your database can cause data loss or hours of downtime while you restore from an offline backup. Moreover, the need to roll back individual changes while preserving others will become more important as write throughput increases. Agents will experiment with writes and roll them back if they do not pass verification, much as agents writing source code do today.
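
In a version-controlled database like Dolt, both flavors of rollback are a single statement. The commit hash below is a placeholder:

```sql
-- Undo everything back to the last good commit
CALL DOLT_RESET('--hard', 'HEAD~1');

-- Or surgically back out one bad change while preserving later ones
CALL DOLT_REVERT('<bad_commit_hash>');
```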

Audit#

Lastly, some classes of agentic writes will require human review and approval of changes. A human review workflow necessarily requires an AI database to have version control features in order to scale past a handful of concurrent changes. Branched writes, diff, and merge become essential.

Moreover, once changes are accepted, audit becomes an important function for human operators of the system. What changed? Which agent did this? When? Why? There is no human to ask for more context, so the system of record must preserve more metadata about a change. Audit in databases has traditionally been accomplished using a technique called slowly changing dimensions. Version control systems, however, treat audit as a first-class entity. Version-controlled databases obviate the need for slowly changing dimensions.
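
In Dolt, that audit metadata is queryable with plain SQL: the dolt_log system table answers who and when, and the dolt_diff() table function answers exactly what changed. The orders table is our running invented example:

```sql
-- Who (or which agent) committed what, and when?
SELECT commit_hash, committer, date, message
FROM dolt_log;

-- Exactly which rows changed in a table between two commits?
SELECT * FROM dolt_diff('HEAD~1', 'HEAD', 'orders');
```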

AI Needs Version Control#

As I walked through what AI means for databases, you may have sensed a theme: AI needs version control. Surely, the creators of Dolt, the world's only version-controlled SQL database, would believe that. But it's not just us. The founding engineers at Cursor think so. The UC Berkeley computer science department thinks so. Andrej Karpathy, one of the leading thinkers in AI, thinks so.

In order for AI to extend to domains beyond code, AI needs to be operating on version-controlled systems. Code is stored in files. The rest of our data is stored in databases. AI requires version-controlled databases.

Timeline#

Let’s amend yesterday’s AI timeline with some notable database events. As you can see, some key database moments are interwoven with these AI advances.

AI Database Chronology

Products#

Now, let’s review the product options available today that could be AI databases. I focus on open-source, free options.

Vector Databases#

Vector search used to be the domain of specialized databases. But now, almost every database supports vector search, including Dolt with version-controlled vectors. Moreover, I predict vector search won't be a very important aspect of AI systems moving forward, so I'm not going to dive deep into vector database solutions. However, if you are looking for a vector database, check out Milvus, Weaviate, and Qdrant.

Version-Controlled Databases#

Now, let's focus on version-controlled databases. I believe there will be more entrants in this space over time, with a version-controlled option in every database category. Right now, LakeFS offers the only version-controlled OLAP database, and its version control support is somewhat limited. Neon and Dolt are available for OLTP workloads. Again, Neon's version control support is limited.

AI databases

LakeFS#

Tagline: Scalable Data Version Control
Initial Release: August 2020
GitHub: https://github.com/treeverse/lakeFS
SQL Flavor: Apache Hive, Spark SQL

LakeFS is a version control extension for data lakes. It sits on top of many open formats like Parquet or CSV and adds version control functionality to those formats. You get SQL via Apache Hive or Spark SQL.

LakeFS supports branches and merges. LakeFS branches employ "zero-copy branching": a branch stores a pointer to a commit plus a set of uncommitted changes. It's not clear from the documentation how these changes are stored. Given other documentation, I suspect a new version of any changed file is stored, meaning structural sharing is done at the unchanged-file level.

Merge is handled at the file level, which in these formats is more like a table. Files in data lakes can be hundreds of gigabytes. Conflict resolution is a simple "take ours" or "take theirs" at the file level.

Diff between branches is supported via the Iceberg plugin.

Neon#

Tagline: Serverless Postgres
Initial Release: June 2021
GitHub: https://github.com/neondatabase/neon
SQL Flavor: Postgres

Neon is a very popular open-source Postgres variant. Neon separates storage and compute, replacing the PostgreSQL storage layer by redistributing data across a cluster of nodes on a copy-on-write filesystem. Neon is the default database for the popular App Builders, Replit and Vercel v0.

Neon claims to support branches, but there is no mention of merges in their branch documentation. All the text and visuals suggest branches are point-in-time copies and are not designed to be merged. Diff between branches is not supported.

Dolt#

Tagline: Git for Data
Initial Release: August 2019
GitHub: https://github.com/dolthub/dolt
SQL Flavor: MySQL, Postgres (beta)

Dolt is true database version control. Dolt implements the Git model at the database storage layer using a novel data structure called a Prolly Tree. This allows for fine-grained diff and merge of table data while maintaining OLTP query performance.

While Neon claims to have branches, Neon’s branches are really forks. Dolt is the only database with all the version control features you know from Git: branch, merge, diff, push, pull, and clone. These features are essential for an AI database.

Dolt wasn’t built for AI. The code is ten years old. Dolt is older than the transformer, the fundamental building block of generative AI. It just so happens that AI needs Dolt’s features to reach its full potential.

Interested in learning more? Come by our Discord and I’d be happy to give you a demo.
