Agents need tests. Anthropic agrees. Here at DoltHub, we built a database unit testing framework into Dolt to support agents. Test-driven development is back!
How much do tests really matter in agentic development? My experience building DoltLite, a fork of SQLite with Dolt-inspired version control features, shaped my perspective on this question. DoltLite is entirely vibe coded. DoltLite is write-only code. This approach is not possible without tests. In fact, ideas for new test suites may be my human brain’s single biggest value add to the project. This article dives in.
## DoltLite’s Story
DoltLite started as a lark. What could I build in a week using Gas Town, a popular agent orchestrator, with unlimited budget?
Dolt is MySQL-flavored. Doltgres is Postgres-flavored. We want a version-controlled database in every database flavor. Building a new flavor is a heavy lift. We shipped the first version of Doltgres in November 2023 and it’s still in Beta. Could a swarm of agents add Dolt-style version control to SQLite more quickly?
I gave it a try. I thought maybe I would have a SQLite fork with a fragile Prolly Tree-based storage engine and a couple of version control features by the end of the week. I had that by the end of day one. That kind of development speed is intoxicating and I’ve been hooked on improving DoltLite using agents ever since.
DoltLite has been adopted into the DoltHub family of version-controlled database products. It works. Multiple people are using it for various projects. Those users are becoming contributors. We plan on making DoltLite a big part of DoltHub’s future. We always wanted a version-controlled database in every database flavor and agents have made that future possible much sooner than expected.
## New Test Suites Drive Progress…
Testing has been one of the main driving factors in DoltLite’s success. Every time I hit a wall in quality, a new test suite idea drives the next round of improvements.
The existing ~107,000 SQLite acceptance tests ruled day one: swap out btree.h for a Prolly Tree storage engine and make all the SQLite acceptance tests pass. In about five hours, I had a working SQLite with Dolt storage. I was so surprised it actually worked that I had to change the DoltLite interface just to make sure. The tests defined SQLite behavior and the agents made the tests pass for the new storage engine.
Next up was the sysbench test suite. How does the DoltLite engine compare to stock SQLite on larger tables with common query patterns? At first, the answer was "it crashes." After a few weeks, the answer is about 5% slower on reads and 50% slower on writes. This was a heavy lift: DoltLite didn’t initially support multi-level Prolly Trees, and sysbench forced the agents into the correct implementation.
With the SQLite acceptance tests and sysbench test suite looking good, I started to worry about the version control functionality. How could I test it? I instructed the agent to use Dolt as "an oracle" and compare DoltLite results to Dolt for a multitude of version control operations. This was the first agent-generated test suite that found real bugs. Oracle testing was a big breakthrough for me and probably deserves its own blog post. After a few weeks of iteration, I was confident DoltLite’s version control functionality was solid.
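The oracle pattern is simple to sketch. The snippet below is a minimal illustration, not DoltLite's actual harness: both sides here are in-memory SQLite connections purely so the example runs anywhere, where the real suite would point one connection factory at Dolt (or stock SQLite) and the other at DoltLite. The helper names are made up for this sketch.

```python
import sqlite3

def run_script(conn, statements):
    """Execute statements in order; collect sorted rows from each SELECT."""
    results = []
    for sql in statements:
        cur = conn.execute(sql)
        if sql.lstrip().upper().startswith("SELECT"):
            results.append(sorted(cur.fetchall()))
    conn.commit()
    return results

def oracle_check(statements, make_oracle, make_subject):
    """Run the same script against both engines and diff the SELECT output."""
    oracle_rows = run_script(make_oracle(), statements)
    subject_rows = run_script(make_subject(), statements)
    # Any index where the two engines disagree is a candidate bug.
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(oracle_rows, subject_rows))
            if a != b]

script = [
    "CREATE TABLE t (id INTEGER PRIMARY KEY, v TEXT)",
    "INSERT INTO t VALUES (1, 'a'), (2, 'b')",
    "UPDATE t SET v = 'c' WHERE id = 2",
    "SELECT id, v FROM t",
]

# Both factories are stock SQLite here for illustration; in the real suite
# one side would be the oracle and one side would be DoltLite.
print(oracle_check(script,
                   lambda: sqlite3.connect(":memory:"),
                   lambda: sqlite3.connect(":memory:")))  # → []
```

An empty mismatch list means the subject agrees with the oracle on every query; the agents generate the scripts, and any non-empty result is a real divergence to chase down.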
The last two performance issues discovered and fixed were driven by new performance tests: first an autocommit variant of sysbench, then a text-key variant. The agents had optimized around integer-keyed tables under transactional load because that is what sysbench tested. The agents happily took the shortest path to an optimal sysbench score while ignoring other workloads and table shapes. This was fixed just this week.
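For readers unfamiliar with the autocommit distinction, here is a minimal sketch of the two commit disciplines using Python's sqlite3 module. This is not the sysbench workload itself, just the shape of the difference: per-statement commits force durability work on every write, which is exactly the path the original integer-key, single-transaction runs never exercised.

```python
import sqlite3
import time

def insert_rows(conn, n, autocommit):
    """Insert n rows, committing per statement (autocommit) or once at the end."""
    conn.execute("CREATE TABLE IF NOT EXISTS t (id INTEGER PRIMARY KEY, v TEXT)")
    start = time.perf_counter()
    for i in range(n):
        conn.execute("INSERT INTO t VALUES (?, ?)", (i, f"row{i}"))
        if autocommit:
            conn.commit()  # one commit per statement, like the autocommit variant
    if not autocommit:
        conn.commit()      # one commit for the whole batch, like stock sysbench
    return time.perf_counter() - start

one_txn = insert_rows(sqlite3.connect(":memory:"), 1000, autocommit=False)
per_stmt = insert_rows(sqlite3.connect(":memory:"), 1000, autocommit=True)
print(f"single txn: {one_txn:.4f}s, autocommit: {per_stmt:.4f}s")
```

On a file-backed database the gap is far larger than in memory, which is why an engine tuned only against the batched variant can hide a slow commit path.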
As you can see, agentic code is shaped by the tests. Agents work to specification. Specification is defined by the tests.
## …But Lock In Implementation
There is a downside.
Agents are spooked by failing tests. I’ve watched this happen many times, mostly in the prolly_mutmap code as I was trying to drive a major performance change. Claude starts a refactor, breaks 200 tests, and bails out. It reverts the change and writes a comment like “this is too much for one session”.
So tests drive progress, and tests calcify the implementation. The bigger your test surface, the harder it is for an agent to do an architectural rewrite. Agents color inside the lines you’ve drawn. So if your initial implementation is slightly off target, the tests will keep you slightly off target. You can prompt around this limitation but changes become more difficult.
## So Many Tests
Here’s an inventory of DoltLite tests so you get a full picture of how deep DoltLite testing has gotten.
### Existing SQLite
- SQLite smoke tests: ~107,000 pre-existing tests.
- SQLLogicTest: ~5.7M SQL query tests.
### Performance
- Sysbench: 15 read benchmarks and 8 write benchmarks, all within single transactions. In-memory and file-backed.
- Sysbench Autocommit: Same as above but commit every statement.
- Sysbench Text, Blob, and Composite keys: Same as above but with non-integer key types.
### Oracle
- SQLite as Oracle: 900 custom, agent-generated SQL tests.
- Dolt as Oracle: 1,935 custom, agent-generated version control tests.
### Durability
- Crash Testing: DoltLite is durable under specific crash scenarios.
- History Independence: DoltLite produces consistent hashes under different write orders. This is my current focus.
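History independence is easy to illustrate in miniature. The sketch below hashes a canonical (sorted) serialization of a table's rows, so the digest depends only on the final data, never on the order it was written. This is the property under test, not DoltLite's mechanism: DoltLite gets history independence from its Prolly Tree structure rather than from sorting at hash time.

```python
import hashlib

def content_hash(rows):
    """Hash a canonical (key-sorted) serialization of the rows, so the digest
    depends only on the data, not on the order it was written."""
    canon = "\n".join(f"{k}\t{v}" for k, v in sorted(rows.items()))
    return hashlib.sha256(canon.encode()).hexdigest()

# Two "histories": same final data, different write orders.
a = {}
for k, v in [(1, "x"), (2, "y"), (3, "z")]:
    a[k] = v
b = {}
for k, v in [(3, "z"), (1, "x"), (2, "y")]:
    b[k] = v

assert content_hash(a) == content_hash(b)  # history independent
```

A history-independence test suite does exactly this at scale: apply the same logical edits in many different orders and assert the resulting content hashes all agree.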
## Agents Make GitHub Actions Easy
You can even have your tests executed on every Pull Request. This used to be so difficult: YAML, dependency handling, waiting for runners to fire. Here at DoltHub, we’ve spent days getting a single workflow right.
Now it’s a prompt: "Add this test suite as a GitHub Actions workflow." Done in five minutes. The agent gets the YAML right, the gh CLI usage right, the dependencies right. It even gets the markdown PR-comment de-duplication right: find the existing comment by marker and edit it instead of appending a new one. This exact task recently took a trad code engineer here at DoltHub a week to get right.
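The de-duplication logic is worth spelling out. Below is a sketch of the marker-and-edit pattern; the function name, marker string, and in-memory store are hypothetical stand-ins, and the real workflow drives the GitHub comments API via the gh CLI.

```python
MARKER = "<!-- doltlite-benchmark-report -->"  # hypothetical hidden HTML marker

def upsert_comment(comments, body, update, create):
    """Edit the existing marker-tagged PR comment if present; otherwise create
    one. `comments` is a list of {"id": ..., "body": ...} dicts, and
    `update`/`create` stand in for the corresponding API calls."""
    tagged = f"{MARKER}\n{body}"
    for c in comments:
        if MARKER in c["body"]:
            update(c["id"], tagged)   # edit in place: no duplicate comments
            return "updated"
    create(tagged)                    # first run: post a fresh comment
    return "created"

# In-memory stand-in for the GitHub API, purely for illustration.
store = []
create = lambda body: store.append({"id": len(store), "body": body})
update = lambda cid, body: store.__setitem__(cid, {"id": cid, "body": body})

print(upsert_comment(store, "run 1: 5% slower", update, create))  # → created
print(upsert_comment(store, "run 2: 4% slower", update, create))  # → updated
print(len(store))  # → 1
```

The invisible HTML comment marker is the whole trick: it survives markdown rendering, so every benchmark run can find and replace its own report instead of burying the PR in duplicates.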
I have a sysbench benchmark workflow per table-shape category, a workflow for crash testing, a workflow for sanitizers, and a workflow for WASM builds. I’d never have built half of these by hand. GitHub Actions is easy now.
## Big Picture
Writing code is so cheap now.
Let me say it differently. The value of any specific implementation has dropped, fast, in the last twelve months. Two years ago, the implementation was the asset. Now I can ask a team of agents to rewrite DoltLite in an afternoon.
So what’s the asset? The tests. The tests are the spec. The tests encode the edge cases. The tests survive a rewrite. The tests are what you’d hand to a new agent if you threw the implementation in the trash and asked for a rebuild from scratch.
Is this the future of software development? Continually refining the test suite? Competing implementations that satisfy the tests?
## Conclusion
Got an idea for another test suite I should add to DoltLite? A schema-evolution oracle that runs ALTER TABLE sequences against DoltLite and stock SQLite? A fuzz tester over random dolt_* call sequences? A property-based test for Prolly Tree round trips? An agent made those ideas up. I want to hear your ideas. Drop by our Discord and tell me what to test next. I’ve got Claude and Codex max plans to burn.

