A few weeks back, I wrote a blog introducing the agentic AI application we created named DoltCash. DoltCash is a vibe-accounting tool that allows you to bring your own coding-agent, like Claude Code, and get it to write and maintain your GnuCash books for you.

The goal of the project was to experience firsthand the process of building agentic applications backed by Dolt, and, selfishly, to give me a way to do my bookkeeping through vibe-accounting, because I’m a lazy shit when it comes to my finances.
The initial DoltCash article showed the first iteration of DoltCash in depth and highlighted its many, many shortcomings, including its inability to perform even the most basic accounting tasks without totally bungling the books. But I didn’t give up on the project, and I’ve since made some credible progress at improving the tool to actually, kinda-sorta, do my bookkeeping.
In today’s post, I’ll give an update on the status of DoltCash, how I was able to improve its correctness and accuracy, and like last time, share some crucial insights this process has revealed to me along the way.
DoltCash is not a finished product by any means — it’s actually three things duct-taped together that we’re using as a proof of concept.
The three components are a custom GnuCash fork that we modified to support Dolt as a database, the DoltCash MCP server that contains the primary tools and API the coding, er, “accounting” agent uses to interface with GnuCash and Dolt, and finally, the agent itself. I’ve been sticking with Claude Code since building DoltCash, but in theory, any agent that supports MCP would work.
If you need a refresher as to why we embarked on the journey of building an agentic application, you should read or reread Tim’s post, Cursor for Everything, which discusses the future of agentic applications that are now possible with Dolt. This was our inspiration for DoltCash, which aimed to bolt an agentic chat window onto GnuCash, as a prototype for agentic accounting.
Anyway, our initial build of DoltCash, while cool conceptually, really sucked at accounting. It would hallucinate financial information and present it as fact; it would confuse and conflate the basic accounting equation, which was especially bad when trying to process and balance liability transactions; and it could not consistently or effectively use the financial statement processing MCP tools to automatically categorize transactions on the human’s behalf.
For a piece of accounting software, this was all a non-starter. But building such a janky tool did yield some crucial insights:
- Access to resources, like a database or MCP server, is not in itself sufficient for building competent agentic applications.
- Building on Dolt is essential, so you can enable writes and never worry about data loss or corruption.
- Validation of agentic work is the most important piece of the puzzle.
So we decided to persevere on DoltCash to see if we could improve the tool to actually be good (well, at least not-janky). And, turns out, we could! And we did. In the following sections, I’ll describe what I changed about DoltCash that started to yield much more accurate and correct results, as well as the new insights I’ve gained in the latest iteration cycle.
Don’t kneecap your agent

The first thing I’ve learned in my continued work on DoltCash is to “not kneecap your agent”. By this I mean, don’t try to take away, constrain, or control things that your agent is really really good at. In the case of DoltCash v1, I’d added tools to DoltCash MCP that aimed to enable efficient CSV and PDF parsing and processing, but these were not good tools. AI agents, out-of-the-box, are really good at both of these things, and they’re really bad at using my dumb tools. Deleting them from the MCP server instantly made both of our lives easier and better.
And I think this is an important point: if your agent can do something better without a tool, don’t add the tool. In retrospect this seems pretty obvious, but as a former software engineer turned vibe-coder, I’m naturally inclined to touch, control, and intervene in everything my agent does, and I’ve found this yields worse results. It’s much more valuable to either not add the tool at all or add a related tool in the “validation layer”, the surface of the application that validates results. More on this layer below.
Do handicap your agent

Now, this won’t be a problem if you’re building your own bespoke AI agent to do custom work or use your application, because you can fully control the context and system prompts that are hard-coded into the agent’s identity or persona. In this case, you get to decide what behaviors to allow or disallow in your agent.
But, if you’re like me and you want to use an existing agent to work on your use-case that might be orthogonal to its designed use-case, you may have to “handicap” it to get it to do what you want.
In the case of DoltCash specifically, one of the biggest sources of the mistakes and correctness problems I was encountering in v1 was Claude Code wanting to absolutely sprint through its tasks as quickly as possible. For coding, this makes total sense: it needs to be efficient and not bankrupt its human operator with unnecessary tokens, so I presume there are instructions in its system prompt to this effect, aimed at making it move quickly and work in batches.
But in the case of DoltCash, speeding through the processing of financial statements is an anti-pattern that resulted in not just poor categorization but also nonsensical ledger writes that were hard to debug.
So, to address this, I “handicapped” the agent by forcing it (as much as possible) to slow way, way down and process a single transaction at a time. To do this, I removed all batch processing tools, making sure it had only single transaction tools, and also updated the identity prompt I give to the agent to follow my slow-way-down rules more carefully.
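To make the idea concrete, here’s a minimal sketch of what a single-transaction tool might look like, in place of a batch one. The names (`Transaction`, `record_transaction`) and shapes are my assumptions for illustration, not DoltCash MCP’s actual API:

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Transaction:
    date: str          # e.g. "2025-06-01"
    description: str   # payee / memo from the statement line
    amount: float      # signed amount from the statement
    account: str       # the category the agent proposes


def record_transaction(txn: Transaction, ledger: List[Transaction]) -> dict:
    """Write exactly one transaction, then stop and surface it for review.

    There is deliberately no batch variant: one call per statement line
    forces the agent to slow down and present each entry to the human.
    """
    ledger.append(txn)
    return {"recorded": txn.description, "pending_review": True}
```

The point of the design is the missing feature: by not exposing a `record_transactions` (plural) tool, the agent physically cannot sprint through a statement in one shot.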
Doing this tremendously improved DoltCash, and although it takes more time to get through transactions, they are much more likely to be accurate. Also, the work on the part of the human is still minimal and chill — you just need to approve or correct the transactions it presents to you, without doing any manual data entry. This was the experience I was aiming for when conceptualizing DoltCash, and getting Claude to slow down really made it happen.
So don’t be afraid to add friction to your agent’s process in favor of correctness. In my experience, you can always convince it to go faster, but it’s much harder to get it to slow down and stay slow.
Think in tests: enforce actual versus expected

Building off of the “validation is the name of the game” insight I described in part one, I found that I achieved better correctness and accuracy, and enabled the DoltCash agent to self-correct, by modeling things as “tests”, or more specifically as “actual” versus “expected” comparisons.
In DoltCash v1, I would point the agent at a bank statement, and it would report that it had finished writing the statement’s transactions to the GnuCash ledger, only for me to find that it did it all wrong, wrote bad data, or both.
So, to address this, I introduced a workflow for processing bank statements that, at the start, defines the “expected” resulting values of processing that statement, and then, at the end, has the agent compare those with the “actual” values written to the ledger. Essentially, I found another way to create a “test” that could validate work.
This is easily the most important thing you can do in order to get anything meaningful or productive out of AI, and it really is a game-changer. Not only does it allow the human to see concrete evidence that work was done correctly, it actually empowers the agent to iterate on the work until it’s correct, because it will intuitively continue looping until “actual” matches “expected”.
In the case of DoltCash, I created a bank statement preflight tool that registers key information about the state of the account before any transactions are processed: things like the statement period, starting balance, ending balance, and number of transactions. These define the “expected” values, and the ledger, once all transactions in the statement have been processed, represents the “actual” values. To complete the processing of a financial statement, the agent must call another tool, statement.reconcile, which ensures the actual matches the expected and reports any discrepancies.
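The tool names above come from DoltCash, but the internals are not published, so here’s a rough sketch of how the expected-versus-actual check could work; field names and the balance math are my assumptions:

```python
def statement_preflight(statement: dict) -> dict:
    """Record the statement's key facts before any transaction is written.

    These become the "expected" side of the test.
    """
    return {
        "period": (statement["start_date"], statement["end_date"]),
        "starting_balance": statement["starting_balance"],
        "ending_balance": statement["ending_balance"],
        "txn_count": len(statement["transactions"]),
    }


def statement_reconcile(expected: dict, ledger_txns: list) -> dict:
    """Compare the ledger ("actual") against the preflight ("expected").

    The agent keeps iterating on its own work until this returns
    balanced=True, which is what makes the workflow a "test".
    """
    discrepancies = []
    if len(ledger_txns) != expected["txn_count"]:
        discrepancies.append(
            f"transaction count: got {len(ledger_txns)}, "
            f"expected {expected['txn_count']}"
        )
    actual_ending = expected["starting_balance"] + sum(
        t["amount"] for t in ledger_txns
    )
    if round(actual_ending, 2) != round(expected["ending_balance"], 2):
        discrepancies.append(
            f"ending balance: got {actual_ending:.2f}, "
            f"expected {expected['ending_balance']:.2f}"
        )
    return {"balanced": not discrepancies, "discrepancies": discrepancies}
```

If the agent drops, duplicates, or mangles a transaction, the counts or the ending balance won’t line up, and the discrepancy list tells it exactly what to go fix.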
This has also greatly improved the correctness of DoltCash since v1. Thinking in terms of tests, and of what tools you can add to validate work, is what I’m calling the “validation layer”: a surface area for agents that checks the work they’ve done or allows them to check and validate their own work. We recognized that this layer was crucial to agentic work, and it’s one of the major reasons we built Dolt Tests right into Dolt itself. Agents need tests.
The myth of guardrails

The next important thing I’ve learned while building an agentic application is that guardrails are a myth, and it’s a losing battle trying to construct them. By guardrails I mean the limits or constraints you try to erect beforehand to keep your agent in check. They attempt to constrain the behavior of your agent. Though they may seem similar to tests on the surface, tests don’t try to constrain agent behavior, per se. They run on the other side of agent behavior, after it’s already produced its output, and then validate the resulting output. So, though vaguely similar, guardrails and tests occur at different phases of the agentic workflow and try to do different things.
Now, similar to my earlier point, perhaps if you’re building an agent from the ground up for a particular use-case, it’s possible for you to add useful and enforceable guardrails to it that prevent it from doing things you don’t want it to do or acting in ways you don’t want it to act.
But again, if you’re just grabbing an agent off-the-shelf, like me, you have limited to no control.
But fret not, you don’t need control. In the new world of non-deterministic software, control is an illusion, so don’t waste time on it.
In the context of DoltCash, guardrails would mean getting the accounting agent to use the MCP tools I gave it in the exact way I intended for it to use them, at the exact time I think is right for using them, in the exact right order. But agents don’t work like this. They mostly do what you think they will but will also just do some weird stuff. This could mean doing the wrong thing, making something up, or half-assing something that could have been correct if it was full-assed but it wasn’t because they’re too busy, on a break, or concerned about your token usage like they’re your fucking dad.
In any case, for all the ways you come up with to attempt to corral them, they’ll find a new way out of your guardrails. So my point here is, don’t spend time trying to herd these cats. Instead, I think the solution to working with these new non-deterministic entities is to help them find truffles among the shit piles, which I’ll explain more below.
Move context out-of-band

One huge problem I’ve encountered with DoltCash has been agent memory management. Specifically, once the agent’s context is full and is compacted via garbage collection, the agent turns into a lying dunce.
Again, with a bespoke agent, you may be able to avoid this, but with Claude Code as DoltCash’s agent, I recognized that I’d need some way of either frequently reminding the agent of important context it needs to retain in order to successfully perform the work or alternatively, moving crucial context out of band, so it would still be accessible to the agent but not susceptible to compaction or garbage collection.
I didn’t really have an idea for how to do this until I heard about Beads from one of our customers. Beads is marketed as agentic memory and vaguely reminds me of Taskwarrior, but for agents. So I’ve recently been working to move important context and task management onto Beads as a first-class feature of DoltCash, to see if it fixes the memory and context compaction problem I’ve been running into.
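Beads has its own model for this, which I won’t guess at here. But the underlying idea is simple enough to sketch generically: crucial state lives in durable storage the agent can re-read on demand, so compacting the chat window can’t erase it. Everything below (names, file format) is hypothetical, not Beads’ API:

```python
import json
import os


class OutOfBandStore:
    """Durable notes keyed by task, living outside the chat context.

    The agent writes checkpoints here as it works; after a context
    compaction it can re-read them instead of confabulating.
    """

    def __init__(self, path: str):
        self.path = path

    def put(self, task_id: str, note: str) -> None:
        notes = self._load()
        notes[task_id] = note
        with open(self.path, "w") as f:
            json.dump(notes, f)

    def get(self, task_id: str):
        return self._load().get(task_id)

    def _load(self) -> dict:
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)
```

Exposed as an MCP tool pair (put/get), this turns “remember where you were” from a plea in the prompt into a lookup the agent can actually perform.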
I’ll let you know how it goes.
Build tools for truffle-hunting

The last insight I have to share today kinda piggy-backs (pun intended) off of avoiding guardrails and aims to steer you into building your agents to be truffle-hunters.
A “truffle” in this context is a good, valuable piece of work the agent output. And agents can certainly, at times, produce truffles. They can also produce a bunch of shit too, but that’s okay. We just need to get good at helping them find the truffles in the mess they made and move forward with only the mushrooms.
Artists work this same way. They make a bunch of paintings, some good, some shit, then pick the best ones from the bunch for the gallery show. I think we’ll need to embrace this as the standard for agentic work product for now and just get really good at helping the agents identify the good ones.
Practically speaking, how do you facilitate better truffle-hunting? Well, this ties back into the insights I mentioned earlier that I think enable it best: 1) modeling agentic workflows as tests, and 2) finding ways for agents to evaluate their own performance against an objective measure. In my limited experience so far, this is really the only way I’ve found to get useful work out of them.
Conclusion
This project has been really fun to work on, and believe it or not, I find myself really enjoying the vibe-accounting experience. I don’t use many financial software tools myself and don’t have a strong need to get into the weeds of GnuCash for my personal bookkeeping. While building and iterating on DoltCash, I’ve found that I hardly ever have the GnuCash UI open, preferring to work only in the terminal, chatting with the accounting agent about my finances instead of looking at the raw data itself.
Actually, if it were safe, I could imagine myself preferring this type of chat-style interface for even my bank accounts, instead of (only) having the direct account UX/UI to use. Obviously, this is pretty wild thinking… But who knows, maybe a working DoltCash prototype can change that!
If you’re interested in DoltCash or other agentic applications, or want to learn more about how you can build yours with Dolt, drop by our Discord.
