Yesterday morning, 8:47 CEST. The smoke test against the sandbox copy of the enterprise client's accounting platform comes back green. When I started on this project I wasn't sure I could actually pull it off. The ask: automate as much as possible of the CFO's mundane day-to-day manual accounting so she could get her time back.
I spoke about this in another post. 20 hours per month spent doing manual keystrokes into a platform, fiddling with + and - signs, EUR, CHF, USD inputs. Not really what a CFO needs to be doing. In fact, I specifically recall the COO saying: "This person is literally doing things that a 16-year-old could do, but she's the only one who can do it." That sounds exactly like a job to try to automate with AI. I just didn't know if I could do it, because there are a lot of unknowns going into a situation that has: a heavily SEC-regulated environment, a new MSP lockbox on security and permissions, accounting software I had never heard of and knew nothing about, 5 different banking jurisdictions, 5 different entities; the list goes on. I was optimistic, but until I could actually start running tests in a sandbox there was no way to tell, or even to know how long it would take to find out.
All six write methods cleared. Validation errors on each one (entity null, missing date, missing supplier ID), exactly the errors I want. The auth layer passed. The permissions layer passed. The calls reached business logic. Six weeks ago this was a CSV she was hand-building every month from a bank export. The AI operations playbook that got us from there to here has six steps and one rule no model can replace.
Step one was the export file. Sixty-four rows, three months of bank data, every column massaged into the format the accounting platform would accept. That file became the truth. Anything we shipped had to produce a CSV byte-identical to it. This was the easiest part to do.
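"Byte-identical" here is literal, not fuzzy. A minimal sketch of the check, with hypothetical file names (the real paths differ):

```python
from pathlib import Path

# Hypothetical path; the real truth file lives elsewhere.
TRUTH_FILE = Path("truth/april_2025.csv")

def assert_byte_identical(generated: Path, truth: Path = TRUTH_FILE) -> None:
    """Fail loudly unless the generated export matches the truth file byte for byte."""
    got = generated.read_bytes()
    want = truth.read_bytes()
    if got != want:
        raise AssertionError(
            f"{generated} differs from {truth} ({len(got)} vs {len(want)} bytes)"
        )
```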
Step two was three calls of thirty minutes each, one a week. I recorded each one in Granola and kept the transcripts. Then I asked her for a Loom walking through her actual reconciliation workflow inside the platform. Bank rows on the left. Purchase invoices on the right. The manual matching. The memorise function for repeat patterns. Multi-currency advanced posting (USD leaving, EUR applied). Multi-invoice allocation. Until I watched her click through it, I was guessing at the workflow from her email language. For each call, my AI co-pilot drafted the questions it needed answered to build the scripts, and I needed to understand what was being done just as well as it did.
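To pin down what I was actually watching, here is a rough sketch of the shape of one reconciliation match as I understood it from the Loom. Every field name is my own shorthand, an assumption, not the platform's schema:

```python
from dataclasses import dataclass, field
from decimal import Decimal

@dataclass
class ReconciliationMatch:
    """One bank row matched to one or more purchase invoices.

    Field names are my own shorthand for what I saw in the Loom,
    not the accounting platform's actual schema.
    """
    bank_row_id: str
    bank_amount: Decimal             # signed; negative = money leaving
    bank_currency: str               # e.g. "USD" on the bank side
    invoice_ids: list[str] = field(default_factory=list)  # multi-invoice allocation
    posting_currency: str = "EUR"    # e.g. EUR applied while USD leaves
    fx_rate: Decimal | None = None   # needed when the two currencies differ
```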
Step three was the first draft. With those artefacts in hand (the truth CSV, the three call transcripts, the Loom), we built the Python script. The script ran. It produced a CSV that was almost byte-identical to her April truth file. Almost. That's OK; this is the first draft.
Step four was two reviews of the script. The first used Garry Tan's gstack toolkit, a thin-harness, fat-skills approach to development workflows. I leaned hardest on the CSO skill, the security audit pass: input validation, error handling, secrets management, what happens if the bank file is malformed, what happens if a transaction date crosses a quarter boundary. The CSO pass found three things I had not thought through.
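The flavour of checks the CSO pass pushed for, sketched with hypothetical column names; the real script's checks are more extensive:

```python
import csv
from datetime import date
from decimal import Decimal, InvalidOperation
from pathlib import Path

REQUIRED_COLUMNS = {"date", "amount", "currency", "description"}  # illustrative names

def validate_bank_file(path: Path) -> list[str]:
    """Return human-readable problems; an empty list means the file is usable."""
    problems: list[str] = []
    with path.open(newline="") as fh:
        reader = csv.DictReader(fh)
        missing = REQUIRED_COLUMNS - set(reader.fieldnames or [])
        if missing:
            return [f"missing columns: {sorted(missing)}"]
        for lineno, row in enumerate(reader, start=2):  # line 1 is the header
            try:
                Decimal(row["amount"])
            except (InvalidOperation, TypeError):
                problems.append(f"line {lineno}: unparseable amount {row['amount']!r}")
            try:
                date.fromisoformat(row["date"])
            except (ValueError, TypeError):
                problems.append(f"line {lineno}: unparseable date {row['date']!r}")
    return problems
```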
The second was a multi-model review. I built a project brief (what the script does, why, what the inputs and outputs are, what we're worried about) and pasted the whole thing into Gemini 2.5 Pro, then into GPT-5.5. I needed to understand everything the brief said, so there was editing, riffing back and forth with Opus 4.6. Sent to the other models, both produced the same general read but with different blind spots. Gemini caught two off-by-one indexing bugs that had survived every previous test. GPT-5.5 flagged a case I had not considered: what happens when the bank file contains a refund (negative amount) tied to a positive invoice? Both reviews surfaced bugs that were harmless in sandbox testing and would not have been in a live month-end run. Not OK when we're talking about big numbers and trying to make a good first impression.
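The refund case, sketched. The point of the fix was not to be clever: anything with that sign pattern gets pulled out for a human instead of posted like a normal payment. Names here are hypothetical:

```python
from decimal import Decimal

def route_match(bank_amount: Decimal, invoice_amount: Decimal) -> str:
    """Route one bank-row/invoice pair to a posting path.

    Hypothetical guard: a refund (negative bank amount) tied to a
    positive invoice must never post as an ordinary payment.
    """
    if bank_amount < 0 and invoice_amount > 0:
        return "manual_review"  # refund against a positive invoice
    return "auto_post"
```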
Step five was implementing the fixes. Then building the regression tests. Fifty-six tests in 1.5 seconds. The headline three: a byte-identical test against the April truth file, a partial-overlap test against the interim April truth file (nineteen rows match, nineteen for nineteen), and one row-level test on a single April 20th IRS payment where the script has to flip the sign and preserve the description verbatim. Three months of bank data, every transaction reproduced exactly.
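The shape of the three headline tests, assuming pytest and a hypothetical `run_export` entry point; the names and paths are mine, the assertions mirror what the suite checks:

```python
import csv
from pathlib import Path

from exporter import run_export  # hypothetical entry point for the script

TRUTH = Path("truth/april_2025.csv")             # the hand-built April export
INTERIM_TRUTH = Path("truth/april_interim.csv")  # the partial-month snapshot

def _export(tmp_path: Path) -> Path:
    out = tmp_path / "out.csv"
    run_export(month="2025-04", dest=out)  # hypothetical signature
    return out

def _find_irs_row(rows):
    return next(r for r in rows if r["date"] == "2025-04-20" and "IRS" in r["description"])

def test_april_byte_identical(tmp_path):
    assert _export(tmp_path).read_bytes() == TRUTH.read_bytes()

def test_interim_partial_overlap(tmp_path):
    out_rows = list(csv.reader(_export(tmp_path).open()))
    truth_rows = list(csv.reader(INTERIM_TRUTH.open()))
    assert all(row in out_rows for row in truth_rows)  # nineteen for nineteen

def test_irs_payment_sign_and_description(tmp_path):
    got = _find_irs_row(csv.DictReader(_export(tmp_path).open()))
    want = _find_irs_row(csv.DictReader(TRUTH.open()))
    assert got["amount"].startswith("-")              # sign flipped
    assert got["description"] == want["description"]  # preserved verbatim
```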
I'm feeling a little more confident that I'll be able to pull this off. 12 months ago I would have been passing this off to... I'm not even sure who I would have trusted to build something like this, and it would have taken a lot longer to pull off.
Step six was the off-site audit anchor. A separate private GitHub repo, append-only, mirroring every operational session. The clock started Friday. By June close the anchor will have a full month of real history under live conditions. Production cutover waits on the anchor running for a full close, not the test suite passing.
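Mechanically, "append-only" is a small script plus discipline: every operational session log gets copied into the mirror repo, committed, and pushed, and nothing ever rewrites history (no amend, no force-push; branch protection on the remote backs that up). A sketch with hypothetical paths:

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path

ANCHOR_REPO = Path("~/anchors/client-ops").expanduser()  # hypothetical path

def anchor_session(log_file: Path) -> None:
    """Append one session log to the audit repo and push it."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H%M%SZ")
    dest = ANCHOR_REPO / "sessions" / f"{stamp}-{log_file.name}"
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_bytes(log_file.read_bytes())
    subprocess.run(["git", "-C", str(ANCHOR_REPO), "add", str(dest)], check=True)
    subprocess.run(
        ["git", "-C", str(ANCHOR_REPO), "commit", "-m", f"session {stamp}"],
        check=True,
    )
    subprocess.run(["git", "-C", str(ANCHOR_REPO), "push"], check=True)
```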
That is the technical playbook.
Before any of this touches a live month-end close, a CTO-caliber engineer is going to sit down and read every line of the script. Not me, but someone I trust: the person I would have asked to build this manually a year ago. Now I can build it myself with the right tools, but he still does the read. Not run the tests. Not skim the diff. Read it, line by line, with the question "what could go wrong in production that the tests do not catch."
I have one. He has reviewed two pieces of work for this client now: an architecture pass on a research-AI proof of concept, and the QA on this script. If a third use lands inside the next month, I formalise it: name him as the external CTO and QA reference, fix a rate, slice it out of the retainer. I'm calling him my CTO now, instead of my mate, friend, a guy I know. Much cleaner that way.
Fractional AI ops is not "I built it with AI, ship it." Fractional AI ops at the production end is: I built it with AI, three other models reviewed it from different angles, the test suite proves it matches the truth, and one human who has shipped real production code for twenty years still reads every line before it goes live. That's the basis for a comfortable answer to the question: how confident are you that this is going to work?
This month, May, runs manually like every other month before it. Mid-May, two to four small live USD payments (maybe more) run through the script in parallel, hand-paced, under CFO sign-off, while the manual work keeps being done the old way. June close is the first cutover candidate. Same human in the loop. Same anchor. Same line-by-line review on any change to the script.
Three months of vibe coding had me thinking AI could replace the senior engineer. It can't. AI made it possible to ship in three weeks something that would have taken a six-month engagement only a year ago. AI did not make the human reviewer optional. It made the human reviewer cheaper, because the human is now reviewing tested code with a known truth file, not building from scratch.
Monthly Revenues $11,800 | Clients 2 | Prospects 1 (a different one than before) | Employees - Still just me. For now.
Day 45 of 365.