Every time I run the ai controller, the number of tests add up. Which is fine while we are running books, but as per our own rules I run an audit once a month on all of the projects I'm working on. Currently, we have three hundred and eighty-four user tests that are running to test the ai controller. It sounded high, and we hadn't done an audit for a while, so I ran one.
So I went looking for junk to cut. Most of the talk about testing ai generated code is about whether to write any at all. Mine was the opposite question. I thought I had too much, and I wanted to take some out.
The thing I went hunting for first was conflict. This client has changed how it treats certain transactions many times over the last few months, the way any real business does once it sees its own books clearly. Every change is a rule in the code, and every rule has tests. My worry was that somewhere in there, two tests now disagreed: one still checking the old behaviour, one checking the new, both passing, both wrong about the truth. That is the quiet way a test suite rots. It keeps reporting green while it slowly stops meaning anything.
I read every spot where a rule had changed. I did not find a single contradiction. Every time a treatment changed, the old test had been rewritten in place, not left behind to argue with the new one. The suite was honest with itself. That was the good news, and after twenty minutes of reading it was a relief.
Then I found the real problem, and it was not extra tests. It was a hole.
I had nearly four hundred tests, and the two pieces of code whose entire job is to stop a bad write to a live financial ledger had none. Zero. One of them refuses to run a live posting unless I pass an explicit, typed confirmation, so I can never fat-finger real money in a client's books by accident. The other reads the live ledger before I post anything and checks that nobody, me or the client, already entered the line I am about to enter, so I never double-post. These are the two functions standing between a normal Tuesday and a phone call I do not want to make. Both were completely untested.
I had spent months testing the comfortable parts, the date parsers and the classifiers, the bits that are easy to write tests for because they are easy to reason about. The genuinely dangerous part, the code that touches the live books, I had left bare. Not on purpose. That is just where the easy tests are.
The risk is not that you write too many tests. The risk is that you and the model together write a big comfortable pile of them on the parts that were simple, and write nothing for the one function that can actually hurt you. The number at the bottom of the terminal tells you nothing about that. A green run on the wrong tests is worse than no tests, because it talks you into trusting the thing.
I wrote nineteen new tests, all on the two guards. My total went up, not down. I came in to cut and I added, the opposite of what I expected when I sat down.
I have been here before with this same system. A few months ago it missed half a bank statement in one month, and I wrote about what that taught me. The fix then was not less human involvement. It was more, in exactly the right place. Same answer on Saturday. The AI helped me write the code, and it helped me write most of the tests. It will not tell me which untested function is the one that loses me the client. That judgement is what an enterprise actually pays a two-person AI ops team for.
That means the audit was worth it, and it still stays in my monthly ai to do. Anyone can get an AI to write the code. The work is knowing where it is quietly lying to you.
Monthly Revenues $11,000 | Clients 2 | Prospects (AI outbound agent now live) | Team: Me + part time Jan (CTO)
Day 94 of 365.