Deploying AI Agents Safely: 18 Field Findings

Yesterday the code review document came through from Jan, the CTO. A page and a half. Eighteen findings.

Six high-severity. Seven medium. Five low. Plus a comment-density note Jan added on WhatsApp the same day. The build under review was Phase 1 of the AI Financial Controller pillar I wrote up four days ago: an AI agent posting transactions into a live general ledger for an enterprise client. Real money. Real cycles. Jan is the CTO I hired. His standing job is to read my code, find what is wrong, and tell me.

The build thus far has been me, a not-so-technical guy, writing Python scripts that use AI to automate the manual, mundane parts of the CFO's workflow. I've been building this, cross-checking the work against other models, and running security skills over it.

Still, a human in the loop caught some things.

Deploying AI agents safely is the pillar this whole 365-day build-in-public campaign is tilting toward. Gloss over it, puff it up, or skip the slips, and it reads like theory and isn't much help. Eighteen specific findings are far more believable.

Three of the highs got shipped the same day. Tests behind every one.

One. The matcher uniqueness fix.

The agent finds open invoices to match against incoming bank payments. The old code walked the invoice list, applied a predicate (amount + payee + date), and took the first match. If two invoices both matched the predicate, the code silently picked one of them and posted. Jan's finding: if two open invoices match the same payment, the agent cannot know which is right. The right move is to route the payment to an ambiguous-exception queue and let a human pick.

The rewrite collects every invoice that survives the predicate and counts the survivors. If more than one remains, it routes the payment to exceptions with every candidate listed. The agent never guesses. The exception goes into a queue someone reads. Routing ambiguity to a human is what gets the CFO to trust the system, and ambiguous cases should be rare enough that reviewing them by hand as they come in is worth it. Once a case is resolved and recorded, any repeatable rule behind the decision can be codified. Edge cases, especially new or poorly understood ones, still go to the human CFO to decide.
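A minimal sketch of the collect-then-count idea, not the production code. The names here (Invoice, Payment, MatchResult, matches, match_payment) and the seven-day date window are assumptions made up for illustration. The post only says the predicate uses amount, payee, and date.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    payee: str
    amount: Decimal
    due: date


@dataclass(frozen=True)
class Payment:
    payee: str
    amount: Decimal
    received: date


@dataclass(frozen=True)
class MatchResult:
    status: str        # "matched", "unmatched", or "ambiguous"
    candidates: tuple  # every invoice that survived the predicate


def matches(invoice: Invoice, payment: Payment, window_days: int = 7) -> bool:
    """Hypothetical predicate: same payee, same amount, dates within a window."""
    return (
        invoice.payee == payment.payee
        and invoice.amount == payment.amount
        and abs((payment.received - invoice.due).days) <= window_days
    )


def match_payment(payment: Payment, open_invoices: list) -> MatchResult:
    # Collect every survivor instead of stopping at the first hit.
    survivors = tuple(inv for inv in open_invoices if matches(inv, payment))
    if len(survivors) == 1:
        return MatchResult("matched", survivors)
    if not survivors:
        return MatchResult("unmatched", survivors)
    # Two or more candidates: never guess, hand the full list to a human.
    return MatchResult("ambiguous", survivors)
```

The caller posts only when status is "matched". An "ambiguous" result carries the whole candidate list, so whoever reads the exception queue sees every invoice the payment could belong to.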

Six new tests behind the fix. Two are matching-cases that should resolve cleanly. Four are the ambiguity cases that used to fail silently. All green.

Two. The supplier pre-flight.

Before the agent posts a transaction, it checks the supplier code against the master list. If the code is not in the master, the agent stops. That much was already in. The finding: two known-verified supplier codes did not appear in the master read because of a sync delay, which meant legitimate postings were being blocked. The instinctive fix is to skip the pre-flight when the code looks plausible. That instinct is wrong.

The actual fix is two parts. One, a small set of known-verified codes that augment the master read. The agent will accept those even if the master sync is late. Two, an explicit override parameter for anything outside both lists, which downgrades the block to an audit-logged warning. The warning goes into the audit trail with the reviewer, the timestamp, and the override reason. Loud, not silent.
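A rough sketch of how the two parts could fit together. Everything named here is hypothetical: the placeholder codes in KNOWN_VERIFIED, the Override type, the preflight_supplier function, and the list standing in for the audit trail. The real audit log and supplier codes will look different.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Placeholder values; the real verified codes are not given in the post.
KNOWN_VERIFIED = frozenset({"SUP-0001", "SUP-0002"})


class SupplierBlocked(Exception):
    """Raised when a supplier code fails the pre-flight with no override."""


@dataclass(frozen=True)
class Override:
    reviewer: str
    reason: str


def preflight_supplier(code: str, master_codes: set, audit_log: list,
                       override: Optional[Override] = None) -> bool:
    # Part one: the master read, augmented by the known-verified set.
    if code in master_codes or code in KNOWN_VERIFIED:
        return True
    # No override supplied: the block stays a hard stop.
    if override is None:
        raise SupplierBlocked(f"supplier code {code!r} not in master or verified set")
    # Part two: an explicit override downgrades the block to a logged warning.
    audit_log.append({
        "event": "supplier_preflight_override",
        "level": "warning",
        "supplier_code": code,
        "reviewer": override.reviewer,
        "reason": override.reason,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return True
```

The design choice worth noting: there is no code path that passes an unknown supplier without either raising or writing an audit entry. A test asserting on the audit entry is the one that proves the override stays loud.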

Seven new tests. The audit-log assertion is the one that matters.

If you're a C-suite exec thinking about hiring an AI-deployment-as-a-service team, you might like to see how a potential team solves enterprise-level problems. Safely, with AI.

Three. The credential scrubbing.

When an OAuth token call fails (refresh denied, expired, network), the agent was writing the failure into the audit log with the full exception message attached. The exception message sometimes contained the client_secret or the access token, because some libraries put those in the failure text for debugging. The audit log is a git-pushed, off-site repo. The worst case was a leaked credential in the chain, in a place git would remember forever.

The fix is a small helper that classifies the exception, captures the HTTP status code, and scrubs client_id, client_secret, access_token, and refresh_token from the message before it lands. Wired into the three token paths. Eight new tests, including one end-to-end that pushes a fake failure and inspects what the audit log received. Nothing in clear text.
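One way such a helper might look, as a sketch under stated assumptions. The function names (scrub, describe_failure), the status_code attribute on the exception, and the classification labels are all made up for illustration. The regex only catches key=value and key: value shapes, so a bare token with no key in front of it would need a different approach.

```python
import re

SENSITIVE_KEYS = ("client_id", "client_secret", "access_token", "refresh_token")

# Matches forms like client_secret=abc, "access_token": "abc", refresh_token: abc
_PATTERN = re.compile(
    r'(?P<key>' + "|".join(SENSITIVE_KEYS) + r')'
    r'(?P<sep>["\']?\s*[=:]\s*["\']?)'
    r'(?P<value>[^\s"\'&,}]+)'
)


def scrub(message: str) -> str:
    """Replace the value after any sensitive key with a redaction marker."""
    return _PATTERN.sub(
        lambda m: f"{m.group('key')}{m.group('sep')}[REDACTED]", message
    )


def describe_failure(exc: Exception) -> dict:
    """Build the audit-log payload for a failed token call."""
    # Assumption: the HTTP library attaches the status as exc.status_code.
    status = getattr(exc, "status_code", None)
    if isinstance(exc, (ConnectionError, TimeoutError)):
        kind = "network"
    elif status in (400, 401, 403):
        kind = "denied"
    else:
        kind = "unknown"
    # Only the scrubbed text is returned; the raw message never leaves here.
    return {"kind": kind, "http_status": status, "message": scrub(str(exc))}
```

Each token path would log describe_failure(exc) rather than the exception itself, so the raw message never reaches the audit log.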

Twenty-one new tests. The suite went from one hundred eighteen passing tests to one hundred thirty-nine. Eight tenths of a second to run.

Not "we follow best practices." Not "AI is safety-first at our company." Eighteen specific failure modes named by a human who is paid to find them, ranked by what would break first, fixed with tests behind every fix.

I cannot get all eighteen done this week. The fourth high-severity needs a sandbox round-trip the client has not authorised yet. The five lows will queue behind whatever ships in the next two weeks. The medium-severity comment-density cleanup ran today in fourteen passes through one file.

Independent reviewer. Findings ranked by what hurts most. Shipped in order. Tests behind every fix.

A scripted automation breaks loudly. You see it. You fix it.

An AI agent breaks quietly. It picks the wrong invoice. It posts to a supplier code that should not exist. It writes a credential into an audit log a thousand engineers will be able to read later.

The work is not done with a manifesto. It is done with code review, severity-ranked findings, a person whose only job is to find the thing that is wrong, and the discipline to ship the fix the same day with a test sitting behind it.

Yesterday Jan found eighteen. Today three are done, with the rest being worked on this week.

Day 53 of 365.

Monthly Revenues $11,800 | Clients 2 | Prospects (AI marketing employee live in 7 days) | Team: Me + Jan (CTO)

Eighteen Findings, Three Ships: Deploying AI Agents Safely in a Live Build

AI Deployment as a Service. One workflow at a time.