Agents Shouldn't Improvise: The Case for a Verified Task Catalog

The Demo That Doesn't Survive Contact With an Auditor

Every enterprise has now seen the demo. An agent is handed a goal in natural language, a bundle of tool schemas, and credentials to real systems. It reasons out loud, composes API calls on the fly, chains them together, and — in the demo — gets the right answer. The room is impressed. Then someone from internal audit asks the only question that matters for production: when it does this next quarter, against live data, how will we prove what it did and why?

The honest answer, for most agent frameworks, is a transcript. A log of what the model said it was doing, stapled to a log of HTTP calls, with the connection between intent and effect left as an exercise for the investigator. That is not an audit trail. That is a diary. And the gap between a diary and an audit trail is the gap between a demo agent and a production agent.

The fix is not a better model or a longer prompt. It is an architectural decision about what the agent is allowed to execute in the first place. Production agents should not improvise operations against your systems of record. They should select from a catalog of verified, replayable tasks — in Own360's registry, 268 single-app tasks and 42 cross-app workflows, every one of them permission-scoped, contract-tested, and audit-logged.

Improvisation Is the Wrong Kind of Freedom

The improvisational pattern feels powerful because it is maximally general: give the model raw API access and it can, in principle, do anything. But "can do anything" is precisely the property that makes a system unshippable. A language model's output is non-deterministic under ordinary sampling settings. Given the same goal twice, it may compose different call sequences: usually equivalent, occasionally not, and the occasional case is the one that ends up in an incident review.

When the improvising agent misbehaves, there is no artifact to fix. You cannot patch a call sequence that was invented at runtime; you can only adjust the prompt and re-roll the dice. On-call engineers know exactly what this means: a system whose failure modes cannot be enumerated, reproduced, or regression-tested. Every remediation is a vibe.

An improvised action cannot be re-run, because it was never a thing — it was a moment. Catalogs turn agent behavior from moments into artifacts.

The catalog inverts the freedom. The model keeps its judgment — which tasks, in what order, with what parameters — but the operations themselves are fixed, named, versioned artifacts. When something goes wrong, there is a specific task with a specific version that did a specific thing, and it can be replayed, tested, and fixed like any other piece of software.

Fig 1 — Two execution paths. The model reasons in both; only one produces artifacts an auditor can stand on.

What "Verified" Actually Means

"Catalog" undersells it if you picture a list of tool names. A verified task in Own360's registry is a contract-bound unit of execution with four properties.

It is replayable against the contract. Each task targets a typed operation on the application's contract — not a URL someone found in the docs. On every release, the task is replayed against that contract: same inputs, expected outputs, expected side effects. If an application changes in a way that would break the task, the verification fails before an agent ever runs it in production. This is the same discipline as a regression suite, applied to agent capability.
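
A minimal sketch of what replaying a task on every release could look like, assuming a task is a plain function and a replay case is a recorded triple of inputs, expected output, and expected side effects. The names `ReplayCase`, `TaskResult`, `verify_task`, and the toy `three_way_match` body are illustrative assumptions, not Own360's actual API.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List


@dataclass(frozen=True)
class ReplayCase:
    """One recorded example: inputs plus the output and side effects we expect."""
    inputs: Dict[str, Any]
    expected_output: Dict[str, Any]
    expected_side_effects: List[str]


@dataclass(frozen=True)
class TaskResult:
    output: Dict[str, Any]
    side_effects: List[str]


def verify_task(run: Callable[[Dict[str, Any]], TaskResult],
                cases: List[ReplayCase]) -> List[str]:
    """Replay every case and return a list of human-readable failures."""
    failures = []
    for i, case in enumerate(cases):
        result = run(case.inputs)
        if result.output != case.expected_output:
            failures.append(f"case {i}: output mismatch")
        if result.side_effects != case.expected_side_effects:
            failures.append(f"case {i}: side-effect mismatch")
    return failures


# Hypothetical task body, standing in for a typed contract operation.
def three_way_match(inputs: Dict[str, Any]) -> TaskResult:
    matched = inputs["po_total"] == inputs["receipt_total"] == inputs["invoice_total"]
    effects = ["invoice.status=approved"] if matched else ["invoice.status=held"]
    return TaskResult(output={"matched": matched}, side_effects=effects)


CASES = [
    ReplayCase({"po_total": 100, "receipt_total": 100, "invoice_total": 100},
               {"matched": True}, ["invoice.status=approved"]),
    ReplayCase({"po_total": 100, "receipt_total": 90, "invoice_total": 100},
               {"matched": False}, ["invoice.status=held"]),
]
```

A release gate would then refuse to ship any task for which `verify_task` returns a non-empty list, which is the regression-suite discipline described above.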

It is permission-scoped. The task declares the RBAC scope it requires, and agents run under the same RBAC as humans. An agent invoking a task must hold a role a human could hold, with the same approval chain behind it. There is no "agent superuser" lurking under the abstraction.
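
One way to picture "the same RBAC as humans" is a single authorization check that does not care whether the caller is a person or an agent. The sketch below assumes scopes are plain strings; `Principal`, `TaskSpec`, `authorize`, and the scope names are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Principal:
    """A human or an agent; both go through the same check."""
    name: str
    kind: str                 # "human" or "agent", informational only
    scopes: FrozenSet[str]    # scopes granted through the principal's roles


@dataclass(frozen=True)
class TaskSpec:
    task_id: str
    required_scopes: FrozenSet[str]


class PermissionDenied(Exception):
    pass


def authorize(principal: Principal, task: TaskSpec) -> None:
    """Raise unless the principal holds every scope the task declares."""
    missing = task.required_scopes - principal.scopes
    if missing:
        raise PermissionDenied(
            f"{principal.name} lacks {sorted(missing)} for {task.task_id}")


# Hypothetical scope names, for illustration only.
match_task = TaskSpec("books.invoice.three_way_match",
                      frozenset({"invoice:read", "invoice:approve"}))
ap_agent = Principal("ap-agent", "agent",
                     frozenset({"invoice:read", "invoice:approve"}))
intern = Principal("intern", "human", frozenset({"invoice:read"}))
```

Note that `authorize` never branches on `kind`: there is no code path where being an agent widens what is allowed.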

It is audit-logged as a unit. When a task runs, the audit record captures the task identity and version, the invoking agent, the inputs, the resulting diff, and the reversal handle. Because the unit of execution is named and versioned, the log entry is meaningful on its own — "books.invoice.three_way_match v1.4 ran with these inputs and produced this diff" — rather than a heap of HTTP calls awaiting forensic reconstruction.
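
The audit record described above could be sketched as a single immutable structure, with the diff computed from before and after snapshots. The field names and the helpers `diff_state`, `record_run`, and `to_json` are assumptions chosen to mirror the prose, not a documented schema.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Any, Dict


@dataclass(frozen=True)
class AuditRecord:
    task_id: str
    task_version: str
    agent: str
    inputs: Dict[str, Any]
    diff: Dict[str, Any]
    reversal_handle: str
    recorded_at: str


def diff_state(before: Dict[str, Any], after: Dict[str, Any]) -> Dict[str, Any]:
    """Return only the keys whose values changed, with both sides recorded."""
    changed = {}
    for key in sorted(set(before) | set(after)):
        if before.get(key) != after.get(key):
            changed[key] = {"before": before.get(key), "after": after.get(key)}
    return changed


def record_run(task_id: str, task_version: str, agent: str,
               inputs: Dict[str, Any], before: Dict[str, Any],
               after: Dict[str, Any], reversal_handle: str) -> AuditRecord:
    """Build one self-describing record for one task run."""
    return AuditRecord(
        task_id=task_id,
        task_version=task_version,
        agent=agent,
        inputs=inputs,
        diff=diff_state(before, after),
        reversal_handle=reversal_handle,
        recorded_at=datetime.now(timezone.utc).isoformat(),
    )


def to_json(record: AuditRecord) -> str:
    """Serialize with sorted keys so records are stable to compare and store."""
    return json.dumps(asdict(record), sort_keys=True)
```

Because the record names the task and version, a reviewer can read it without reconstructing anything from raw HTTP traffic.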

It is reversible where reversal is possible. Tasks that mutate state carry their undo. That single property changes the risk conversation: the question stops being "what if the agent is wrong?" and becomes "how quickly do we detect and reverse it?" — a question operations teams already know how to answer.
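
"Tasks that mutate state carry their undo" can be illustrated with a mutation that returns its own reversal. This is a toy sketch over an in-memory dictionary; the function name `set_invoice_status` and the ledger shape are hypothetical, and a real reversal handle would more likely be a stored reference than a closure.

```python
from typing import Callable, Dict

Ledger = Dict[str, str]
Undo = Callable[[], None]

_MISSING = object()  # sentinel: the key did not exist before the change


def set_invoice_status(ledger: Ledger, invoice_id: str, new_status: str) -> Undo:
    """Apply the change and return a function that restores the prior state."""
    previous = ledger.get(invoice_id, _MISSING)
    ledger[invoice_id] = new_status

    def undo() -> None:
        if previous is _MISSING:
            ledger.pop(invoice_id, None)   # the entry was created by this task
        else:
            ledger[invoice_id] = previous  # put the old value back

    return undo
```

Calling the returned `undo` after a bad run restores the ledger, which is what turns "what if the agent is wrong?" into a detection-and-reversal question.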

Fig 2 — A catalog entry is a contract-bound, permission-scoped, replayable unit — software, not vibes.

The Model Chooses; the Runtime Executes

The predictable objection: doesn't a catalog neuter the intelligence? If the agent can only do 310 things, why bother with a frontier model?

Because the intelligence was never supposed to live in the API calls. It lives in the judgment — reading a messy situation, deciding which tasks apply, sequencing them, choosing parameters, knowing when to stop and hand off to a human. In Own360's architecture, OwnAgents reason over the catalog the way a competent employee reasons over their actual authority: creative about the plan, constrained in the actions. The model chooses and sequences tasks. It does not invent endpoints. This is the practical resolution to the tension between deterministic workflows and non-deterministic agents: non-determinism is confined to the planning layer, where it is valuable, and excluded from the execution layer, where it is dangerous.
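
The split between planning and execution can be sketched as a runtime that accepts a model-proposed plan only if every step names a catalog entry. In this sketch the catalog is a dictionary keyed by task id and version; `Step`, `execute_plan`, `UnknownTask`, the handler bodies, and the `books.invoice.hold` entry are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Tuple

Handler = Callable[[Dict[str, Any]], Dict[str, Any]]

# Catalog keyed by (task id, version). The handler bodies are stand-ins.
CATALOG: Dict[Tuple[str, str], Handler] = {
    ("books.invoice.three_way_match", "1.4"): lambda p: {"matched": True, **p},
    ("books.invoice.hold", "1.0"): lambda p: {"held": True, **p},  # hypothetical
}


@dataclass(frozen=True)
class Step:
    task_id: str
    version: str
    params: Dict[str, Any]


class UnknownTask(Exception):
    pass


def execute_plan(plan: List[Step]) -> List[Dict[str, Any]]:
    """Validate the whole plan first; nothing runs if any step is unknown."""
    for step in plan:
        if (step.task_id, step.version) not in CATALOG:
            raise UnknownTask(f"{step.task_id} v{step.version} is not in the catalog")
    return [CATALOG[(s.task_id, s.version)](s.params) for s in plan]
```

The model is free to propose any ordering and any parameters, but a step such as a raw endpoint path is rejected before anything touches a system of record.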

Composition keeps the ceiling high. From 268 single-app tasks and 42 cross-app workflows, the space of valid sequences is enormous — more than any team will use — but every point in that space is made of verified moves. The floor is fixed; the combinations are not.
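
To put a rough number on "enormous", the sketch below counts ordered sequences drawn from the 310 catalog entries. It is an upper bound under a simplifying assumption that any entry may follow any other and repeats are allowed; permission scopes and contracts would rule many of these out in practice.

```python
SINGLE_APP_TASKS = 268
CROSS_APP_WORKFLOWS = 42
CATALOG_SIZE = SINGLE_APP_TASKS + CROSS_APP_WORKFLOWS


def sequence_upper_bound(catalog_size: int, max_length: int) -> int:
    """Count ordered sequences of length 1..max_length, repeats allowed."""
    return sum(catalog_size ** k for k in range(1, max_length + 1))
```

Even for plans of at most three steps, `sequence_upper_bound(CATALOG_SIZE, 3)` comes to just under 30 million candidate sequences, each built entirely from verified moves.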

The catalog is the floor, not the ceiling. But the floor is what auditors, risk teams, and on-call engineers stand on.

Fig 3 — Each layer narrows what the layer above proposed. The model's freedom is preserved where it is safe and removed where it is not.

How the Catalog Grows Without Growing the Risk

A fixed catalog would eventually become the bottleneck, so the growth path matters as much as the current count. The safe pattern is promotion, not improvisation: a candidate task is authored against the application contract, exercised in non-production environments, replayed until its behavior is characterized, reviewed for permission scope, and only then promoted into the verified registry — where all six OwnAgents can draw on it immediately.
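
The promotion path reads naturally as a one-way state machine in which a candidate becomes visible to agents only at the final stage. The stage names below paraphrase the steps in the paragraph above; `Stage`, `Candidate`, `advance`, and `agent_visible` are hypothetical names, and requiring a piece of evidence per step is an assumption of this sketch.

```python
from dataclasses import dataclass, field
from enum import IntEnum
from typing import List


class Stage(IntEnum):
    AUTHORED = 1              # written against the application contract
    EXERCISED_NONPROD = 2     # run in non-production environments
    REPLAY_CHARACTERIZED = 3  # replayed until behavior is characterized
    SCOPE_REVIEWED = 4        # permission scope reviewed
    VERIFIED = 5              # promoted into the registry


@dataclass
class Candidate:
    task_id: str
    stage: Stage = Stage.AUTHORED
    evidence: List[str] = field(default_factory=list)


def advance(candidate: Candidate, evidence: str) -> Stage:
    """Move exactly one stage forward, and only with evidence attached."""
    if candidate.stage is Stage.VERIFIED:
        raise ValueError(f"{candidate.task_id} is already verified")
    if not evidence.strip():
        raise ValueError("each promotion step needs evidence")
    candidate.evidence.append(evidence)
    candidate.stage = Stage(candidate.stage + 1)
    return candidate.stage


def agent_visible(candidates: List[Candidate]) -> List[str]:
    """Only fully verified tasks are offered to agents."""
    return [c.task_id for c in candidates if c.stage is Stage.VERIFIED]
```

Because stages cannot be skipped, there is no shortcut from "someone wrote it" to "agents can run it".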

Notice what this does to the economics of trust. In the improvisational model, every new agent behavior is a fresh risk assessment, because nothing learned about yesterday's behavior transfers. In the catalog model, verification is paid once per task and amortized across every agent, every team, and every subsequent run. The registry compounds. It also becomes the natural place to answer the governance question "what can our agents actually do?" — with an enumerable list rather than a shrug, which is the foundation both agent observability and agent governance are built on.

Why Improvised Tool-Use Fails Audits

Strip away the architecture and this is ultimately an evidence problem. An audit — SOX, ISO, internal — asks three questions about any actor that changes financial or operational records: what was it authorized to do, what did it actually do, and can you demonstrate the controls that kept those two aligned?

Improvised agents fail all three. Authorization is whatever the credentials allow, which is usually far broader than anything intended. Actions are reconstructible only by correlating model transcripts with network logs. And the control demonstration collapses, because there is no enumerable set of behaviors to test controls against: you cannot draw a representative sample from a population you cannot enumerate. The auditor is not being conservative when they refuse to sign off on this; they are being accurate.

Catalog agents pass the same three questions in the same shape as human evidence. Authorized: these tasks, this RBAC scope, granted through this approval flow. Performed: these task runs, these versions, these inputs and diffs. Controlled: verification on every release, per-run audit records, reversal handles — replayable on demand. The audit trail stops being the awkward appendix of your agent program and becomes its strongest asset.
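
With named tasks, the comparison between "authorized" and "performed" becomes a mechanical reconciliation. The sketch below assumes grants are a mapping from agent to permitted task ids and that verified releases are a set of (task id, version) pairs; `Run`, `unauthorized_runs`, `unverified_runs`, and the agent name are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Run:
    agent: str
    task_id: str
    task_version: str


def unauthorized_runs(grants: Dict[str, Set[str]], runs: List[Run]) -> List[Run]:
    """Performed but not authorized: the task was never granted to this agent."""
    return [r for r in runs if r.task_id not in grants.get(r.agent, set())]


def unverified_runs(verified: Set[Tuple[str, str]], runs: List[Run]) -> List[Run]:
    """Performed but not controlled: the task version never passed verification."""
    return [r for r in runs if (r.task_id, r.task_version) not in verified]
```

An empty result from both functions over a period's run log is the kind of evidence an auditor can sample and re-check, because the population of runs is finite and named.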

The industry keeps framing agent maturity as a model problem: wait for the next generation and the reliability will come. The auditors have the better frame. Reliability you cannot evidence does not exist for enterprise purposes, and evidence is an architecture, not an emergent property. Give the model its judgment. Give production a catalog.

Own360's six agents draw on 268 verified single-app tasks and 42 cross-app workflows — permission-scoped by OwnCentral, replayable against contract, every run audited and reversible.
