Late on a Saturday I had two windows of Claude open side by side, each running on its own. The one on the left had spun up twenty-one autonomous agents; an hour and sixteen minutes and three point six million tokens later, it reported that it was finished. The one on the right: fourteen agents, forty-six minutes, one point nine million tokens. I had not written a line of code in either. Both showed the same headline numbers – output satisfaction one hundred percent, review coverage one hundred percent. I was, briefly, delighted.


That last number is the whole story. The model was never the bottleneck – spinning up twenty agents is a button press, and three million tokens is a rounding error. What stood between me and shipping a confident, half-finished result was not a cleverer model. It was that every agent's output had been reviewed, against a definition of done the agents themselves did not hold. The requirements were doing the work the model could not.
Article 03 gave you the machine: the lineage from search, the four-layer stack, the lifecycle. This is the discipline of pointing that machine at your real systems without it quietly costing you a fortune – who stays in the loop, how to scope the work, and the things an agent cannot run without.
- Why most agent failures are a scoping problem rather than a model problem, and the three roles every production system needs (one of which teams keep forgetting).
- How Domain-Driven Design quietly solved "what is this agent allowed to do" years before agents existed.
- The seven requirements an agent cannot work without, and the five kinds of work you should never hand one.
Three Roles. Not Two.
Most teams structure AI deployment around two roles: the AI system and the human using it. That model fails in production at a predictable rate. The failure point is almost always the same: no one defined who reviews what before it mattered. By the time it matters, the cost is already paid.
There are three roles, and all three apply whether you are deploying a single Copilot instance or orchestrating ten autonomous coding agents across a sprint.
Domain-Driven Design and Agent Scope
The question that kills most agent deployments is not "which model should I use?" It is: "what exactly is this agent supposed to do?" Vague task assignment produces vague output. An agent asked to "improve the onboarding experience" has no way to succeed – the task has no defined input, no bounded context, no measurable completion criteria, and no clear failure mode. It will do something. That something will not be what you meant.
Domain-Driven Design provides the natural unit of agent work: the bounded context. A bounded context defines a domain area with its own language, its own data, and its own rules. It has explicit inputs and outputs. It has ownership. It has defined edge cases. These are exactly the properties an agent needs to operate reliably.
If the task can be described in one sentence with clear inputs, outputs, and a success condition → it is agent-ready.
If completing the task requires crossing multiple domain boundaries → it needs human orchestration before agents can handle any part of it.
If the failure mode cannot be defined in advance → it is not ready for agent execution, regardless of how sophisticated the model is.
If the agent's output would require organisational context an LLM does not have (cultural norms, relationship history, unwritten constraints) → a human must be in the decision path.
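The four rules above can be sketched as a small triage function. This is an illustration, not a standard: `TaskSpec`, its field names, and the order of the checks are all assumptions of mine, and treating a task with missing inputs, outputs, or success condition as "not ready" is my reading of the first rule rather than something the rules state outright.

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    AGENT_READY = "agent-ready"
    NEEDS_HUMAN_ORCHESTRATION = "needs human orchestration"
    NOT_READY = "not ready for agent execution"
    HUMAN_IN_DECISION_PATH = "human must be in the decision path"


@dataclass(frozen=True)
class TaskSpec:
    # Hypothetical fields; each one mirrors a property named in the rules above.
    description: str
    inputs: tuple
    outputs: tuple
    success_condition: str
    failure_mode: str
    contexts_touched: tuple
    needs_org_context: bool = False


def triage(task: TaskSpec) -> Verdict:
    # No failure mode defined in advance: not ready, whatever the model.
    if not task.failure_mode.strip():
        return Verdict.NOT_READY
    # Assumption: missing inputs, outputs, or success condition also means not ready.
    if not (task.inputs and task.outputs and task.success_condition.strip()):
        return Verdict.NOT_READY
    # Crossing more than one domain boundary needs a human to orchestrate first.
    if len(set(task.contexts_touched)) > 1:
        return Verdict.NEEDS_HUMAN_ORCHESTRATION
    # Output depends on context an LLM does not have.
    if task.needs_org_context:
        return Verdict.HUMAN_IN_DECISION_PATH
    return Verdict.AGENT_READY


vague = TaskSpec(
    description="Improve the onboarding experience",
    inputs=(), outputs=(), success_condition="", failure_mode="",
    contexts_touched=(),
)
scoped = TaskSpec(
    description="Flag invoices whose line items do not sum to the invoice total",
    inputs=("invoice records",),
    outputs=("list of invoice ids",),
    success_condition="every invoice checked exactly once",
    failure_mode="any invoice unreadable or skipped",
    contexts_touched=("billing",),
)

print(triage(vague).value)   # not ready for agent execution
print(triage(scoped).value)  # agent-ready
```

The value of writing it down is less the function than the form: a task that cannot fill in these fields has answered the readiness question already.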
DDD was designed to manage complexity in large software systems by keeping domain concerns separate and explicit. The complexity that DDD manages is the same complexity that defeats autonomous agents. A billing agent that accidentally touches user authentication data because the bounded context was not defined does not produce a billing bug. It produces an incident. The solution is the same in both cases: draw the boundary first.
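One way to make "draw the boundary first" concrete is a deny-by-default guard that every data access passes through. The sketch below is a minimal illustration under my own assumptions: `BoundedContext`, `OutOfContextError`, and the resource names are hypothetical, and a real system would enforce this at the credential or database-role level rather than in application code alone.

```python
from dataclasses import dataclass


class OutOfContextError(PermissionError):
    """Raised when an agent reaches outside its bounded context."""


@dataclass(frozen=True)
class BoundedContext:
    # Hypothetical structure: explicit allowlists, everything else is denied.
    name: str
    readable: frozenset
    writable: frozenset

    def check(self, action: str, resource: str) -> None:
        if action == "read":
            allowed = self.readable
        elif action == "write":
            allowed = self.writable
        else:
            raise ValueError(f"unknown action: {action!r}")
        if resource not in allowed:
            raise OutOfContextError(
                f"{self.name} agent may not {action} {resource!r}"
            )


BILLING = BoundedContext(
    name="billing",
    readable=frozenset({"invoices", "payments", "customers"}),
    writable=frozenset({"invoices"}),
)

BILLING.check("write", "invoices")  # inside the boundary: returns silently

try:
    BILLING.check("read", "auth_credentials")
except OutOfContextError as exc:
    print(exc)  # billing agent may not read 'auth_credentials'
```

With the boundary written down, the billing agent reaching for authentication data fails loudly at the first attempt instead of surfacing later as an incident.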
The Seven Requirements
The following is not a list of nice-to-haves. These are the minimum requirements for an AI agent operating in a real system. If any of them is absent, the deployment will eventually fail – the only open questions are when, and at what cost.
- 01 Bounded Context. The agent's scope must be defined before deployment. What domain does it operate in? What data can it access? What actions can it take? What is explicitly out of scope? Without this, the agent is free to interpret the task – which means it will, incorrectly, in ways you will not predict.
- 02 Defined Output Format. Agents do not improvise presentation. The expected output – structured JSON, a markdown document, a file change, an API call – must be specified. Ambiguous output requirements produce outputs that are technically correct and practically useless.
- 03 Least-Privilege Access. The agent should have access only to what it needs for the assigned task. Over-permissioned agents are a reliability risk as much as a security one: access to more data means more surface area for hallucination and unintended side effects.
- 04 Rollback or Dry-Run Capability. Any agent that writes to production systems must have a mechanism to preview or undo its actions. A dry run is not optional engineering polish. It is the difference between a recoverable mistake and an incident. No exceptions for agents touching live data.
- 05 Human Review Gate. At every consequential action – write operations, communications, deployments, financial transactions – a human reviewer must be in the loop. Define the review checkpoints before deployment, not after the first incident forces you to.
- 06 Cost Awareness Per Run. Every agent call runs on a meter. Multi-step agents multiply token costs across every tool call, every retrieval, every intermediate generation. Know the expected cost per run before deployment. Set a ceiling. Monitor it. Agents with unbounded loops and no cost cap are how $500/month tools become $50,000 incidents.
- 07 Explicit Failure Definition. "Done" must be defined. So must "failed." An agent without a clear failure state will run indefinitely – generating costs, producing output, and reporting success with equal confidence regardless of what it actually produced. Define the exit condition. Both of them.
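Requirements 06 and 07 are the easiest to show in code, because both come down to a loop that is not allowed to run forever. The sketch below is a minimal illustration, not a real framework: `run_with_limits`, `StepResult`, `RunReport`, and the price per thousand tokens are all hypothetical, and the `step` callable stands in for whatever actually invokes your agent.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DONE = "done"
    FAILED = "failed"


@dataclass(frozen=True)
class StepResult:
    tokens_used: int
    done: bool = False    # the success condition was met
    failed: bool = False  # the defined failure condition was hit


@dataclass(frozen=True)
class RunReport:
    outcome: Outcome
    reason: str
    steps: int
    cost_usd: float


def run_with_limits(
    step: Callable[[int], StepResult],
    *,
    max_steps: int,
    max_cost_usd: float,
    usd_per_1k_tokens: float,
) -> RunReport:
    """Run `step` until it reports done or failed, or a hard limit is hit."""
    cost = 0.0
    for n in range(1, max_steps + 1):
        result = step(n)
        cost += result.tokens_used / 1000 * usd_per_1k_tokens
        # The ceiling is checked first: a run that finishes over budget
        # still counts as failed here. That ordering is a design choice.
        if cost > max_cost_usd:
            return RunReport(Outcome.FAILED, "cost ceiling exceeded", n, cost)
        if result.failed:
            return RunReport(Outcome.FAILED, "step reported failure", n, cost)
        if result.done:
            return RunReport(Outcome.DONE, "success condition met", n, cost)
    return RunReport(Outcome.FAILED, "step limit reached", max_steps, cost)


# A stub agent that never finishes: 25,000 tokens per step at an assumed
# $0.01 per 1k tokens is $0.25 per step, so a $1.10 ceiling trips on step 5.
report = run_with_limits(
    lambda n: StepResult(tokens_used=25_000),
    max_steps=50,
    max_cost_usd=1.10,
    usd_per_1k_tokens=0.01,
)
print(report.outcome.value, "-", report.reason, "after", report.steps, "steps")
# failed - cost ceiling exceeded after 5 steps
```

The point of the shape is that every path out of the loop returns a named outcome. There is no branch where the run simply continues, and no branch where it stops without saying why.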
What Not to Delegate to Agents
The list of things agents should not do is not a list of AI limitations. It describes operational realities that apply to any autonomous system. Model capability is rarely the limiting factor. Whether the task structure supports reliable autonomous execution almost always is.
- Anything without a definition of done. If you cannot describe success in advance, the agent cannot reach it. It will produce output – it will not produce the right output.
- Decisions requiring organisational context an LLM does not have. Culture, relationship history, political dynamics, unwritten constraints – these are not in the training data. They cannot be retrieved. They must be held by a human in the loop.
- Actions that compound on failure. Database migrations, bulk record updates, mass communications, production deployments without a review gate. One step in the wrong direction multiplied across ten thousand rows is not a bug. It is a crisis.
- Any task where the only reviewer is the agent itself. Agent-side self-review is a pre-check. It reduces noise. It does not replace external review. Self-certified output reaching production without human approval is a process failure, not a shortfall in AI capability.
- Anything you have never done manually and fully understood. Agents accelerate processes you already control. They do not substitute for understanding a process you have never owned.
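For the actions that compound on failure, the usual shape is plan, review, apply, with an undo log kept throughout. The following is a rough sketch under my own assumptions: `Change`, `plan_changes`, `apply_changes`, and `rollback` are hypothetical names, the in-memory dict stands in for a real table, and the `approve` callback stands in for a human reviewer who would, in practice, be a person looking at the plan rather than a lambda.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Change:
    row_id: int
    field: str
    old: str
    new: str


def plan_changes(rows: dict, field: str, transform: Callable[[str], str]) -> list:
    """Dry run: compute what would change without modifying `rows`."""
    changes = []
    for row_id, row in rows.items():
        new_value = transform(row[field])
        if new_value != row[field]:
            changes.append(Change(row_id, field, row[field], new_value))
    return changes


def apply_changes(rows: dict, changes: list, approve: Callable[[list], bool]) -> list:
    """Apply the plan only if the external reviewer approves it.

    Returns the applied changes, which double as the undo log.
    """
    if not approve(changes):
        return []
    for change in changes:
        rows[change.row_id][change.field] = change.new
    return list(changes)


def rollback(rows: dict, applied: list) -> None:
    """Undo applied changes in reverse order."""
    for change in reversed(applied):
        rows[change.row_id][change.field] = change.old


rows = {1: {"email": "A@Example.com"}, 2: {"email": "b@example.com"}}

plan = plan_changes(rows, "email", str.lower)   # nothing written yet
rejected = apply_changes(rows, plan, approve=lambda changes: False)
print(rows[1]["email"])  # A@Example.com (reviewer said no, data untouched)

applied = apply_changes(rows, plan, approve=lambda changes: True)
print(rows[1]["email"])  # a@example.com

rollback(rows, applied)
print(rows[1]["email"])  # A@Example.com
```

The agent only ever produces the plan. Whether ten thousand rows change is decided by someone who is not the agent, looking at a diff that has not happened yet.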
The most expensive AI failures in 2025–2026 share a pattern: autonomous agents with broad permissions, no human review gate, and unclear completion criteria. Some teams discovered this after burning through investor runway at rates that would have been unthinkable before AI tooling made it technically possible to run ten agents simultaneously. The model was not the problem. The requirements were absent.