What Agents Actually Need

Late on a Saturday I had two windows of Claude open side by side, each running on its own. The one on the left had spun up twenty-one autonomous agents; an hour and sixteen minutes and three point six million tokens later, it reported that it was finished. The one on the right: fourteen agents, forty-six minutes, one point nine million tokens. I had not written a line of code in either. Both showed the same headline numbers – output satisfaction one hundred percent, review coverage one hundred percent. I was, briefly, delighted.

// the actual receipt · two parallel sessions, one Saturday
[Image: Claude session status bar reading 1 hour 16 minutes, 3.6 million tokens]
Session 1 · 21 agents
Output satisfaction: 100%
Review coverage: 100%
Work completeness: 94%
[Image: Claude session status bar reading 46 minutes, 1.9 million tokens]
Session 2 · 14 agents
Output satisfaction: 100%
Review coverage: 100%
Work completeness: 83%
Satisfied and reviewed in full, both of them. And yet about one in sixteen pieces of work on the left, and about one in six on the right, simply not done.

That last number, work completeness, is the whole story. The model was never the bottleneck – spinning up twenty agents is a button press, and three million tokens is a rounding error. What stood between me and shipping a confident, half-finished result was not a cleverer model. It was that every agent's output had been reviewed against a definition of done that the agents themselves did not hold. The requirements were doing the work the model could not.

Article 03 gave you the machine: the lineage from search, the four-layer stack, the lifecycle. This is the discipline of pointing that machine at your real systems without it quietly costing you a fortune – who stays in the loop, how to scope the work, and the things an agent cannot run without.

// previous · article 03
You Already Know AI – You Just Called It Search
The lineage from search to AI, the four-layer stack, and the two lifecycles you can play. Start there if you skipped it.
// in one breath
  • Why most agent failures are a scoping problem rather than a model problem, and the three roles every production system needs (one of which teams keep forgetting).
  • How Domain-Driven Design quietly solved "what is this agent allowed to do" years before agents existed.
  • The seven requirements an agent cannot work without, and the five kinds of work you should never hand one.

Three Roles. Not Two.

Most teams structure AI deployment around two roles: the AI system and the human using it. That model fails in production at a predictable rate. The failure point is almost always the same: no one defined who reviews what before it mattered. By the time it matters, the cost is already paid.

There are three roles, and all three apply whether you are deploying a single Copilot instance or orchestrating ten autonomous coding agents across a sprint.

// the three-role model · mandatory in production
Role 01 · End Users
Initiate intent and evaluate final output. They describe what they need. They approve or reject results. They do not write specifications for agents. They do not debug agent chains. Their only job is to be clear about desired outcomes – and to evaluate whether the output actually meets them.
Role 02 · Agents (three sub-roles)
Execute the work. Three sub-roles within any non-trivial agent system:
Planner – Receives user intent and decomposes it into a sequence of steps with defined inputs and outputs. This is where most agent failures originate: if the plan is underspecified, every downstream step amplifies the error.
Worker – Executes individual steps. Calls tools, retrieves data, generates outputs. Operates inside a bounded context. Does not make architectural decisions.
Reviewer (agent-side) – Checks the worker's output against the step's success criteria before handing off. It runs before the human reviewer, as a pre-check rather than an approval gate, and it reduces the noise that reaches humans rather than the risk.
Role 03 · Human Reviewers
The mandatory oversight layer. They approve, reject, or correct agent output at defined checkpoints. In a production system, a human reviewer must be in the loop at every point where the agent's action has external consequences – write operations, communication, financial transactions, deployments. The agent reviewer sub-role can be automated. The human reviewer cannot be replaced. Any architecture that removes human review from consequential actions is a prototype, not a system.
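
The division of labour above can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: `Step`, `StepResult`, `run_pipeline`, and the `kind` labels are all hypothetical names, and the planner, worker, agent-side reviewer, and human reviewer are passed in as plain callables so the control flow stays visible.

```python
from dataclasses import dataclass
from typing import Callable

# Action kinds the article lists as having external consequences.
CONSEQUENTIAL = {"write", "communication", "financial", "deployment"}


@dataclass
class Step:
    name: str
    kind: str       # hypothetical label, e.g. "read", "write", "deployment"
    criteria: str   # success criteria the agent-side reviewer checks against


@dataclass
class StepResult:
    step: Step
    output: str
    passed_precheck: bool


def run_pipeline(
    steps: list[Step],                            # the planner's output
    worker: Callable[[Step], str],                # executes one step
    agent_review: Callable[[Step, str], bool],    # pre-check, not an approval
    human_approve: Callable[[StepResult], bool],  # the mandatory gate
) -> list[StepResult]:
    """Run planned steps; stop at the first failed pre-check or human rejection."""
    completed: list[StepResult] = []
    for step in steps:
        output = worker(step)
        result = StepResult(step, output, agent_review(step, output))
        if not result.passed_precheck:
            break  # the pre-check filters noise before a human ever sees it
        if step.kind in CONSEQUENTIAL and not human_approve(result):
            break  # consequential steps never proceed without a human
        completed.append(result)
    return completed
```

Note that the agent-side reviewer can only stop a step early. It never substitutes for `human_approve` on a consequential step, which is the point of keeping the two roles separate.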

boundaries

Domain-Driven Design and Agent Scope

The question that kills most agent deployments is not "which model should I use?" It is: "what exactly is this agent supposed to do?" Vague task assignment produces vague output. An agent asked to "improve the onboarding experience" has no way to succeed – the task has no defined input, no bounded context, no measurable completion criteria, and no clear failure mode. It will do something. That something will not be what you meant.

Domain-Driven Design provides the natural unit of agent work: the bounded context. A bounded context defines a domain area with its own language, its own data, and its own rules. It has explicit inputs and outputs. It has ownership. It has defined edge cases. These are exactly the properties an agent needs to operate reliably.

// bounded context test – is this task agent-ready?

If the task can be described in one sentence with clear inputs, outputs, and a success condition → it is agent-ready.

If completing the task requires crossing multiple domain boundaries → it needs human orchestration before agents can handle any part of it.

If the failure mode cannot be defined in advance → it is not ready for agent execution, regardless of how sophisticated the model is.

If the agent's output would require organisational context an LLM does not have (cultural norms, relationship history, unwritten constraints) → a human must be in the decision path.
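
One way to make the test above routine is to write the task down as data and check it before any agent runs. The sketch below assumes a hypothetical `TaskSpec` record and `readiness_problems` function; the field names are illustrative, and the "one sentence" judgement is left to a human because it cannot be checked mechanically.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    description: str
    inputs: list[str] = field(default_factory=list)
    outputs: list[str] = field(default_factory=list)
    success_condition: str = ""
    failure_mode: str = ""
    domains: list[str] = field(default_factory=list)  # bounded contexts touched
    needs_org_context: bool = False


def readiness_problems(task: TaskSpec) -> list[str]:
    """Return the reasons a task is not agent-ready; an empty list means ready."""
    problems = []
    if not (task.inputs and task.outputs and task.success_condition):
        problems.append("no clear inputs, outputs, or success condition")
    if len(task.domains) > 1:
        problems.append("crosses domain boundaries: needs human orchestration first")
    if not task.failure_mode:
        problems.append("failure mode is not defined in advance")
    if task.needs_org_context:
        problems.append("needs organisational context: keep a human in the decision path")
    return problems


vague = TaskSpec(description="Improve the onboarding experience")
print(readiness_problems(vague))
```

The vague onboarding task from earlier fails on two counts (nothing defines its inputs, outputs, or success, and nothing defines failure), which is the same verdict the prose reaches.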

DDD was designed to manage complexity in large software systems by keeping domain concerns separate and explicit. The complexity DDD manages is the same complexity that defeats autonomous agents. A billing agent that accidentally touches user authentication data because the bounded context was not defined does not produce a billing bug. It produces an incident. The solution is the same for agents as it is for software: draw the boundary first.
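
Drawing the boundary first can be as literal as an allowlist that is checked before every tool call. The following is a sketch under assumptions: `BoundedContext`, `OutOfScopeError`, and the resource names are hypothetical, and a real system would enforce the same rule at the credential or API layer rather than in application code alone.

```python
from dataclasses import dataclass


class OutOfScopeError(PermissionError):
    """Raised when an agent requests something outside its bounded context."""


@dataclass(frozen=True)
class BoundedContext:
    name: str
    readable: frozenset[str]
    writable: frozenset[str]

    def check(self, action: str, resource: str) -> None:
        # Unknown actions map to an empty set, so they are denied by default.
        allowed = {"read": self.readable, "write": self.writable}.get(
            action, frozenset()
        )
        if resource not in allowed:
            raise OutOfScopeError(
                f"{self.name} agent may not {action} {resource!r}"
            )


# Hypothetical billing context: it can read three tables and write one.
billing = BoundedContext(
    name="billing",
    readable=frozenset({"invoices", "payments", "customers"}),
    writable=frozenset({"invoices"}),
)

billing.check("read", "payments")            # in scope, passes silently
try:
    billing.check("write", "auth_sessions")  # outside the boundary
except OutOfScopeError as err:
    print(err)
```

With a check like this in front of every tool call, the billing agent's stray write to authentication data is refused at the boundary instead of surfacing later as an incident.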

requirements

The Seven Requirements

The following is not a list of nice-to-haves. These are the minimum requirements for an AI agent operating in a real system. If any of these are absent, the deployment will eventually fail – the only variable is when, and what the cost is.

  1. Bounded Context
    The agent's scope must be defined before deployment. What domain does it operate in? What data can it access? What actions can it take? What is explicitly out of scope? Without this, the agent is free to interpret the task – which means it will, incorrectly, in ways you will not predict.
  2. Defined Output Format
    Agents do not improvise presentation. The expected output – structured JSON, a markdown document, a file change, an API call – must be specified. Ambiguous output requirements produce outputs that are technically correct and practically useless.
  3. Least-Privilege Access
    The agent should have access only to what it needs for the assigned task. Over-permissioned agents are a reliability risk as much as a security one: access to more data means more surface area for hallucination and unintended side effects.
  4. Rollback or Dry-Run Capability
    Any agent that writes to production systems must have a mechanism to preview or undo its actions. A dry run is not optional engineering polish. It is the difference between a recoverable mistake and an incident. No exceptions for agents touching live data.
  5. Human Review Gate
    At every consequential action – write operations, communications, deployments, financial transactions – a human reviewer must be in the loop. Define the review checkpoints before deployment, not after the first incident forces you to.
  6. Cost Awareness Per Run
    Every agent call runs on a meter. Multi-step agents multiply token costs across every tool call, every retrieval, every intermediate generation. Know the expected cost per run before deployment. Set a ceiling. Monitor it. Agents with unbounded loops and no cost cap are how $500/month tools become $50,000 incidents.
  7. Explicit Failure Definition
    "Done" must be defined. So must "failed." An agent without a clear failure state will run indefinitely – generating costs, producing output, and reporting success with equal confidence regardless of what it actually produced. Define the exit conditions. Both of them.
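
Requirements 06 and 07 fit naturally into one loop: a run ends because the work is done, because it failed, or because a limit was hit, and never for any other reason. The sketch below is illustrative only. `RunLimits`, `Outcome`, and `run_agent` are hypothetical names, and `step` stands in for whatever performs one agent iteration and reports its cost in dollars along with a status string.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    DONE = "done"
    FAILED = "failed"
    BUDGET_EXCEEDED = "budget_exceeded"
    STEP_LIMIT = "step_limit"


@dataclass
class RunLimits:
    max_cost_usd: float
    max_steps: int


def run_agent(
    step: Callable[[], tuple[float, str]],  # returns (cost_usd, status)
    limits: RunLimits,
) -> tuple[Outcome, float, int]:
    """Call `step` until it reports 'done' or 'failed', or a limit is hit."""
    spent = 0.0
    for n in range(1, limits.max_steps + 1):
        cost, status = step()
        spent += cost
        if status == "done":
            return Outcome.DONE, spent, n
        if status == "failed":
            return Outcome.FAILED, spent, n
        # Checked after the step, so a run can overshoot by one step's cost.
        if spent >= limits.max_cost_usd:
            return Outcome.BUDGET_EXCEEDED, spent, n
    return Outcome.STEP_LIMIT, spent, limits.max_steps
```

An agent that never reports "done" or "failed" still terminates here, and it terminates with a named outcome that a monitor can alert on instead of a silent success.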

the other list

What Not to Delegate to Agents

The list of things agents should not do describes operational realities that apply to any autonomous system, not limitations of AI. Model capability is rarely the limiting factor. Whether the task structure supports reliable autonomous execution almost always is.

// do not delegate these to agents
  • Anything without a definition of done. If you cannot describe success in advance, the agent cannot reach it. It will produce output – it will not produce the right output.
  • Decisions requiring organisational context an LLM does not have. Culture, relationship history, political dynamics, unwritten constraints – these are not in the training data. They cannot be retrieved. They must be held by a human in the loop.
  • Actions that compound on failure. Database migrations, bulk record updates, mass communications, production deployments without a review gate. One step in the wrong direction multiplied across ten thousand rows is not a bug. It is a crisis.
  • Any task where the only reviewer is the agent itself. Agent-side self-review is a pre-check. It reduces noise. It does not replace external review. Self-certified output reaching production without human approval is a process failure, not a model failure.
  • Anything you have never done manually and fully understood. Agents accelerate processes you already control. They do not substitute for understanding a process you have never owned.
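
For actions that compound on failure, the safer shape is to compute the full set of changes first, show them to a reviewer, and write only after approval. Here is a minimal in-memory sketch of that pattern. `plan_changes` and `apply_with_gate` are hypothetical names, rows are plain dictionaries standing in for database records, and the write step is a stand-in for whatever a real system would do inside a transaction.

```python
from typing import Callable

Row = dict[str, object]
Change = tuple[Row, Row]  # (row as it is now, row as it would become)


def plan_changes(rows: list[Row], transform: Callable[[Row], Row]) -> list[Change]:
    """Dry run: work out what would change without modifying `rows`."""
    changes = []
    for row in rows:
        after = transform(dict(row))  # the transform only ever sees a copy
        if after != row:
            changes.append((row, after))
    return changes


def apply_with_gate(
    rows: list[Row],
    transform: Callable[[Row], Row],
    approve: Callable[[list[Change]], bool],  # the human review gate
) -> int:
    """Apply the planned changes only if the reviewer approves the whole plan."""
    changes = plan_changes(rows, transform)
    if not changes or not approve(changes):
        return 0  # nothing was written
    for before, after in changes:
        before.clear()
        before.update(after)  # in-memory stand-in for the real write
    return len(changes)
```

The reviewer sees every planned change, all ten thousand rows if it comes to that, before a single one is written, and a rejection leaves the data exactly as it was.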

The most expensive AI failures in 2025–2026 share a pattern: autonomous agents with broad permissions, no human review gate, and unclear completion criteria. Some teams discovered this after burning through investor runway at rates that would have been unthinkable before AI tooling made it technically possible to run ten agents simultaneously. The model was not the problem. The requirements were absent.

You now have the architecture from Article 03 and the discipline to run it. The next question is where it actually breaks. From here the series moves into the engine room: the nine phases of the delivery lifecycle, and the one concrete failure that shows up in each when teams point AI at a process that was never designed for it.