Forces 05 and 06 bring artifacts to production and route AI task payloads through the
running system. Forces 07 and 08 start where the monitoring dashboard stops being enough.
Force 07 adds a fourth layer to the test pyramid, one that almost nobody runs yet:
AI evaluation. Force 08 asks what SRE looks like when the failure mode is
"confidently wrong" rather than "unavailable", and when the alert that should
fire does not, because the system is behaving exactly as designed.
// TL;DR — what you'll take away
- The test pyramid needs a fourth layer — AI evaluation — and almost nobody runs it in CI yet. The tooling exists; the adoption decision is the gap.
- Your monitoring cannot see "confidently wrong." New SLOs can: hallucination rate, retrieval precision floor, agent audit rate, cost-per-value, and business outcome.
- Autonomous agent actions often cannot be rolled back. Prevention is the only strategy; rollback is a post-hoc audit.
// Companion overview
All 8 Forces Reshaping How Software Gets Built — reference card for the full landscape. This article covers Forces 07 and 08.
force 07
The traditional test pyramid was built for a world where software behaved
deterministically. Given the same input, the same output emerged. Testing meant
specifying inputs, asserting outputs, and confirming that actual and expected results matched exactly.
This is still true for the deterministic parts of your system. It is not sufficient
for the AI parts.
A language model given the same input can produce semantically similar but textually
different outputs across runs. A response can be contextually appropriate, confidently
delivered, factually incorrect, and still go undetected by every existing test
in your suite. None of your current quality tooling was designed for this class of
failure. Most teams are shipping AI features with a testing gap large enough to
affect real users for days before anyone notices.
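To make the gap concrete, here is a minimal Python sketch. The strings and the `token_overlap` helper are invented for illustration, and word overlap is only a crude stand-in for a real semantic metric. It shows an exact-match assertion rejecting a correct paraphrase, while a surface-similarity score rates a factually wrong answer above that paraphrase.

```python
def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercased word sets (illustrative stand-in, not a real eval metric)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 1.0


expected = "The refund window is 30 days from delivery."

# Hypothetical model outputs for the same prompt.
paraphrase = "Refunds are accepted within 30 days from delivery."  # correct, different wording
wrong_fact = "The refund window is 90 days from delivery."         # fluent, but wrong

# Deterministic-style assertion: rejects the correct paraphrase along with the wrong answer.
exact_match = {
    "paraphrase": paraphrase == expected,
    "wrong_fact": wrong_fact == expected,
}

# Surface similarity: scores the wrong answer higher than the correct paraphrase.
overlap = {
    "paraphrase": token_overlap(paraphrase, expected),
    "wrong_fact": token_overlap(wrong_fact, expected),
}

print(exact_match)
print(overlap)
```

Neither check separates the correct answer from the wrong one, which is the argument for a dedicated evaluation layer.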
The Four-Layer Test Suite
Layer 1: Functional Tests (Unit · Integration · Contract). Tools: Vitest · Playwright · Pact · Jest
Layer 2: Non-Functional Tests (Performance · Security · Load). Tools: k6 · Locust · Snyk · OWASP ZAP
Layer 3: Data Quality Tests (Schema · Freshness · Pipeline). Tools: Great Expectations · dbt tests · Monte Carlo
Layer 4: AI Evaluation (Faithfulness · Relevance · Hallucination Rate). Tools: Ragas · DeepEval · LangSmith
Layers 3 and 4 are the ones most teams are missing in 2026.
Layers 1 and 2 are widely adopted and well-understood. Layer 3 — data quality testing
— is implemented by teams with mature data engineering practices and largely absent
everywhere else. Layer 4 — AI evaluation — is understood conceptually, has good
tooling, and has almost no production adoption. Most teams that ship AI features
have never run a single automated AI evaluation test in CI.
What Nothing Is Telling You
Load balancer · Knows · if your API response time is too slow
Error tracker · Knows · if your API is throwing exceptions or returning 5xx errors
Uptime monitor · Knows · if your endpoints are reachable and responding
Log aggregator · Knows · if unusual patterns appear in your structured output logs
Any of the above · Does not know · if your AI feature is producing confidently incorrect answers for 12% of queries
Any of the above · Does not know · if your RAG retrieval precision has degraded because the embedding index drifted
Any of the above · Does not know · if the context your AI is being given is three days stale and the answers reflect it
The last three rows are where your users are living right now if you have shipped
an AI feature without Layer 4. The system is healthy by every metric your monitoring
tracks. Users are receiving incorrect, outdated, or hallucinated responses. The
symptom — bad AI output — and the cause — a data pipeline failure, a stale embedding
index, a retrieval precision threshold that was never defined — are invisible to each other.
// The critical gap
AI features are downstream of data pipelines. Data quality issues surface as AI
quality issues, silently and in production, affecting real users for days before
anyone connects the symptom to the cause. The data quality test suite
is not just a data engineering concern. It is an AI correctness concern. Data quality
tests and AI evaluation belong in the same CI pipeline, owned by the same team, gating
the same deployment.
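One way to picture a shared gate is the sketch below. It is an illustration under stated assumptions, not a prescribed pipeline: `GateResult`, `check_index_freshness`, `check_eval_score`, the one-day freshness limit, and the 0.90 faithfulness minimum are all hypothetical. In a real pipeline the score would come from your evaluation tool and the rebuild timestamp from your data platform.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class GateResult:
    name: str
    passed: bool
    detail: str


def check_index_freshness(last_rebuilt: datetime, now: datetime, max_age: timedelta) -> GateResult:
    """Data quality check: the embedding index must have been rebuilt recently enough."""
    age = now - last_rebuilt
    return GateResult("index_freshness", age <= max_age, f"age={age}, limit={max_age}")


def check_eval_score(name: str, score: float, minimum: float) -> GateResult:
    """AI evaluation check: a metric score must meet its minimum."""
    return GateResult(name, score >= minimum, f"score={score:.2f}, minimum={minimum:.2f}")


def deployment_allowed(results: list[GateResult]) -> bool:
    """One gate: every data quality and AI evaluation check must pass."""
    return all(r.passed for r in results)


# Hypothetical inputs: the index is three days old, the eval score is fine.
now = datetime(2026, 6, 15, 12, 0, tzinfo=timezone.utc)
results = [
    check_index_freshness(now - timedelta(days=3), now, max_age=timedelta(days=1)),
    check_eval_score("faithfulness", score=0.93, minimum=0.90),
]
failed = [r.name for r in results if not r.passed]
print(deployment_allowed(results), failed)
```

Here the evaluation score passes, yet the deployment is blocked because the index is stale. Running the two kinds of check in separate pipelines would have let this release through.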
The AI Evaluation Layer in Practice
The primitives for AI evaluation exist and are production-ready. Ragas and DeepEval
provide the core metrics. LangSmith provides the tracking infrastructure and golden
dataset management. The tooling is not the gap — the adoption decision is.
Faithfulness (Ragas): Does the response stay within the retrieved context? Or is the model adding content not present in the source?
Relevance (Ragas · DeepEval): Is the response actually answering the question asked? High relevance means the answer addresses the query; low relevance means it responds to a different question.
Context Precision (Ragas): Of the chunks retrieved for RAG, what fraction were actually relevant to the query? Low precision means noise is contaminating the context.
Hallucination Rate (DeepEval · LangSmith): What percentage of responses contain statements that cannot be grounded in the context or source data? Tracked over time to detect model or data drift.
Context Recall (Ragas): Was the relevant information in the corpus actually retrieved? Low recall means correct answers exist in your data but the retrieval system is not finding them.
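The retrieval and hallucination metrics above reduce to simple ratios once something has labelled the data. The sketch below computes those ratios in plain Python from hand-made labels. It is not how Ragas or DeepEval implement their metrics (those tools use their own formulas and typically rely on an LLM judge), and the chunk IDs and flags are invented.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of retrieved chunks that are relevant to the query."""
    if not retrieved:
        return 0.0
    return sum(chunk in relevant for chunk in retrieved) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Fraction of the relevant chunks that the retriever actually returned."""
    if not relevant:
        return 1.0
    return len(relevant.intersection(retrieved)) / len(relevant)


def hallucination_rate(has_ungrounded_claim: list[bool]) -> float:
    """Fraction of responses flagged as containing at least one ungrounded statement."""
    if not has_ungrounded_claim:
        return 0.0
    return sum(has_ungrounded_claim) / len(has_ungrounded_claim)


# Hypothetical labels for one query: chunk IDs retrieved vs. chunk IDs a reviewer marked relevant.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c7"}

precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved chunks are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant chunks were retrieved

# Hypothetical per-response flags from a human or LLM judge.
rate = hallucination_rate([False, False, True, False, False])  # 1 of 5 responses flagged

print(precision, round(recall, 2), rate)
```

The arithmetic is trivial. The hard part is producing trustworthy labels, and that is the job of the evaluation tooling and the golden dataset.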
// The golden dataset problem
AI evaluation requires curated golden datasets: representative input–output pairs that
define correct behaviour for your domain. These are not one-time artefacts —
they require active maintenance. When the underlying model is updated, when the
data the RAG system retrieves from changes, when the domain evolves, the golden dataset
must be reviewed and updated to remain meaningful. Ownership of the golden dataset —
which team maintains it, who defines "correct" for edge cases, how it is version-controlled
— is an organisational question that must be answered before the technical tooling can
do its job. Teams that skip this question learn the answer the hard way: their AI eval CI job
passes every PR because the golden dataset is two model versions out of date.
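A lightweight guard against that outcome is to version a small manifest alongside the golden dataset and fail CI when it no longer matches what is deployed. The sketch below assumes a manifest shape of my own invention (`GoldenDatasetManifest`), hypothetical model names, and a 90-day review limit. None of these come from a specific tool.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GoldenDatasetManifest:
    """Hypothetical metadata stored next to the dataset in version control."""
    version: str
    owner: str                   # the team accountable for defining "correct"
    reviewed_against_model: str  # model version the expected outputs were last checked against
    last_reviewed: date


def staleness_reasons(
    manifest: GoldenDatasetManifest,
    deployed_model: str,
    today: date,
    max_review_age_days: int,
) -> list[str]:
    """Return the reasons the dataset should not be trusted; empty means it is current."""
    reasons = []
    if manifest.reviewed_against_model != deployed_model:
        reasons.append(
            f"reviewed against {manifest.reviewed_against_model}, deployed model is {deployed_model}"
        )
    age_days = (today - manifest.last_reviewed).days
    if age_days > max_review_age_days:
        reasons.append(f"last reviewed {age_days} days ago, limit is {max_review_age_days}")
    return reasons


manifest = GoldenDatasetManifest(
    version="2.4.0",
    owner="support-ai-team",
    reviewed_against_model="model-v3",
    last_reviewed=date(2026, 1, 10),
)
reasons = staleness_reasons(
    manifest, deployed_model="model-v5", today=date(2026, 6, 1), max_review_age_days=90
)
for reason in reasons:
    print("STALE:", reason)
```

A CI job would fail whenever `reasons` is non-empty, so a passing evaluation run also confirms that the dataset behind it is current.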
// Force 07 tools · 2026
Ragas
DeepEval
LangSmith
Great Expectations
dbt tests
Vitest / Playwright / Pact
k6 / Locust
Snyk / OWASP ZAP
force 08
Site Reliability Engineering was built on a foundational assumption: a healthy system
is one where latency is within bounds, error rate is below threshold, and throughput
is within capacity. Define those bounds, alert on breach, escalate, fix. The runbook
captures what to do when any of those three numbers goes wrong.
AI systems introduce a failure mode that none of those three numbers captures: the
system is fast, reliable, and operating at normal throughput — and it is giving users
wrong answers. There is no alert for "confident incorrectness." There is no SLO for
"degraded reasoning quality." The runbook has no entry for "the model started
hallucinating entity names in user-visible summaries three days ago and nobody knew."
// The 3–6 month pattern
This failure pattern tends to hit teams shipping AI features 3 to 6 months
after launch. In the first weeks, the team monitors closely, quality is
high, edge cases are rare. As time passes, novelty wears off, monitoring relaxes,
and edge cases accumulate in production. The model encounters inputs it was not
well-evaluated against. The embedding index has drifted from the live data.
A configuration change altered retrieval behaviour without a corresponding eval run.
By the time a user complaint surfaces, the degradation has been ongoing for weeks.
The incident review cannot identify the start date because nothing in the monitoring
recorded it.
New SLO Vocabulary for AI Systems
The answer is not more dashboards. It is defining what "healthy" means for an AI
system: the equivalent of latency and error rate SLOs, applied to AI behaviour.
These SLOs are not standardised; every organisation must define them from the
characteristics of its own domain. But the categories are consistent:
// SLO type 01
Hallucination Rate SLO
The maximum acceptable percentage of responses that contain claims not grounded in the retrieved context or authoritative source data. Typically set per feature tier: higher tolerance for internal tools, lower for user-facing answers.
// Measured by: Ragas faithfulness + DeepEval hallucination scorer on production samples
// SLO type 02
Retrieval Precision Floor
The minimum acceptable ratio of relevant-to-retrieved chunks in your RAG pipeline. Below this floor, the AI is generating from noisy context and output quality will degrade regardless of model capability. Alert before it affects users.
// Measured by: Ragas context precision on a rolling sample of production queries
// SLO type 03
Agent Action Audit Rate
The percentage of autonomous agent actions that are logged with full traceability: task instruction, context window state at decision time, tool call made, result received. 100% is the only acceptable audit rate, because partial traceability means only a partial reconstruction of what the agent did and why.
// Measured by: Langfuse / LangSmith trace completeness per agent session
// SLO type 04
Cost-Per-Delivered-Value
The LLM inference cost attributable to each unit of delivered value — a completed task, a resolved query, a summarised document. Rising cost-per-value (not total cost) is the early warning signal of degrading retrieval efficiency or model misuse patterns before they become incidents.
// Measured by: token cost per completed task, tracked by feature and model version
// SLO type 05
Business Outcome SLO
Did the AI feature achieve its business purpose? Task completion rate (did the agent successfully resolve the request?), user satisfaction on AI-assisted interactions, and AI-assisted resolution rate. These are distinct from Cost-Per-Delivered-Value — they measure whether value was actually delivered, not what it cost to deliver it. A low-cost workflow that produces wrong answers fails this SLO despite passing the cost SLO.
// Measured by: product analytics + user feedback signals + agent task outcome logs (Amplitude, Mixpanel, custom event telemetry)
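As a rough illustration of how the first four SLO types could be checked mechanically, here is a Python sketch. The `Slo` class, the threshold values, the trace field names, and the measurements are all hypothetical. Real values would come from your evaluation and tracing tools, and the Business Outcome SLO is left out because it depends on product analytics.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Slo:
    name: str
    threshold: float
    kind: Literal["max", "min"]  # "max": value must stay at or below; "min": at or above

    def breached(self, value: float) -> bool:
        return value > self.threshold if self.kind == "max" else value < self.threshold


# Assumed trace schema, mirroring the four items listed under Agent Action Audit Rate.
REQUIRED_TRACE_FIELDS = {"instruction", "context", "tool_call", "result"}


def audit_rate(actions: list[dict]) -> float:
    """Share of agent actions whose trace contains every required field."""
    if not actions:
        return 1.0
    complete = sum(REQUIRED_TRACE_FIELDS.issubset(action) for action in actions)
    return complete / len(actions)


def cost_per_value(total_cost_usd: float, completed_tasks: int) -> float:
    """Inference cost per completed task; infinite if nothing was completed."""
    return total_cost_usd / completed_tasks if completed_tasks else float("inf")


# Hypothetical thresholds; every organisation would set its own.
slos = [
    Slo("hallucination_rate", 0.02, "max"),
    Slo("retrieval_precision", 0.75, "min"),
    Slo("agent_audit_rate", 1.00, "min"),
    Slo("cost_per_completed_task_usd", 0.30, "max"),
]

# Hypothetical agent traces: the third is missing its context snapshot.
actions = [
    {"instruction": "refund order", "context": "ctx-1", "tool_call": "lookup_order", "result": "found"},
    {"instruction": "refund order", "context": "ctx-2", "tool_call": "issue_refund", "result": "ok"},
    {"instruction": "notify user", "tool_call": "send_email", "result": "sent"},
]

measurements = {
    "hallucination_rate": 0.04,
    "retrieval_precision": 0.81,
    "agent_audit_rate": audit_rate(actions),
    "cost_per_completed_task_usd": cost_per_value(30.0, 120),
}

breaches = [slo.name for slo in slos if slo.breached(measurements[slo.name])]
print(breaches)
```

In this sample, latency and error rate would look normal while two AI-specific SLOs are in breach. These are the signals the existing dashboards never see.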
// The on-call reality
On-call for an AI system in 2026 requires different debugging skills than traditional
SRE. Latency profiles are normal. Error rates are normal. The investigation starts
with: what changed in the data pipeline? When was the embedding index last rebuilt?
Did the golden dataset evaluation pass on the last deployment? None of these questions
are in a traditional runbook. Building the runbook for AI failure modes is the work
of Force 08 — and most teams have not started it.
Autonomous Action Rollback: The Hard Constraint
// The one-way door problem
Traditional SRE assumes rollback is possible: revert the deployment, restore
the database snapshot, re-route traffic to the previous version. Autonomous agent
actions often cannot be rolled back. An email sent is sent. An API call to
an external service completed. A payment processed. A record deleted.
The SRE principle of reversibility, which underpins almost all traditional incident
response, does not apply once an AI agent has taken an irreversible action in the world.
This means the only viable strategy for autonomous agent failures is prevention:
explicit approval gates before consequential actions, least-privilege tool access,
and conservative action scopes that require human review for any action that
cannot be undone. Rollback is not a recovery strategy for autonomous agents.
It is a post-hoc audit.
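The prevention strategy can be sketched as a thin wrapper around tool execution. What follows is a minimal Python illustration, not a reference design: `ToolCall`, `ApprovalRequired`, `execute`, the tool names, and the set of tools treated as irreversible are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class ApprovalRequired(Exception):
    """Raised when an irreversible action is attempted without human sign-off."""


@dataclass
class ToolCall:
    tool: str
    args: dict = field(default_factory=dict)


# Hypothetical classification: tools whose effects cannot be undone.
IRREVERSIBLE_TOOLS = frozenset({"send_email", "process_payment", "delete_record"})


def execute(
    call: ToolCall,
    allowed_tools: frozenset,
    runner: Callable[[ToolCall], str],
    approved_by: Optional[str] = None,
) -> str:
    # Least privilege: the agent may only call tools it was explicitly granted.
    if call.tool not in allowed_tools:
        raise PermissionError(f"agent is not permitted to call {call.tool!r}")
    # Approval gate: actions that cannot be undone need a named human approver.
    if call.tool in IRREVERSIBLE_TOOLS and approved_by is None:
        raise ApprovalRequired(f"{call.tool!r} cannot be undone; human approval required")
    return runner(call)


executed = []


def fake_runner(call: ToolCall) -> str:
    """Stand-in for the real tool dispatcher; records what actually ran."""
    executed.append(call.tool)
    return "ok"


allowed = frozenset({"search_docs", "send_email"})

execute(ToolCall("search_docs", {"q": "refund policy"}), allowed, fake_runner)

try:
    execute(ToolCall("send_email", {"to": "user@example.com"}), allowed, fake_runner)
    blocked = False
except ApprovalRequired:
    blocked = True

execute(ToolCall("send_email", {"to": "user@example.com"}), allowed, fake_runner,
        approved_by="on-call-reviewer")

print(blocked, executed)
```

The reversible call runs immediately, the unapproved email is stopped before it leaves, and the same email goes out only once a reviewer is named. Recording the approver on the trace would also feed the audit rate SLO.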
Platform Engineering for Agent-Facing Reliability
Force 08 extends platform engineering into new territory. The 2026 platform team
is building infrastructure not just for human-facing services, but for agent-facing
reliability — the primitives that make AI systems observable, safe, and operable
at 3am.
// Force 08 tools · 2026
Langfuse
LangSmith
Arize Phoenix
OpenTelemetry
Datadog LLM Obs.
Guardrails AI
Llama Guard
PagerDuty + custom SLOs
Datadog APM
Grafana + Prometheus
CloudWatch / Azure Monitor
all 8 forces
The Cascade: Each Force Is the Ceiling for the Next
These eight forces are not independent. They form a cascade — each force upstream
determines the quality ceiling for every force downstream.
F01 Requirements → F02 DDD / Domain → F03 Six Tracks → F04 Polyglot Data → F05 CI/CD → F06 Middleware → F07 Test Suite → F08 SRE
The quality of your requirements analysis determines the quality of your domain model.
The quality of your domain model determines the quality of your AI-generated implementation.
The quality of your data infrastructure determines the quality of your AI features.
The quality of your monitoring determines how long you stay blind when something goes wrong.
Each force upstream is the ceiling of every force downstream.
This is why the forces series starts with requirements and DDD rather than tools and
infrastructure. A team with excellent AI observability tooling and a shallow domain model
will observe excellent-quality failures with great clarity. The hard problems stay hard
from the top down.
The forces series maps the structural changes in how software is built, stored, shipped,
and operated when AI is in the delivery chain. Article 11 is where all eight forces
are applied simultaneously — phase by phase across the full software delivery lifecycle.
Nine phases. Eight forces at each phase. Honest impact levels. The failures that happen
at each intersection, and the tools that address them. It is a reference, not an argument.
Keep it open in another tab.
// tool references last reviewed · June 2026