The Cobra We All Bred

On the night of 19 October 2025, a single AWS region in northern Virginia lost track of its own network, and a startling share of the internet went quiet with it. No attack, no disaster. Just an empty DNS record inside one region called us-east-1, and a failure cascading outward until services that ran nowhere near Virginia could not start, could not sign in, could not recover. I have built on clouds like this for twenty years, and what unsettled me was not that a region failed. Everything fails eventually. It was that so many of us had quietly, rationally agreed to fail in the same place.

There is an old idea from economics that explains how a thing like that happens without anyone deciding it should. It is usually told as a story about snakes.

// the parable

The German economist Horst Siebert gave it a name in a 2001 book about the ways economic policy goes wrong: the cobra effect. The story he told is the one you may know. The British administration in colonial India, worried about venomous cobras, put a bounty on dead ones. Enterprising people began breeding cobras to collect it. When the authorities realised what was happening and cancelled the reward, the breeders set their now-worthless snakes loose, and the city, Delhi in most tellings, was left with more cobras than before. Whether it happened quite like that is doubtful. Historians who go looking find little record of it, and naturalists note that cobras do not breed obligingly in captivity, so treat it as a parable rather than a documented event. The mechanism underneath it is real, though, and has a precise name: a perverse incentive. You reward the outcome you want, and the reward quietly manufactures the outcome you were trying to prevent.

I kept thinking about that snake while I read the post-mortem of the outage.

the default

We Each Made the Same Sensible Choice

There was never a bounty on us-east-1. There was something subtler: a thousand small rewards, all pointing the same way. It is AWS's oldest region, opened in 2006, and its largest. New services tend to land there first, and most of them are cheapest to run there. It is the default in the SDK and the CLI, in Terraform and the console, in nearly every tutorial a new engineer ever follows. The global token service quietly defaults to it. When you spin up a quick proof of concept, when a CI pipeline needs somewhere to live, when a startup picks a region on day one and never revisits the decision, the path of least resistance runs through northern Virginia.

Every one of those choices is individually rational. Cheaper, faster to start, more complete, exactly what the documentation assumed. Nobody set out to build a monoculture. We each made the sensible local decision, and the sensible local decision, made by enough people for enough years, became an industry-wide single point of failure. That is the cobra effect with no villain in it. The reward was convenience and a slightly smaller bill. The cobra we bred was concentration.
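One way to turn the region from an inheritance into a decision is to refuse the silent default in your own tooling. What follows is a minimal sketch, not an AWS API: the function names and the exception are hypothetical, and it assumes the standard AWS_REGION and AWS_DEFAULT_REGION environment variables and the regional STS endpoint pattern of the commercial AWS partition.

```python
import os

# Environment variables that AWS tooling commonly reads for the region.
REGION_ENV_VARS = ("AWS_REGION", "AWS_DEFAULT_REGION")


class RegionNotChosen(RuntimeError):
    """Raised when nobody has explicitly chosen a region (hypothetical)."""


def resolve_region(explicit=None, env=None):
    """Return a region only if someone actually chose one.

    Order: explicit argument, then the environment variables above.
    There is deliberately no hard-coded fallback such as "us-east-1".
    """
    env = os.environ if env is None else env
    if explicit:
        return explicit
    for name in REGION_ENV_VARS:
        value = env.get(name)
        if value:
            return value
    raise RegionNotChosen("no region configured; pick one on purpose")


def regional_sts_endpoint(region):
    """Build the regional STS endpoint URL (commercial partition only)."""
    return f"https://sts.{region}.amazonaws.com"
```

The point is the missing fallback: a proof of concept that fails loudly when no region is set forces the question on day one, when it is still cheap to answer.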

what actually broke

A Second Region, and No Way to Reach It

By AWS's own account, the trigger was almost banal. A latent race condition in the system that manages DynamoDB's DNS left the regional endpoint with an empty record, so for a few hours dynamodb.us-east-1.amazonaws.com simply could not be resolved. DynamoDB itself recovered in roughly three hours. The damage did not stop there, because in a cloud this interconnected, nothing fails alone.

// the cascade, by AWS's own account

DNS erased the address. A stale plan overwrote a newer one and was then cleaned up, deleting the IP addresses for the regional DynamoDB endpoint. Requests could no longer find it.
EC2 fell next. The subsystem that tracks server leases keeps its state in DynamoDB; the leases expired, and when DynamoDB returned, the stampede to re-establish them collapsed under its own weight. EC2 took most of the following day.
The recovery machinery jammed. Load-balancer health checks fell into a loop, and identity calls inside the region stalled. The total event ran about fourteen hours, from late on the 19th to mid-afternoon on the 20th.

More than a hundred AWS services were touched, and through them a long tail of the consumer internet: by the outage trackers' tallies, services people used every day went dark or wobbled for hours, with reports running into the millions. AWS named no customers, which is its habit. The full account is public, in AWS's own post-event summary, and it is worth reading slowly.

Here is the part that should keep architects awake. Plenty of the teams that went down had done the right thing. They had a second region. They had a disaster-recovery plan. What they did not always have was a way to run it, because the machinery of failing over often lives in us-east-1 too. AWS's own fault-isolation guidance is blunt about this: a number of its global services keep a single control plane in one region while serving traffic everywhere, and for a long list of them, including IAM, the Route 53 management plane, and the default endpoint of the token service, that one region is us-east-1. When you try to fail over, you assume an IAM role and you update a DNS record. During the outage, those were exactly the calls that stalled. You can build a second home and still be unable to open the door, if the keys are kept in the house that is on fire.

In the room, this is not an abstraction. It is an architect watching design debt come due in real time, the shortcut from three years ago presenting its invoice. It is a CloudOps engineer staring at a pipeline that cannot deploy the fix. It is an SRE on a 3 a.m. call, explaining a breached service level they did not break and never agreed to, only inherited from whoever picked the region on day one.
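The trap of keys kept in the burning house can be checked on paper before it is checked at 3 a.m. The sketch below is illustrative only: the runbook, the step names, and the eu-west-1 standby are assumptions, and the region dependencies are ones you would have to establish for your own setup. It tags each failover step with the regions whose control planes it calls, then asks which steps are blocked when one region is gone.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One failover runbook step and the regions it needs to reach."""
    name: str
    depends_on: frozenset  # regions whose control planes this step calls


def blocked_steps(runbook, failed_region):
    """Return the names of steps that cannot run while failed_region is down."""
    return [step.name for step in runbook if failed_region in step.depends_on]


# Hypothetical runbook. The us-east-1 dependencies mirror the global
# control planes described above; the standby region is an assumption.
RUNBOOK = [
    Step("assume failover role via global STS endpoint", frozenset({"us-east-1"})),
    Step("update Route 53 record to point at standby", frozenset({"us-east-1"})),
    Step("scale up standby stack", frozenset({"eu-west-1"})),
]

if __name__ == "__main__":
    for name in blocked_steps(RUNBOOK, "us-east-1"):
        print("blocked:", name)
```

If the list for the region you are fleeing is not empty, the plan is a document about failover, not a failover.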

meanwhile, in europe

The Rules They Resented Kept Them Online

While a good part of the American internet waited on Virginia, a lot of European systems stayed up. Not because their architects were wiser, but because a regulator had, in effect, made the prudent choice for them years earlier, and for an entirely unrelated reason.

// GDPR · resilience by accident

A data-protection law, not a resilience regime. It never tells you where to keep a server. What it does is make moving personal data out of the EU genuinely hard, through the lawful-transfer machinery of its Chapter V, and that pressure pushes teams to run inside European regions and keep data close to home. Resilience was never the point; the word appears in passing in Article 32, as one property of keeping data safe. The multi-region, data-sovereign shape this produced was a side effect, one that happened to survive a single region going dark.

// DORA · resilience on purpose

Since 17 January 2025, financial firms across the EU have answered to a regulation built for operational resilience outright, the Digital Operational Resilience Act: mandatory ICT risk frameworks, real resilience testing, and direct supervisory oversight of the critical cloud providers everyone leans on. That last part reads as if it were written about this exact night. It exists because regulators worried about the whole sector resting on a handful of hyperscalers in a handful of places.

Read those two side by side and the shape of the argument appears. GDPR bought Europe resilience by accident, as a by-product of caring about privacy. DORA buys it on purpose. And DORA, in particular, is close to a regulator's formal answer to the cobra effect: when an entire sector defaults to the same provider in the same region, that concentration stops being a private engineering choice and becomes systemic risk, so someone with the authority to do so decides to price it in before the snakes get loose.

There is a tidy irony in that. The rules a lot of engineers grumble about, the data-residency reviews and the audit trails and the third-party registers, are the same rules that quietly handed them a second region they were too busy to build for themselves. Sometimes the regulation you resent is the only thing standing between you and your own perverse incentive.

the lesson, borrowed

The Wiser Learn From Someone Else's Night

There is a line, usually called a Chinese proverb, that the wise learn from their own mistakes and the wiser learn from other people's. The cheapest way to absorb the lessons of that October is to treat them as someone else's mistakes, and bank them before you have to live them.

The first lesson is about blast radius. A region, an availability zone, a provider: each is a boundary, and the discipline is to decide on purpose how far a failure is allowed to travel, and to design that isolation in before you ship rather than discover it at 3 a.m. The second is that a recovery plan you have never executed is a document, not a capability. Replicating infrastructure as code so a second region is real instead of aspirational, then rehearsing the failover and the failback until both are boring, is the whole game. I have argued this from the other side in "Nobody Promised You 100% Uptime": if your failover needs a hero, you do not have failover. October adds one clause to that. If your failover needs the region that is already down, you do not have failover either.

The rest is familiar, because none of it is new. Break the system on purpose, on a quiet Tuesday, and include the region-wide failure everyone assumes will never come, because the failure modes you have rehearsed are the only ones you meet calmly when they arrive for real. Prefer designs that keep running without phoning home for permission, so that "fail over" never quietly means "ask the broken region for help". This is the same non-functional discipline I keep returning to in "Design for the Load You Have Not Met Yet": availability and resilience treated as numbers a system is held to, decided early and built in, not adjectives argued over during the incident.
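Treating availability as a number means doing the arithmetic before the incident does it for you. As a rough sketch, assuming a 99.9% target measured over a 30-day window (both figures are illustrative, not taken from any particular contract), the downtime budget and the cost of a fourteen-hour event work out like this:

```python
def downtime_budget_minutes(availability, period_days=30):
    """Minutes of downtime allowed per period at a given availability target."""
    return (1 - availability) * period_days * 24 * 60


def periods_consumed(outage_hours, availability, period_days=30):
    """How many periods' worth of budget a single outage uses up."""
    return outage_hours * 60 / downtime_budget_minutes(availability, period_days)


if __name__ == "__main__":
    target = 0.999  # assumed target, for illustration only
    print(f"budget: {downtime_budget_minutes(target):.1f} minutes per 30 days")
    print(f"14-hour outage: {periods_consumed(14, target):.1f} periods of budget")
```

At 99.9% the budget is 43.2 minutes per 30 days, so a fourteen-hour outage spends a little over nineteen months of it in one night. That is the kind of figure worth knowing while there is still time to design against it.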

And the last lesson is not technical at all. Resilience is not a feature you add near the end. It is a habit a team keeps from the beginning, the choice to automate recovery instead of improvising it, to rehearse instead of hope, to treat the dull drill as the real work. That is a culture, and cultures are built, not bought, and certainly not defaulted into.

So when the next region has its bad night, and it will, spare a thought for the people who will be awake for it. The DevOps engineers, CloudOps engineers, and SREs are the ones who hold the line between an outage and a meltdown, usually in the dark, usually without thanks, very often cleaning up cobras that someone else bred years before they ever arrived. Build the second region. Rehearse the failover. And learn the names of the people who run toward the fire while the rest of us are still refreshing a status page.