In 1943, the analysts had a sensible plan: look at the bombers coming back from their runs, find where they were most riddled with bullet holes, and bolt extra armour there. The wings, the tail, the fuselage. It was obvious. It was also exactly wrong.
The mistake was caught by a statistician named Abraham Wald. Those planes, he pointed out, were the ones that came back. The damage they carried was, by definition, survivable: they had been hit in those places and still flown home. The data nobody had was in the planes that never returned. So armour the spots that were clean on the survivors, Wald argued: the engines, the cockpit, the fuel lines. A hit there was a hit you never came back to report.
This is survivorship bias, and once you have seen it you cannot unsee it. We study what made it back and quietly assume it stands for the whole population. It does not. It stands for the lucky subset. And nowhere is that trap more expensive, or better camouflaged, than in the decisions we make about software that is supposedly running just fine.
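Wald's argument is easy to check with a toy simulation. What follows is a minimal sketch, not a reconstruction of his actual analysis: the per-hit loss probabilities are invented for illustration, and each plane is assumed to take exactly one hit, equally likely on any section. Under those assumptions the hits are spread evenly across the fleet, yet the survivors show far fewer holes in the engines, cockpit and fuel lines.

```python
import random

# Hypothetical chance that a single hit on each section downs the plane.
# These numbers are illustrative assumptions, not historical data.
LOSS_PROBABILITY = {
    "wings": 0.05,
    "tail": 0.05,
    "fuselage": 0.05,
    "engines": 0.60,
    "cockpit": 0.50,
    "fuel_lines": 0.55,
}


def simulate(sorties=120_000, seed=1943):
    """Return (hits on every plane, hits visible on survivors) per section."""
    rng = random.Random(seed)
    sections = list(LOSS_PROBABILITY)
    all_hits = dict.fromkeys(sections, 0)
    survivor_hits = dict.fromkeys(sections, 0)
    for _ in range(sorties):
        # Assumption: one hit per plane, uniformly distributed over sections.
        section = rng.choice(sections)
        all_hits[section] += 1
        # The plane returns only if the hit was not fatal.
        if rng.random() >= LOSS_PROBABILITY[section]:
            survivor_hits[section] += 1
    return all_hits, survivor_hits


all_hits, survivor_hits = simulate()
for section in LOSS_PROBABILITY:
    print(f"{section:10} all={all_hits[section]:6d} "
          f"seen on survivors={survivor_hits[section]:6d}")
```

Anyone who only sees the second column would conclude that the wings take the most fire. The first column, the one nobody had in 1943, says the fire was even and the engines were simply less forgiving.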
The Systems That Came Back
Listen to how we reason about systems that have not failed yet, and you will hear the returning bombers talking. "Our monolith has run for seven years, why modularise it now?" Those seven years are the bullet holes on the wing: proof of what it survived, not proof of what it can still take. "We load-tested the API at ten thousand requests a second, so thirty thousand will be fine." That is taking a reading at three times the range the instrument was ever calibrated for. "We have never had a security breach, so our controls must be working." Or your attackers have simply not been motivated enough yet; the absence of a recorded hit is not evidence of armour. "It held up last Black Friday, it will hold this one." That was last year's traffic, last year's data volume and last year's dependencies, none of which are this year's.
Every one of those sentences mistakes a clean spot on a surviving plane for a strong one. The truth is harder to sit with. You usually do not know which invisible weakness kept you one timeout, one expired certificate, one retry storm away from a very different week. The incident that did not happen taught you nothing, and you quietly thanked it for the silence.
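The load-test sentence is the easiest of the four to put numbers on. Here is a rough sketch using the textbook M/M/1 queue, where mean time in the system is 1 / (service rate - arrival rate). A real API is not a single M/M/1 queue, and the 25,000 requests-per-second capacity is a made-up figure, so treat this as an illustration of the shape of the curve rather than a model of any particular service.

```python
import math


def mean_latency_ms(arrival_rate, service_rate):
    """Mean time in system for an M/M/1 queue, in milliseconds.

    Returns math.inf once arrivals meet or exceed capacity, because the
    queue then grows without bound and no steady state exists.
    """
    if arrival_rate >= service_rate:
        return math.inf
    return 1000.0 / (service_rate - arrival_rate)


# Hypothetical sustainable capacity in requests/second (an assumption).
CAPACITY = 25_000

for load in (10_000, 20_000, 24_000, 30_000):
    latency = mean_latency_ms(load, CAPACITY)
    print(f"{load:6d} req/s -> {latency:.3f} ms")
```

The test at ten thousand sits on the flat part of the curve and looks healthy. It says nothing about where the curve bends, and in this sketch thirty thousand is already past the point where latency stops being a finite number.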
Reinforce What Would Never Come Home
Wald's correction, translated into architecture, is uncomfortable precisely because it asks you to spend effort where there is no visible damage. Build for the failures you have not had yet, not only the ones already in your incident log. Design for the edge cases you have never seen in production, because those are the exact ones with no survivors left to study. And spend your reinforcement budget on the components that, if they take a single clean hit, never return home: the auth path, the payment ledger, the data you cannot reconstruct, the one region everything quietly leans on. I have written about the framework for that kind of resilience in designing for the load you have not met yet and about the myth of 100% uptime. Survivorship bias is a large part of why we keep under-funding both. It is also the close cousin of inverting the problem: instead of asking why the system works, ask what would have to be true for it to fail, and go looking there before the failure finds it for you.
The hard part is cultural, not technical. Reinforcing the engine on a plane that keeps landing safely feels like wasted money, right up until the day it is the only reason a plane lands at all. The work nobody applauds is the failover that is never triggered, the limit that is never reached, the breach that never lands. It is also the work that decides whether next year you are studying your own returning aircraft, or someone else is studying the gap where yours used to be.
Reliability is not built from what survived. It is built from what didn't.