Nobody Promised You 100% Uptime

“Once we move to the cloud, we will have one hundred percent uptime.” I have heard that sentence in more than one meeting, said with the calm confidence of someone who has just signed the contract. Every time, something in me went quiet, because I knew it was not true, and I knew that the day it stopped being true, it would not fail gently.

I learned how that day feels on a cloud migration I worked on. We had done what the diagrams ask for. The workload ran across multiple availability zones, the textbook answer to making a single failure a non-event, and we walked away believing we were bulletproof. Then one of those zones went down and took a critical service with it. The redundancy was real. The recovery was not. Our failover needed a human to notice, understand, and act, and by the time we had finished doing all three, the outage had already been counted by the people who felt it.

The lesson was not that the cloud is unreliable. The cloud did roughly what it promised. The lesson was that uptime had never been one vendor's promise to keep, and that the distance between ninety-nine point nine-nine percent on a slide and a service that actually stays up is filled with decisions only my team could make.

the chain

Uptime Is a Chain, Not a Switch

Availability is not a setting your provider turns on for you. It is a chain, and a chain is only ever as strong as its weakest link. Before a single request reaches your carefully architected service, it has already passed through layers you share with everyone else on the internet, and any one of them can break entirely on its own.

// Link 01 · The Connection You Do Not Own

The network in between

Long before the cloud, there is the path to it. If the user's internet provider is having a bad morning, if a fibre route is cut, or if DNS quietly stops resolving, the most elegant multi-region design on earth becomes invisible to the person trying to reach it. I have watched a perfectly healthy platform get reported as down because an upstream provider three companies away was the one actually failing. You do not own this link, which is exactly why you have to assume it will fail and decide in advance what your system does when it cannot be reached.

// Link 02 · The Provider, Nines and All

AWS, Azure and Google Cloud go down too

Not often, but they do, and their own contracts say so in the small print. A 99.99% availability target sounds like a rounding error short of perfect until you turn it into time. It still permits roughly fifty-two minutes of downtime a year, and 99.9% permits almost nine hours. The nines are a budget, not a guarantee, and your provider is spending part of that budget on your behalf whether or not you planned for the bill.
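Turning a target into time is simple arithmetic, and it is worth doing once by hand. The sketch below is mine rather than any provider's formula: the function name `downtime_budget` is made up for illustration, and it assumes a 365-day year.

```python
from datetime import timedelta


def downtime_budget(availability_percent: float, period: timedelta) -> timedelta:
    """Return the downtime an availability target permits over a period."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return period * unavailable_fraction


YEAR = timedelta(days=365)  # assumption: a non-leap year

for target in (99.9, 99.99):
    budget = downtime_budget(target, YEAR)
    minutes = budget.total_seconds() / 60
    print(f"{target}% allows about {minutes:.0f} minutes of downtime a year")
```

Swap `YEAR` for the window your contract actually measures, since many agreements count by the month rather than the year.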

// Link 03 · The Part You Break Yourself

The layer you fully control

Then there is the layer you own completely, and it is the one that fails most. A bad deploy. An autoscaling policy tuned to the wrong metric. A dependency that times out and drags its callers down with it. Most of the outages I have been paged for did not start in a data centre in another country. They started with a change we made, on a system that was perfectly fine until we touched it. The cloud was up. We were the incident.

So your real availability is not the best number in that chain. It is all of them multiplied together, which leaves it below even the weakest link, and every change you ship on top lowers it further. You inherit the weakest link, and then you add your own. It helps to see what the famous nines actually buy you, because the gap between them is the difference between a coffee break and a long weekend of being down.

| Availability | Downtime per year | Downtime per month |
| --- | --- | --- |
| 99% · two nines | about 3.65 days | about 7.2 hours |
| 99.9% · three nines | about 8.8 hours | about 43 minutes |
| 99.99% · four nines | about 52 minutes | about 4.4 minutes |
| 99.999% · five nines | about 5 minutes | about 26 seconds |
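The multiplication is easier to believe once you run it. Here is a minimal sketch, assuming the links fail independently and that every one of them has to be up for a request to succeed. The three figures are invented for illustration and are not measurements from any real system.

```python
from math import prod


def chain_availability(links: dict[str, float]) -> float:
    """Availability of components that must all be up, assuming independent failures."""
    return prod(links.values())


# Hypothetical figures, chosen only to show the effect.
links = {
    "user network and DNS": 0.999,
    "cloud provider": 0.9999,
    "our own service": 0.9995,
}

combined = chain_availability(links)
hours_per_year = (1 - combined) * 365 * 24

print(f"weakest single link:   {min(links.values()):.4%}")
print(f"combined availability: {combined:.4%}")
print(f"downtime per year:     about {hours_per_year:.1f} hours")
```

The combined figure always lands below the weakest single link, which is why a four-nines provider cannot give a three-nines application four nines.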

Every nine you add costs far more than the last, in redundancy, in automation, and in the discipline to keep all of it honest. That is why uptime is a decision with a price tag, not a default you inherit by signing up. I have written separately about turning these qualities into numbers a system can be held to, in "designing for the load you have not met yet". This piece is about what you do once you accept that the chain will break.

lesson one

Design for Failure, Because It Is Coming

The first lesson the availability-zone outage left me with was almost embarrassing in hindsight. Assume things will break, and design as though you actually believe it. We had treated multiple zones as insurance we were sure we would never claim. Redundancy you never exercise is not insurance. It is a guess wearing a confident face.

Designing for failure means deciding, on purpose and ahead of time, what happens when each piece goes away. What does the system do when a zone disappears, when a dependency is slow rather than fully down, when the database is reachable but returning nonsense? It means a small blast radius by default, so one failure cannot quietly become all of them, and bulkheads between components so a struggling neighbour cannot pull the healthy ones under with it. The more you distribute a system, the more places it can fail, which is why microservices do not remove this problem so much as multiply its surface. Resilience belongs in the first diagram, not bolted on after the first incident.
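A bulkhead sounds abstract until you see how small it can be. The sketch below is one possible shape, not a reference implementation: the `Bulkhead` and `BulkheadFull` names are mine, and it assumes a threaded service where each dependency gets its own fixed number of concurrent slots.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a compartment has no free slots left."""


class Bulkhead:
    """Caps concurrent calls into one dependency so it cannot exhaust shared capacity."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Fail fast instead of queueing: a queue would let a slow neighbour
        # tie up callers, which is what the bulkhead exists to prevent.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} is at capacity")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


# Hypothetical compartments: a slow recommendations service can fill its own
# two slots, but it cannot take any of the capacity reserved for checkout.
recommendations = Bulkhead("recommendations", max_concurrent=2)
checkout = Bulkhead("checkout", max_concurrent=10)
```

The design choice that matters is rejecting the extra call immediately. The caller then has to decide what a degraded answer looks like, which is the decision this lesson asks you to make ahead of time.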

lesson two

If a Human Has to Step In, It Is Not Really Failover

The second lesson was sharper, because it was the actual reason our outage lasted as long as it did. Our failover existed. It was documented. It simply required a person to trigger it. And a recovery that waits for a human is a recovery that arrives late, because the human has to first notice, then diagnose, then remember where the runbook lives, usually at the worst possible hour of the night.

Automated recovery is the difference between an incident and an outage. Health checks that pull a sick instance before it poisons the pool. Failover that fires because a threshold was crossed, not because someone happened to be awake. Retries with backoff, and circuit breakers that stop one failing call from becoming a failing system. A runbook that lives only in someone's head is not a recovery plan. It is a hope with your name on it. The honest test is simple: if your failover needs a hero, you do not have failover. You have a fire drill you have not run yet.
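Retries with backoff and a circuit breaker fit in a few dozen lines, which makes it harder to excuse leaving them out. This is a minimal sketch under stated assumptions: the class and function names, thresholds, and delays are all illustrative, the breaker is not thread-safe, and a production system would more likely use an established library than this.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is refused without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("dependency marked unhealthy; failing fast")
            # Timeout elapsed: half-open, so one trial call is let through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        self._failures = 0
        self._opened_at = None
        return result


def retry_with_backoff(func, attempts=4, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Retry func with capped exponential backoff and full jitter."""
    for attempt in range(attempts):
        try:
            return func()
        except CircuitOpenError:
            raise  # retrying against an open circuit only adds load
        except Exception:
            if attempt == attempts - 1:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The two compose by wrapping the breaker call inside the retry, for example `retry_with_backoff(lambda: breaker.call(fetch_price))`, where `fetch_price` stands in for whatever remote call you are protecting. Once the breaker opens, the retry loop stops immediately instead of hammering a dependency that is already struggling.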

lesson three

Break It on Purpose, on a Tuesday

The third lesson is the one most teams agree with and least often do. You do not actually know whether a system is resilient until you have made it fail and watched what happens next. Reasoning about failure on a whiteboard is not the same as living through it. The only way to trust your recovery is to trigger it yourself, deliberately, while everyone is awake and watching, instead of discovering its gaps at two in the morning when it counts.

This is the discipline behind chaos engineering: injecting controlled failure into a system, on purpose, to prove it can absorb the failures it will eventually meet by accident. Kill an instance and confirm the traffic reroutes. Add latency to a dependency and confirm the circuit breaker trips. Take a zone offline in a rehearsal and confirm the thing you assumed would happen actually happens. Teams who practise this stop being surprised, because they have already met their failure modes and shaken hands with them.
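The shape of an experiment is always the same: state a hypothesis, inject one failure, check the hypothesis, and restore. The toy below shows only that shape. It runs entirely in memory, every name in it is hypothetical, and it is not how Netflix or any real chaos tool is implemented. A real experiment targets real infrastructure.

```python
import itertools


class Instance:
    def __init__(self, name: str) -> None:
        self.name = name
        self.healthy = True

    def handle(self, request: str) -> str:
        if not self.healthy:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Pool:
    """Round-robin pool that skips instances failing their health check."""

    def __init__(self, instances) -> None:
        self.instances = list(instances)
        self._cursor = itertools.cycle(self.instances)

    def route(self, request: str) -> str:
        for _ in range(len(self.instances)):
            instance = next(self._cursor)
            if instance.healthy:  # stand-in for a real health check
                return instance.handle(request)
        raise RuntimeError("no healthy instances left")


def kill_one_instance_experiment(pool: Pool, victim: Instance, requests: int = 20) -> bool:
    """Hypothesis: losing one instance causes zero failed requests."""
    victim.healthy = False  # inject the failure
    try:
        for i in range(requests):
            pool.route(f"req-{i}")
        return True
    except Exception:
        return False
    finally:
        victim.healthy = True  # always restore, even when the hypothesis fails


a, b, c = Instance("a"), Instance("b"), Instance("c")
print("survives losing b:", kill_one_instance_experiment(Pool([a, b, c]), b))
```

The `finally` block is the part worth copying. An experiment that cannot reliably undo its own damage is just an outage you scheduled.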

Netflix made this idea famous, and they explained it better in motion than any paragraph can. The talk below is close to a decade old now, but it remains one of the clearest visual depictions of resilience in action that I know of, and almost everything in it still holds.

[Embedded video: Netflix on chaos engineering and designing for resilience. Older now, but still one of the clearest visual explanations of the idea.]

lesson four

Make the Alerts Mean Something

There is a last way a system fools you, and it is the most comfortable one: it looks healthy. The build is green, the pipeline is clean, the dashboards glow. None of that is the same as being ready for what happens after it goes live. A delivery pipeline only proves you can ship. Observability only proves you can see. Neither one promises the system will hold when real traffic, and real failure, finally arrive – that part was decided in the design, long before the dashboard was wired up.

The trap hiding inside those green dashboards is alert fatigue. When every CPU spike and every transient blip pages someone, the alerts stop carrying information and start carrying noise, and a team trained to ignore its alerts will ignore the one that finally matters, usually at the worst hour of the night. The repair is to tie every alarm to something a user would actually feel: an SLO being burned, latency crossing the line where carts get abandoned, errors that reach real people – not a number that looks alarming on a graph and means nothing on the ground. A good alert is a sentence about the user, not a reading off the machine.
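One common way to tie an alarm to what users feel is to alert on how fast the error budget is burning rather than on raw machine metrics. The sketch below assumes a 99.9% SLO and a paging threshold of 14.4, a figure often used in multi-window burn-rate alerting because it corresponds to spending 2% of a 30-day budget in one hour. Both numbers and both function names are assumptions to adjust, not recommendations.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("an SLO of 100% leaves no error budget to burn")
    return error_ratio / budget


def should_page(short_window_errors: float, long_window_errors: float,
                slo_target: float = 0.999, threshold: float = 14.4) -> bool:
    """Page only when both a short and a long window are burning fast.

    The short window shows the problem is happening now; the long window
    shows it is not a transient blip.
    """
    return (burn_rate(short_window_errors, slo_target) >= threshold
            and burn_rate(long_window_errors, slo_target) >= threshold)


# A brief spike: 5% errors in the last few minutes, 0.05% over the last hour.
print("page on a blip:", should_page(0.05, 0.0005))
# A sustained problem: 2% of requests failing in both windows.
print("page on a sustained burn:", should_page(0.02, 0.02))
```

Note what happens with a target of 100%: there is no budget, so the function refuses to compute anything. That is the same argument this article makes, in code.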

the honest promise

Resilience Is the Honest Promise

So I have stopped arguing about the myth in those meetings and started redirecting it instead. One hundred percent uptime is not a target. It is a sentence that sets a team up to feel like failures the first time reality disagrees with them. The number you can actually stand behind is a different kind of promise: that when something fails, and something always will, most people will never notice, and the system will recover without a hero being paged.

That is what designing for resilience buys you. Not a perfect record, but a system that bends instead of breaking, that fails in small ways you planned for rather than large ways you did not, and that tells you the truth about its own health while there is still time to act. Chase the unbreakable system and you will be brittle and surprised. Build the one that recovers well, and you will be neither.

Stop chasing one hundred percent uptime. It is a number you cannot honestly promise and a standard that quietly punishes the people holding the pager.

Design for resilience instead, and let the system keep the promise the slide never could.