On the AWS Outage – O’Reilly

Everybody notices when something big fails—like AWS’s US-EAST-1 region. And fail it did. All sorts of services and sites became inaccessible, and we all knew it was Amazon’s fault. A week later, when I run into a site that’s down, I still say, “Must be some hangover from the AWS outage. Some cache that didn’t get refreshed.” Amazon gets blamed—maybe even rightly—even when it’s not their fault.

I’m not writing about fault, though, and I’m also not writing a technical analysis of what happened. There are good places for that online, including AWS’s own summary. What I am writing about is a reaction to the outage that I’ve seen all too often: “This proves we can’t trust AWS. We need to build our own infrastructure.”

Building your own infrastructure is fine. But I’m also reminded of the wisest comment I heard after the 2012 US-EAST outage. I asked JD Long about his reaction to the outage. He said, “I’m really glad it wasn’t my guys trying to fix the problem.”¹ JD wasn’t disparaging his team; he was saying that Amazon has a lot of expertise in running, maintaining, and troubleshooting really big systems that can fail suddenly in unpredictable ways, when just the right conditions happen to tickle a bug that had been latent in the system for years. That expertise is hard to find and expensive when you find it. And no matter how expert “your guys” are, all complex systems fail. About 10 days after last month’s AWS failure, Microsoft’s Azure obligingly failed too.

I’m not really an Amazon fan or, more specifically, an AWS fan. But outages like this should force us to remember what they do right. AWS outages also warn us that we need to learn how to “craft ways of undoing this concentration and creating real choice,” as Signal CEO Meredith Whittaker points out. But Meredith understands how difficult it will be to build this infrastructure and that, for the present, there’s no viable alternative to AWS or one of the other hyperscalers.

Operating and troubleshooting large systems is difficult and requires very specialized skills. If you decide to build your own infrastructure, you will need those skills. And you may end up wishing that it weren’t your guys trying to fix the problem.

Footnote

¹ In 2012, I happened to be flying out of DC just as the storm that took US-EAST down was rolling in. My flight made it out, but it was dramatic.
