Richi Jennings

Amazon EC2 cloud outage: Apology and explanation

April 29, 2011 6:06 AM EDT
By Richi Jennings. April 29, 2011.

Updated: Amazon (AMZN) has offered its promised mea culpa. It's published its post-mortem on the recent outage of its AWS EC2 (Amazon Web Services Elastic Compute Cloud) and RDS (Relational Database Service). In the writeup, it says what went wrong and how it's planning to avoid such problems in the future. In IT Blogwatch, bloggers read and inwardly-digest it.

Your humble blogwatcher curated these bloggy bits for your entertainment. Not to mention Mario With A Portal Gun...

While you were asleep, Amazon's AWS Team got contrite:
We would like to share more details with our customers. ... We are very aware that many of our customers were significantly impacted. ... The issues affecting EC2 customers ... primarily involved a subset of the Amazon Elastic Block Store ... volumes in a single Availability Zone ... that became unable to service read and write operations ... “stuck” volumes.
The degraded EBS cluster ... caused high error rates and latencies for EBS [API] calls. ... As with any complicated operational issue, this one [had] several root causes interacting with one another. ... [We have] many opportunities to protect the service against any similar event reoccurring.
At 12:47 AM PDT on April 21st, a network change ... was executed incorrectly. ... The secondary network couldn’t handle the traffic level ... many EBS nodes ... were completely isolated from other EBS nodes in its cluster. ... This change disconnected both the primary and secondary network simultaneously. ... When the incorrect traffic shift was rolled back ... nodes rapidly began searching the EBS cluster for available server space where they could re-mirror data. ... Because the issue affected such a large number of volumes concurrently ... this quickly led to a “re-mirroring storm.”
Zee M. Kane was among the first bloggers to spot it:
One of the biggest criticisms of the whole affair was Amazon’s ... silence. As startups ... were left to apologise ... because of Amazon’s failure. Amazon maintained its silence.
Today, two weeks later, Amazon has issued an apology and a post mortem. ... Amazon has also issued a 10 day credit ... to customers in the affected zone.
What impact will this have on Amazon’s web services business and will we see ... distrust in the cloud? ... This combined with the hacking of Sony’s Playstation Network is ... going to leave many ... hesitant about the future of a cloud based world.

Henry Blodget has heard from one big AWS customer:
The explanation will be important. As will the explanation for how the company could have permanently destroyed some of its customers' data. ... Here's an email Amazon sent to a big customer letting them know. ... You'd think that, under the circumstances, Amazon could do a bit better. ...
A few days ago we sent you an email letting you know that we were working on recovering ... one or more of your Amazon EBS volumes. We are very sorry, but ... our efforts to manually recover your volume were unsuccessful.
What we were able to recover has been made available via a snapshot. ... If you have no need for [it] please delete it to avoid incurring storage charges.
Your Humble Blogwatcher can't help but react:
Uh, hello? Did I just fall through a portal to an alternate universe?
So, Amazon, it's not bad enough that you've lost their data; you have to add insult to injury by charging them for storing the corrupt copy of their data!
Good grief.
Ten out of ten for the complete post-mortem, but minus several million for this irony-free BS.
And Molly McHugh is befuddled:
There’s also the matter of compensating the affected users for their loss. ... Amazon’s [EC2 SLA] guarantees 99.95 percent uptime ... but the fine print makes it easier for Amazon to avoid crediting anyone. ... Basically, you only get some redress from the site if you opted to use various “availability zones” ... regional zones that are specifically built to stay up ... in case of system failure.
[This] might make it difficult for users trying to seek credit ... contract loopholes should [not] be able to keep compensation from the 0.07 percent who lost their data. ... Until Amazon steps forward to explain ... its plans for crediting ... those affected, customers will just have to sit, wait, and hope the down time is over.
Meanwhile, Beth Cohen grinds her ax:
Setting aside the issue of Amazon [SLAs], all of this assumes that you have control over ... the systems and services in your IT stack. ... The recent outage highlighted ... that, even if they had built in the best ... high availability into their systems, they were still dependent on vendors and services that might not have been quite so diligent.
Uptime is going to be increasingly more difficult to determine through the maze of inter-dependent services. ... You need to make sure you have ... architected your own service to have a full failover solution ... [and] do diligence on all of your vendors’ policies and architectures. ... If one of your upstream service providers does not have a good policy in place, your site will still be affected. ... No matter how good the SLA is.

And Finally...
Mario With A Portal Gun
[hat tip: Matt Burns, via John Funk]

Richi Jennings is an independent analyst/consultant, specializing in blogging, email, and security. He's also the creator and main author of Computerworld's IT Blogwatch -- for which he has won American Society of Business Publication Editors and Jesse H. Neal awards on behalf of Computerworld, plus The Long View. He has been a cross-functional IT geek since 1985. You can follow him as @richi on Twitter, pretend to be richij's friend on Facebook, or just use good old email. You can also read Richi's full profile and disclosure of his industry affiliations.