Cloud storage failures – the perfect storm
CEO & Founder of Server Density.
Published on the 5th October, 2011.
When the Qantas A380 suffered engine failure after taking off from Singapore in November last year, many things went wrong at the same time:
The failure of the giant Rolls-Royce turbofan triggered a massive fuel leak, as well as 50 or more failures and malfunctions of various systems and subsystems, some more serious than others and not all of which are yet understood.
These failures included uncontained engine failure (causing shrapnel damage), failure of multiple redundancy electrical buses, loss of fluid in one of two main hydraulic systems, fuel leaks, fuel transfer failure and so on.
When a commercial airliner crashes or has a major problem, it’s never caused by just one thing. Indeed, modern aircraft are designed with multiple levels of redundancy so when something does happen, it’s usually some kind of “perfect storm” of events.
With several failures of cloud storage services over the last few months, and the analysis published after each event, a similar situation seems to exist in the hosting industry with block storage such as Amazon EBS, or utility NAS devices such as the one we use through our hosting provider, Terremark. Each incident follows a common pattern:
- A simple cause cascades into multiple failures. In the Qantas incident, this “simple” event was an uncontained engine failure (complex in itself, but a single event that should have been contained) which went on to cause further system failures. The first Amazon outage began with an error during maintenance; the second with a power failure.
- Automated failover procedures activate but cause further problems because they weren’t designed for the situation that arose. For Amazon, this meant resources were quickly exhausted: by network congestion in the first incident, and by overloaded management servers in the second.
- Software bugs/design flaws cause further issues. In the Qantas case the pilots were unable to fully calculate their landing conditions because the computer interface did not provide enough fields for all the failure variables they needed to input. In Amazon’s case the search for free resources did not back off when nothing was found. In our recent outage at Terremark, a failed-over component was not put into the correct mode, causing significant write latency.
- Manual recovery at scale takes a long time. It took the Qantas pilots a long time to run through all the necessary checklists and procedures before they could land the plane. Amazon’s outage went on for over 11 hours, and even after that required manual recovery of many EBS nodes. The Terremark outage lasted 7 hours, and we suffered large-scale filesystem corruption on many of our nodes after coming back online.
- Availability of parts and people adds to recovery time. Amazon had to physically move storage from one data centre to another to provide sufficient capacity to restore service. Terremark suffered repeated failures of every power unit, which was not expected and caused a shortage of spare parts.
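The back-off flaw in the second point above is worth making concrete. The sketch below is a hypothetical illustration, not Amazon’s actual code: a retry loop that sleeps for an exponentially growing, jittered interval between probes, so that thousands of clients do not hammer an already-degraded control plane in lock-step. The `probe` callable and node names are my own invention for the example.

```python
import random
import time

def find_free_node(probe, attempts=5, base_delay=0.5, max_delay=30.0):
    """Search for a free node via `probe`, backing off between failures.

    Without the back-off, every failed probe retries immediately and the
    management layer is hit by a "retry storm" -- the failure mode
    described above. `probe` is any callable returning a node or None.
    """
    for attempt in range(attempts):
        node = probe()
        if node is not None:
            return node
        # Exponential back-off with jitter spreads retries out so that
        # many clients do not all probe again at the same moment.
        delay = min(max_delay, base_delay * (2 ** attempt))
        time.sleep(delay * random.uniform(0.5, 1.0))
    return None  # give up rather than spin forever

# Example: the first two probes find nothing, the third succeeds.
responses = iter([None, None, "node-42"])
print(find_free_node(lambda: next(responses), base_delay=0.01))  # node-42
```

The key design point is the bounded loop and the `max_delay` cap: the search eventually gives up and reports failure instead of consuming resources indefinitely, which is exactly what the reported outage behaviour lacked.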
So what does this all mean? Well, Amazon and Terremark both have very good uptime levels and their services do work as advertised the majority of the time. We rarely see problems with “core” computing resources like CPU and RAM, perhaps because CPU is CPU and RAM is RAM: these are simple resources to share with very few failure scenarios. And when they do fail it’s usually at the single-instance level, so you can just provision a new one. Cloud storage, however, is persistent, has many more components and requires many more systems to manage it, all of which adds complexity. When a disk breaks, nobody notices because it can just be replaced. But when there is a massive system failure, situations arise that were never anticipated, or that trigger unexpected conditions.
This is compounded by the complexity of diagnosing and fixing these problems, which all add to the total resolution time.
The Qantas pilots were able to successfully land the plane with no fatalities and, because the airline industry requires deep root cause analysis, we can be sure similar events will be avoided in future. Commercial airlines have been around for over 60 years and have vastly more complex machines to deal with – a 747 has over 6 million parts, 3 million of which are independent (the rest being rivets!). Amazon and Terremark learnt from their outages, and we have detailed information about what happened and how it will be fixed, so we can hope that reliability will improve just as it has for flying.