AWS Outage Teaches Us to Monitor the Cloud Like It’s Your Data Center
CEO & Founder of Server Density.
Published on the 20th March, 2017.
At the beginning of the month, AWS suffered a major outage of its S3 service, the original storage product which launched Amazon Web Services back in 2006.
Reliance on this service was highlighted by the vast number of services which suffered downtime or degraded service as a result. The root cause turned out to be human error followed by cascading system failures.
With a growing dependence on the cloud for computing and with no signs of demand for cloud resources abating, we really need to treat those resources like the on-premises data center that we relied on for so many years.
As the article “3 Steps to Ensure Cloud Stability in 2017” points out, “it’s critical to ensure the stability of your cloud ecosystem”, and that starts with monitoring. The article offers the following advice: “Ensure that you have access to reports which can give you actionable, predictive analytics around your cloud so that you can stay ahead of any issues. This goes a long way in helping your cloud be stable.”
Of course, I couldn’t agree more! Server Density even built an app to send notifications when cloud providers have outages.
The cloud might provide “unlimited” scalability and instant provisioning, but its SLAs and reliability guarantees are often mistaken for 100% uptime and complete reliability. Note that S3 itself guarantees 99.99% uptime every year, which equates to just under an hour (roughly 53 minutes) of expected downtime.
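To make the "just under an hour" figure concrete, here is a minimal Python sketch that converts an availability percentage into a yearly downtime budget. The function name is my own and the calculation assumes a 365-day year; treat it as back-of-the-envelope arithmetic, not a statement of how any provider measures its SLA.

```python
# Convert an availability percentage into expected downtime per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def expected_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability figure."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


# Compare a few common availability targets.
for target in (99.9, 99.99, 99.999):
    minutes = expected_downtime_minutes(target)
    print(f"{target}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

Each extra "nine" cuts the downtime budget by a factor of ten, which is why the difference between 99.9% and 99.99% matters so much when you plan around a provider's guarantee.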
But note that the outage only affected the US East region. Other regions were unaffected, yet the fact that so many services suffered outages indicates that they rely on a single region for their deployments. AWS runs many zones within each region. These are equivalent to individual data centers, but they still sit within one logical group and a small geographical area. Cross-region deployment is typically reserved for mitigating geographic events such as storms, but it should also be used to mitigate software and system failures. Good systems practice means code changes get rolled out gradually, and indeed AWS states that regions are entirely isolated and operated independently, so a failure in one region should not spread to the others.
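As an illustration of what not relying on a single region can look like in application code, the sketch below picks the first healthy region from a preference list. Everything here is hypothetical: the `pick_region` helper, the `status` dictionary standing in for real monitoring checks, and the choice of region names. A real deployment would plug in its own health checks and would more likely handle failover at the DNS or load balancer layer.

```python
from typing import Callable, Optional, Sequence


def pick_region(
    regions: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first region, in preference order, that passes the health check."""
    for region in regions:
        try:
            if is_healthy(region):
                return region
        except Exception:
            # A check that raises (for example on a timeout) counts as unhealthy.
            continue
    return None  # No region passed; time to page someone.


# Hypothetical health data standing in for real monitoring results.
status = {"us-east-1": False, "us-west-2": True}

active = pick_region(["us-east-1", "us-west-2"], lambda region: status[region])
print(active)
```

The health check is passed in as a function so that the selection logic can be tested without touching the network, which also makes it easy to rehearse a region failure.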
S3 itself has a feature which automates cross-region replication. Of course, this roughly doubles your storage bill because you have data in two regions, but it does allow you to switch over in the event an entire region is lost. Whether that cost is worth it depends on the type of service you’re running. Expecting an hour a year of downtime is the starting point for the cost-benefit calculation, but this particular outage took the service offline for longer than that.
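That cost-benefit calculation can be sketched in a few lines of Python. The function and all of the figures below are made up for illustration, and the model is deliberately crude: it assumes the second copy costs the same as the first and ignores transfer and request charges, so substitute your own numbers before drawing any conclusions.

```python
def replication_worthwhile(
    monthly_storage_cost: float,
    expected_outage_hours_per_year: float,
    loss_per_outage_hour: float,
) -> bool:
    """Compare the yearly cost of a second copy with the expected outage loss."""
    # Assumption: keeping a replica in another region costs as much as the original.
    extra_cost_per_year = monthly_storage_cost * 12
    expected_loss_per_year = expected_outage_hours_per_year * loss_per_outage_hour
    return expected_loss_per_year > extra_cost_per_year


# Hypothetical service: 500 per month on storage, one hour of expected downtime.
print(replication_worthwhile(500, 1, 10_000))  # an outage hour costs 10,000
print(replication_worthwhile(500, 1, 1_000))   # an outage hour costs 1,000
```

With these invented numbers, replication pays for itself only for the service that loses more per outage hour than the replica costs per year. An outage that runs for several hours, as this one did, shifts the balance further toward replicating.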
Human error can never be eliminated, but the chances of it can be reduced. Using automation and checklists, and ensuring teams practice incident response, all contribute to good system design. Having a plan for when things go wrong is crucial, and it is just as crucial to test on a regular basis that the plan actually works! And when the incident is resolved, following up with a detailed (and blameless) post mortem will provide reassurance to customers that you are working to prevent the same situation from happening again.
Outages will always throw up something interesting, such as the AWS Status Dashboard itself being hosted on S3. The key is knowing when something is going wrong, having a plan and closing it up with a post mortem.