Improving the Service Resilience of your App
Last week, a customer sent a note, asking us how we make sure Server Density remains available, around the clock. We love getting those questions. In fact, we take every opportunity to discuss service resilience in great detail.
As you’d expect, there are two areas that govern how reliably your cloud app runs. The application itself is one. How long does your code operate without failures? How is it architected, and does it scale gracefully?
Equally important, though, is the infrastructure your app is sitting on. What happens if a VM is taken down? How do you handle datacenter failovers? What systems do you have in place?
If your app is to deliver on its bold uptime metrics, you need a good handle on both application and infrastructure resilience. Let’s take a look at how we do those at Server Density.
Service Resilience: the Importance of Application Quality
At its very core, the application itself needs to be solid, i.e. any failures should be handled internally without causing outages (normal vs. catastrophic failure).
Better software quality, by definition, means lower incident rates. To encourage application resilience, it makes sense to expose everyone—including devs and designers—to production issues. In light of that, here at Server Density we run regular war games with everyone who participates in our on-call rotation.
We also recommend running regular Chaos Monkeys. Every single bug we found as a result of chaos events was in our own code. Most had to do with how our app interacts with databases in failover mode, and with libraries we’ve not written. Setting Chaos Monkeys loose on our infrastructure—and dealing with the aftermath—helps us strengthen our app.
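To make the idea concrete, here is a minimal sketch of a chaos event in Python. The service names and the use of `systemctl` are assumptions for illustration, not our actual tooling: pick one service at random, stop it, and watch how the application copes.

```python
import random
import subprocess

def run_chaos_event(services, dry_run=True, seed=None):
    """Pick one service at random and stop it (a minimal chaos event).

    services: candidate process names, e.g. ["mongod", "nginx", "worker"]
    dry_run:  when True, only report which service would have been killed
    seed:     optional seed so a chaos drill can be replayed
    """
    victim = random.Random(seed).choice(services)
    if not dry_run:
        # Stop the service via the init system; requires root privileges.
        subprocess.run(["systemctl", "stop", victim], check=True)
    return victim

if __name__ == "__main__":
    # Dry run: report which service a real event would take down.
    print(run_chaos_event(["mongod", "nginx", "worker"]))
```

The interesting part is never the kill itself; it is whether monitoring fires, failover happens, and the app degrades gracefully afterwards.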
The Importance of Systems
Infrastructure resilience is all about ensuring that individual component failures do not affect our overall infrastructure.
The most obvious way to minimise service interruptions is by having “one more of everything” (for example, if a cluster of 4 servers handles current load, we would run 5). But you also need systems. When faced with an outage, there should be little doubt as to what needs to happen. Doubts cause delays and errors. Focus should be on executing an established set of steps (see checklists). Any downtime should therefore amount to no more than the time it takes to fail over.
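The “one more of everything” rule is just N+1 sizing. A small sketch, with illustrative numbers:

```python
import math

def required_servers(peak_load, per_server_capacity, spares=1):
    """N+1 sizing: enough servers for peak load, plus spare(s) so the
    cluster still copes while one node is down for failover or maintenance."""
    needed = math.ceil(peak_load / per_server_capacity)
    return needed + spares

def survives_failure(cluster_size, peak_load, per_server_capacity, failures=1):
    """True if the remaining nodes can still carry peak load."""
    return (cluster_size - failures) * per_server_capacity >= peak_load

# 4,000 req/s peak, 1,000 req/s per server -> 4 needed, run 5.
print(required_servers(4000, 1000))  # 5
```

With 5 servers, losing one leaves exactly enough capacity for peak load; with only 4, a single failure means degraded service.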
So, let’s start from the most benign types of failures and gradually up the ante. Let’s take a look at how our infrastructure copes as the stakes get higher.
Regular Server Maintenance – Downtime: None
Most of our servers go offline at least once every 30 days. We do full upgrades virtually every month. At the very least, these involve kernel (OS) updates, which require a reboot. By having regular, scheduled server downtime, we get to flex our failover muscle and get better at routing traffic.
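Whatever the tooling, a zero-downtime reboot cycle follows the same pattern: take one node out of traffic at a time, and never move on until it is healthy again. A sketch, with the drain/reboot/health-check steps passed in as callables since every stack wires these differently:

```python
import time

def rolling_maintenance(servers, drain, reboot, is_healthy, timeout=600):
    """Reboot servers one at a time so the cluster never loses
    more than a single node.

    drain, reboot, is_healthy: callables supplied by your own stack
    (load balancer API, IPMI, monitoring check, etc. -- illustrative).
    """
    for server in servers:
        drain(server)            # stop routing traffic to this node
        reboot(server)           # e.g. kernel update + restart
        deadline = time.time() + timeout
        while not is_healthy(server):
            if time.time() > deadline:
                raise RuntimeError(f"{server} unhealthy after reboot")
            time.sleep(5)        # wait before re-checking health
    return servers
```

Because the loop is strictly sequential and gated on health, a bad kernel update stops the rollout at the first broken node instead of taking out the whole cluster.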
Power and Networking Failures – Downtime: None
Our provider, SoftLayer, has redundant power supplies: not just the electricity grid but on-site generators too. Should there be a power outage, the first thing that kicks in is the UPS, whose batteries take over immediately. That buys enough time for the provider to start their generators. To cope with longer power outages, most providers keep fuel stocks on site.
Datacenters should also provide redundant networking paths: not only the physical cabling, but the networking equipment too. Redundancy throughout. We’ve worked with SoftLayer to understand their architecture so we can place our servers and virtual instances across different failure paths.
Zone Failures – Downtime: None
Washington is our primary region. Within that region our workloads are split across two different zones. Each zone is a separate physical facility; in our case they’re approximately 14 miles apart. This protects against unforeseen external events, like an excavator cutting a fiber cable and taking out all networking for an entire facility. We experienced this type of outage 3 years ago. In response, we introduced an extra layer of in-region redundancy.
If our primary datacenter fails for whatever reason, we now have a second, identical, “hot” datacenter within Washington. That datacenter is fully up to date (current) and ready to go with zero delay. Virtually every provider now offers multiple zones per region; it has become standard.
Regional Failures – Downtime: approx. 2 hours
In October 2012, Washington DC declared a state of emergency due to Hurricane Sandy. To preempt any service interruption, we decided to fail over to our secondary datacenter in San Jose (SJC).
It’s important to note that we don’t sit and wait for black swans to happen before testing our readiness. We schedule and test regional failovers on a regular basis (last test was in 2015) so that we’re not taken by surprise in case of a “force majeure”.
We run our database replication “live” (replication happens within a few seconds) but keep all the other instances as snapshot templates, to avoid running idle servers and wasting money and natural resources. This means there is a lag between pressing the button and everything being up. Most failovers are triggered by our own choice, e.g. playing it safe with weather warnings, so this doesn’t matter. But if both our zones were to fail, full recovery would take some time (5-10 minutes to boot up, 10-15 minutes for automatic procedures to complete, 60 minutes for a human to run through the failover checklist).
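Those per-step estimates add up to the recovery window. A small sketch that sums them (the step names mirror the list above; the structure is illustrative):

```python
# Recovery steps and their estimated durations in minutes (low, high),
# taken from the failover estimates above.
RECOVERY_STEPS = {
    "boot instances from snapshot templates": (5, 10),
    "automatic procedures complete": (10, 15),
    "human runs through the failover checklist": (60, 60),
}

def total_recovery_window(steps):
    """Sum per-step (low, high) estimates into an overall window."""
    low = sum(lo for lo, hi in steps.values())
    high = sum(hi for lo, hi in steps.values())
    return low, high

print(total_recovery_window(RECOVERY_STEPS))  # (75, 85)
```

That 75-85 minute window, plus detection and decision time, is where the “approx. 2 hours” figure comes from; note the human checklist dominates, which is why practising it matters more than shaving boot time.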
Sustaining geo-level redundancy involves enormous amounts of duplicate systems. At some point, we asked ourselves: is there any way to make better use of this capacity? After reviewing our options, we decided to move our secondary from San Jose to Toronto, which is geographically nearer to Washington. That will reduce the latency between the two datacenters from 70ms to 20ms. Minimising the latency means we can make better (and more dynamic) use of all that extra capacity in situations where a complete failover is not necessary. We can run both locations in active mode, achieving the same results as the zone setup but with enough geographic distance to avoid localised events such as weather.
Provider Failure – Downtime: 1 day
Are you keeping count? That’s 3 redundant datacenters so far (2 zones in Washington, 1 zone in Toronto). You’d think any more redundancy borders on overkill. How could more than two separate regions fail at the same time?
Well, they can. And they have. In November 2014, Azure suffered a global outage that lasted two hours. (Unlike Azure’s interconnected architecture, SoftLayer and AWS facilities are completely isolated, so this specific type of global outage should be impossible).
There are several ways we could mitigate this risk. We could have another provider with “hot” infrastructure in place that could pick up our entire workload near-instantly. The folks at Auth0 can fail over from Azure to AWS in 60 seconds.
The other option would be to align our fate with our provider’s and do nothing. The risk here is obvious. Should our provider face a long term service interruption, we would run out of options.
We decided to opt for something in the middle. Instead of having a “hot” provider on standby, we have put in place a disaster recovery process using MongoDB’s infrastructure. This involves having a live backup using MongoDB’s Cloud Backup service. We have built our own restore and verify service which runs twice a day to ensure that our backups actually work, and stores a copy of the backup on Google’s storage (versioned so we retain copies going back several weeks).
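The verification step matters more than the backup itself: a backup you have never restored is only a hope. A minimal sketch of the idea behind a restore-and-verify check, comparing restored records against the source (the record format and checksum comparison are illustrative; our real service restores into a scratch database instance):

```python
import hashlib

def checksum(records):
    """Order-independent digest of a collection of string records."""
    h = hashlib.sha256()
    for rec in sorted(records):
        h.update(rec.encode("utf-8"))
    return h.hexdigest()

def verify_restore(source_records, restored_records):
    """A restore is only good if it is complete AND byte-identical."""
    if len(source_records) != len(restored_records):
        return False  # missing or extra records: backup is broken
    return checksum(source_records) == checksum(restored_records)
```

Running a check like this twice a day, as we do, turns “we have backups” into “we have backups that restore”.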
A full rebuild would obviously not be instantaneous, but the time is significantly reduced by using Puppet to manage all our systems. We wouldn’t have to do much by hand because we can easily replicate our existing setup.
Write Good Postmortems
Ultimately, there is no such thing as 100% availability. When sufficiently elaborate systems begin to scale, it’s only a matter of time before some sort of failure happens. There is no way around that.
Writing good postmortems when systems are back online helps restore customer confidence. It demonstrates that someone is investing time in their product. That they care enough to sit down and think things through.
Downtime is expensive in more ways than one. Service interruption can lead to lost revenue, it can impact your productivity and tarnish your reputation.
Ultimately, your availability metrics are an indication of quality. How solid is your application infrastructure? How solid is your failover routine? Furthermore, how solid are your communication, customer care, and postmortems?
Attaining 100% availability might be an impossible feat. How well you prepare, plan and execute around it, is not.