Multi data center redundancy – sysadmin considerations

By David Mytton,
CEO & Founder of Server Density.

Published on the 6th June, 2013.

Last week, I considered the implications of multi data center redundancy for your applications. This post looks at considerations for the sysadmin – network and server level failover, plus other aspects of infrastructure that need to be considered.

Network considerations for multi data center redundancy

The network is the connecting layer between you and your customers, and between all your internal components across data centers. This means there are a few areas where you need to consider how failover works:

Connecting you to your customers – IPs and DNS

If one of your data centers fails then you need a way to redirect traffic to the secondary data center. We achieve this for Server Density by using the global IP service offered by Softlayer. This is similar to Amazon’s Elastic IP service but lets you point the IP at any server in any of their data centers, rather than being restricted to a single region. This means we can redirect traffic to an entirely different data center without any customer impact – the IP stays the same; it’s the internal routing that changes.

However, there is a failure scenario where this doesn’t work. The change requires the original data center routers to acknowledge the new configuration before the new route is applied, which might fail if there is a network problem in the original data center. This is mitigated by using DNS failover.

Our DNS is set with a low TTL so that we can adjust the IPs we send traffic to. In the event we can’t reroute our global IP, we can update the DNS to point to IPs in another data center, which we have already reserved and which are ready to act as hot standbys. The downside is that this can cause some downtime, because DNS responses are cached for the length of the TTL. Some ISPs cache more aggressively, so it’s not guaranteed to update for all customers at the same time.
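As a rough illustration of that DNS failover step, the sketch below checks whether the primary global IP is reachable and, if not, repoints a low-TTL record at the reserved standby IP. The IPs, hostname and update_dns_record() helper are placeholders – in practice you would call your DNS provider’s API, and you would probably want several consecutive failures before acting.

    # Sketch of DNS failover to a reserved standby IP (placeholder IPs and helper).
    import socket

    PRIMARY_IP = "203.0.113.10"   # global IP currently routed to the primary data center
    STANDBY_IP = "198.51.100.20"  # reserved IP in the secondary data center

    def tcp_check(ip, port=443, timeout=3):
        """Return True if a TCP connection to ip:port succeeds within the timeout."""
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    def update_dns_record(name, ip):
        """Hypothetical helper: call your DNS provider's API to repoint the record."""
        print("Would update %s -> %s (the low TTL means clients pick this up quickly)" % (name, ip))

    if not tcp_check(PRIMARY_IP):
        update_dns_record("app.example.com", STANDBY_IP)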

An alternative would be to use round robin DNS to provide all the IPs at once, but this relies on the client timing out on failed connections, and it means some connections would always go to the secondary data center – which may not be what you want if you run a primary/secondary data center setup.

Internal connectivity

This is mostly about latency. Within a data center you can usually expect sub-1ms round trip times, but as soon as you go between data centers, latency increases with distance.

From Washington, DC to San Jose, CA in the USA we see round trip times of around 72ms. Across the Atlantic you can expect anything up to 100ms, trans-Pacific up to 150ms and between Europe and Japan up to 300ms.

The implications of this are relevant for database replication – can you live with eventual consistency, or do you need to guarantee that data reaches all your data centers before a write is confirmed? Network partitions can also be a concern.
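To put rough numbers on that tradeoff: if each write has to be acknowledged by a replica in a remote data center before it is confirmed, every write pays at least one cross-data-center round trip. A back-of-the-envelope calculation using the RTTs above:

    # Rough effect of synchronous cross-data-center replication on write throughput.
    # Assumes each confirmed write waits for at least one remote acknowledgement.
    rtts_ms = {"same data center": 1, "US coast to coast": 72, "trans-Atlantic": 100, "Europe to Japan": 300}

    for link, rtt in rtts_ms.items():
        serial_writes_per_sec = 1000.0 / rtt  # a single connection waiting on each ack
        print("%-18s RTT %4dms -> at most ~%4d serial synchronous writes/sec" % (link, rtt, serial_writes_per_sec))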

It’s also relevant for file transfers, particularly for things like backup. If you have to restore an offsite backup to your secondary data center, how long will the file transfer actually take?
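The same round trip times matter for bulk transfers, because a single TCP stream is limited to roughly window size / RTT regardless of the link’s raw bandwidth. A rough calculation, assuming a 4MB TCP window and a 500GB backup (both figures are just illustrative):

    # Back-of-the-envelope restore time for an offsite backup over a single TCP stream.
    window_bytes = 4 * 1024 * 1024   # assumed TCP window
    backup_bytes = 500 * 1024 ** 3   # assumed backup size

    for label, rtt_ms in [("within a data center", 1), ("US coast to coast", 72), ("trans-Atlantic", 100)]:
        throughput = window_bytes / (rtt_ms / 1000.0)  # bytes/sec, capped by window / RTT
        hours = backup_bytes / throughput / 3600.0
        print("%-22s ~%7.1f MB/s -> ~%5.1f hours for 500GB" % (label, throughput / 1024 ** 2, hours))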

[Image: internal ping round trip times]

Server & ops considerations for multi data center redundancy

Load balancer failover

We run load balancer pairs using nginx – one active, one hot standby. These are monitored externally and, if there is a failure, our global IP is automatically rerouted to the standby load balancer and an on-call alert is triggered.

Gateway access

Almost none of our servers can be reached via SSH from the public internet – all access goes through a single gateway server, even across multiple data centers. Of course, if the data center where this gateway is located goes down then you lose all access, so we keep a duplicate gateway in our secondary data center.
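With OpenSSH you would normally express this with ProxyJump (or ProxyCommand) in ~/.ssh/config; if you script access instead, the same hop can be sketched in Python with paramiko. The hostnames, usernames and internal IP below are made up.

    # Sketch of SSH access via a gateway (jump) host using paramiko (hypothetical hosts).
    import paramiko

    gateway = paramiko.SSHClient()
    gateway.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    gateway.connect("gateway.example.com", username="ops")  # relies on your SSH agent for keys

    # Open a tunnel from the gateway to the internal server, then SSH over it.
    channel = gateway.get_transport().open_channel(
        "direct-tcpip", ("10.0.1.5", 22), ("127.0.0.1", 0)
    )
    internal = paramiko.SSHClient()
    internal.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    internal.connect("10.0.1.5", username="ops", sock=channel)

    stdin, stdout, stderr = internal.exec_command("uptime")
    print(stdout.read().decode())

    internal.close()
    gateway.close()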

An alternative to this is to use a VPN to connect directly into the environment network. I don’t like this approach because it opens up the entire network to your local system, which can lead to mistakes e.g. connecting to production by accident.

Self-hosted tools, e.g. config management

We run Puppet to manage configuration across all our servers and, amongst other things, it manages our internal hostnames through a centralised /etc/hosts file. When servers need to be replaced or IPs changed, we can do this using Puppet, but Puppet has no built-in redundancy, so we have to keep a replacement Puppet server set up and ready as a hot standby.
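The /etc/hosts part of this is simple enough to sketch outside Puppet: render the file from one authoritative mapping of internal hostnames to IPs, so a change made in one place reaches every server on the next run. This is only an illustration of the idea, not our actual manifests; the names and addresses are made up.

    # Illustration of centrally managed internal hostnames (hypothetical names/IPs).
    HOSTS = {
        "10.0.1.10": ["db1.internal"],
        "10.0.1.11": ["db2.internal"],
        "10.0.2.10": ["lb1.internal"],
    }

    def render_hosts(mapping):
        """Render an /etc/hosts file from the central hostname map."""
        lines = ["127.0.0.1 localhost"]
        lines += ["%s %s" % (ip, " ".join(names)) for ip, names in sorted(mapping.items())]
        return "\n".join(lines) + "\n"

    if __name__ == "__main__":
        print(render_hosts(HOSTS))  # config management would write this to /etc/hosts on each node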

This applies to other tools you might be using. Do you have a central logging server? A backup management system? Maybe you even run your own monitoring! The advantage of using SaaS products is you don’t have to think about redundancy and monitoring for these tools, but some are best run yourself. In those cases, you need to consider the failover of each tool and how your response might be impacted if that tool was unavailable.

Monitoring

You need to know when you’re completely down, but also when certain failure scenarios take place: usually low traffic, high latency or increased errors. This requires a combination of monitoring across your entire infrastructure, from remote response time tests through to end-to-end testing of request pipelines. Ideally, you will detect problems before customers start noticing.
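At the simple end of that spectrum, a remote test just times a request and flags both hard failures and unusually slow responses. The URL and threshold below are placeholders:

    # Minimal remote check: flag failures and unusually slow responses (placeholder URL/threshold).
    import time
    import urllib.request

    URL = "https://app.example.com/api/ping"
    LATENCY_THRESHOLD = 2.0  # seconds

    start = time.time()
    try:
        status = urllib.request.urlopen(URL, timeout=10).status
        elapsed = time.time() - start
        if status != 200 or elapsed > LATENCY_THRESHOLD:
            print("DEGRADED: status=%s latency=%.2fs" % (status, elapsed))
        else:
            print("OK: %.2fs" % elapsed)
    except Exception as exc:
        print("DOWN: %s" % exc)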

[Image: Server Density screenshot]

Communication

The worst thing as a customer is to see problems with a service but have no idea what is happening. This is where public status pages come in. But it’s not enough just to post when there are problems – if your service is critical to your customers, they need to be able to subscribe to be warned when there are issues.

Server Density monitoring is critical to our customers, so if we have a service issue where our monitoring stops, they need to know immediately so they can monitor their systems manually. The same applies to services like PagerDuty – if their alerting system is down, you have no way to know when your own services are down!

You can build your own fancy status page, like Heroku and AWS do, but there are also services like StatusPage.io which provide a hosted page for you.

It’s also worth considering where else you should be telling your customers – many will be looking to social media if there are problems.

[Image: AWS status page]

Automatic failover

We have automated failover of our nginx load balancer pairs but a full data center failover requires a manual process. The problem with automated failover is the potential for flapping, which can make a situation even more confusing.
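If you do automate more of the process, one common way to reduce flapping is hysteresis: require several consecutive failed checks before failing over, and a longer run of consecutive good checks before failing back. A sketch of that logic, with arbitrary thresholds:

    # Hysteresis sketch: fail over only after sustained failure, fail back only
    # after sustained recovery. Thresholds are arbitrary illustrations.
    FAIL_THRESHOLD = 5        # consecutive bad checks of the primary before failing over
    RECOVER_THRESHOLD = 30    # consecutive good checks of the primary before failing back

    state = "primary"
    bad_streak = 0
    good_streak = 0

    def on_check_result(ok):
        """Update streak counters for the primary data center and decide whether to switch."""
        global state, bad_streak, good_streak
        bad_streak = 0 if ok else bad_streak + 1
        good_streak = good_streak + 1 if ok else 0
        if state == "primary" and bad_streak >= FAIL_THRESHOLD:
            state = "secondary"
            print("Failing over to secondary data center")
        elif state == "secondary" and good_streak >= RECOVER_THRESHOLD:
            state = "primary"
            print("Primary healthy again - failing back")

    # Example: a brief blip (3 bad checks) does not trigger a failover.
    for ok in [False, False, False, True, True]:
        on_check_result(ok)
    print("state:", state)  # still "primary"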

Going to a full multi data center deployment, with every data center serving live traffic, is probably the best way to do this, because the requirements of that setup naturally mean each data center can serve traffic on its own. Then it’s just a case of removing a failed data center from the rotation, rather than having to deal with switching and failover.

Whether you do manual or automated failover, checklists are valuable. They help avoid mistakes with manual processes, and they let you run through and confirm everything is working when an automated failover happens. We have checklists for our manual data center failover process and our load balancer failovers, plus a recovery checklist to ensure all core functionality is working after an outage.

Conclusions

Achieving full multi data center redundancy is a long process and always increases raw hosting costs. The stage of the business determines when it should be done – as you grow in revenue and customers, it becomes more appropriate, especially if people are relying on the service.

Depending on how critical your service is to your customers, it may not be worth it if you expect any downtime to be relatively short (hours). You have to weigh whether the likelihood of a prolonged outage from an extreme event like Hurricane Sandy – potentially days of downtime – justifies the effort.

You’ll also probably find other things not mentioned in this post, so feel free to comment with anything else you find in your own quest for full multi data center redundancy!
