Multi data center redundancy – application considerations
CEO & Founder of Server Density.
Published on the 30th May, 2013.
A few months ago we finished a long-running multi data center redundancy project to allow our server monitoring service, Server Density, to survive the complete failure of our primary data center. We can now fail over to another data center either with no customer impact or with minimal downtime (depending on the failure scenario).
There are three main types of multi data center deployments:
- Disaster recovery: you maintain servers in an additional data center which allows you to recover data in the event your primary data center is destroyed. It’s designed as a last resort after a catastrophic event and doesn’t usually allow you to run normal operations from the secondary data center. This is more like a backup.
- Hot standby: you maintain duplicate servers in an additional data center which are running and ready to take over immediately in the event the primary data center fails. The entire application can fail over very quickly, which means you can survive the destruction of the primary data center or fail over for planned maintenance; basically any event where the primary data center is unavailable.
- Live traffic handling: the same as the hot standby option but the additional data center (or data centers) serves live traffic too, so there is often no real “primary”. This is often used to locate the application closer to the user and tends to use some kind of geographical load balancing, such as using anycast DNS to route users to their closest location.
These are listed in increasing order of complexity and cost, and are generally implemented in that order, e.g. to have a live traffic handling data center you need all the same things as a hot standby facility. Each one gets more expensive because you have to duplicate servers with sufficient resources to take over live traffic, even if they’re not used (in the case of hot standby). Server Density has had a disaster recovery setup for several years and we recently upgraded to hot standby ability, with a view to moving up to live traffic handling in the future.
Application considerations for multi data center redundancy
There are two main aspects of implementing multi data center redundancy. The first is the sysadmin, network and server engineering work that needs to be done to deploy multiple servers and set up the failover mechanism, which will be covered in a future post. The second, covered here, is the preliminary work needed to get your application ready to handle switching data centers, and it has to come first.
Databases are perhaps the most complicated component to scale, and database failover is very product specific. Even so, there are a number of generic questions to ask of whichever database you’re actually using:
- Replication: this is almost certainly how you’re going to handle failover on the database. Consider how replication is implemented with regard to master/slave and how failover is triggered.
- How far behind are your slaves? Across regions there will be some replication lag due to network latency. Can your application handle some data being “lost” because it hasn’t been replicated yet, or do you need to ensure strong consistency?
- How does your database handle split brain conditions when the network partitions? Do you need an independent node in a third data center to arbitrate over which node becomes master?
- How does your application detect a change in database master? Does this even matter? Will your users get errors or will it happen automatically?
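To illustrate why the split brain question above often leads to a third data center, here is a minimal sketch of a majority vote rule. It is not how any particular database implements elections; the function name and the idea of counting one "voter" per data center are assumptions made for the example.

```python
def can_elect_master(reachable_voters: int, total_voters: int) -> bool:
    """Hypothetical rule: a partition may elect a master only if it can
    reach a strict majority of all voting nodes."""
    return reachable_voters > total_voters // 2


# Two data centers, one voter each: after a network partition each side
# can only see itself, so neither has a majority.
print(can_elect_master(1, 2))  # False

# With an arbiter in a third data center, the side that can still reach
# the arbiter sees 2 of 3 voters and may take over.
print(can_elect_master(2, 3))  # True

# The isolated side sees only itself (1 of 3) and must not become master.
print(can_elect_master(1, 3))  # False
```

With only two locations, a partition is indistinguishable from the other side being down, so an odd number of voters is what lets exactly one side proceed.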
In hot standby setups, you can make tradeoffs if you don’t want to, or can’t afford to, replicate every single component. This means your application will need to degrade gracefully depending on the failure scenario, something which works well with service-oriented architectures. For example, you could temporarily disable profile image uploading rather than duplicating large numbers of photos across data centers.
This requires your application to know when there is a failover situation, which can be done using a manual config flag that’s set as part of your failover process. Alternatively, you could set environment variables so the application knows which data center it is being served from and handles the situation appropriately.
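As a rough sketch of the flag-based approach, the snippet below disables features that depend on components which are not replicated. The environment variable name `FAILOVER_ACTIVE` and the feature name `profile_image_upload` are hypothetical, chosen only for illustration.

```python
import os

# Hypothetical: features that rely on components which exist only in the
# primary data center and are not replicated to the standby.
PRIMARY_ONLY_FEATURES = {"profile_image_upload"}


def failover_active(env=os.environ) -> bool:
    """True when the (hypothetical) FAILOVER_ACTIVE flag has been set to "1"
    as part of the failover process."""
    return env.get("FAILOVER_ACTIVE", "0") == "1"


def feature_enabled(name: str, env=os.environ) -> bool:
    """Disable primary-only features during a failover; leave the rest on."""
    if failover_active(env) and name in PRIMARY_ONLY_FEATURES:
        return False
    return True


# Passing a plain dict instead of os.environ makes the behaviour easy to check.
print(feature_enabled("profile_image_upload", {"FAILOVER_ACTIVE": "1"}))  # False
print(feature_enabled("dashboard", {"FAILOVER_ACTIVE": "1"}))             # True
```

Keeping the list of affected features in one place means the failover process only has to flip a single flag, and the same check can decide whether to show a notice to users.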
Your application might make some assumptions about the availability of local resources. Paths, hostnames or IPs might be hard-coded. You’ll need to audit your code to find out what assumptions have been made, and good testing will reveal anything you have missed.
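Part of such an audit can be automated. The sketch below flags IPv4 literals in source text; it is a starting point under simple assumptions, not a complete audit. It will not find hard-coded hostnames or paths, and it can produce false positives on anything shaped like four dotted numbers. The function name and sample input are made up for the example.

```python
import re

# Matches four dot-separated groups of 1 to 3 digits. It does not check
# that each group is in the range 0-255.
IPV4_LITERAL = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def find_hardcoded_ips(source: str) -> list[tuple[int, str]]:
    """Return (line number, matched text) for each IPv4-looking literal."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for match in IPV4_LITERAL.finditer(line):
            hits.append((lineno, match.group()))
    return hits


sample = 'DB_HOST = "10.0.0.12"\nTIMEOUT = 30\n'
print(find_hardcoded_ips(sample))  # [(1, '10.0.0.12')]
```

Running something like this over the code base gives a list of places to review by hand; each hit then needs a decision about whether it should move into configuration.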
It can be useful to display a banner to users when there is a failover condition, especially if certain features are disabled or performance drops because of increased latency or cold caches.
Local data e.g. sessions
Sessions are often implemented using some kind of local storage. This problem tends to be solved as part of needing to balance traffic across multiple servers but it’s also worth considering how this works across data centers. If you use a database for storing session data then it will be replicated already, but be careful if you are using file or load balancer based session handling.
Don’t forget websites
Your core application needs to be able to fail over to allow existing users to continue using it, but don’t forget your product websites and billing systems. It’s good practice to consider your website a separate product with its own redundancy and deployment mechanism because, for many businesses, it’s the sole location for new customers to find out about and sign up for the service.
First step completed
With all these considered and “fixed”, the next step is to start duplicating servers in a secondary location…which will be the topic of the next blog post!