Canary concept for system updates
Written by David Mytton
We use a fairly simple concept in our daily system operations that we call canaries. It is modeled on the miner’s canary, which was used to detect toxic gases before the human miners themselves were affected. The idea is that when performing system updates or changes, a predefined group of servers receives the updates first and is then monitored and tested for a period of time before the changes are rolled out across the entire fleet.
The canaries are defined so that they are representative of all types of servers in the infrastructure – web servers, database servers, load balancers, etc. This is because different servers will receive different updates due to what is installed on them and we want to make sure every update is tested somewhere first. This is important when you have a large number of servers like we do powering our server monitoring service, Server Density.
The canaries should cover not just the different use cases but also differences such as:
- Platform e.g. AWS Xen VMs vs Softlayer Citrix VMs
- OS release e.g. Ubuntu 10.04 vs 12.04
- Hardware e.g. RAID vs non-RAID
Further, these canaries should be part of a redundant cluster so if they have problems, they can be taken out of rotation without affecting the environment. It’s no good selecting your only production load balancer or database master as a canary, for example.
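The selection rules above can be sketched in a few lines of Python. This is only an illustration, not the tooling we actually run: the `Server` fields, the `pick_canaries` function and the "at least two servers per cluster" redundancy rule are all assumptions made for the example.

```python
from collections import defaultdict
from typing import NamedTuple


class Server(NamedTuple):
    # Hypothetical inventory record; the field names are assumptions.
    name: str
    role: str        # e.g. "web", "db", "lb"
    platform: str    # e.g. "aws-xen"
    os_release: str  # e.g. "ubuntu-12.04"
    hardware: str    # e.g. "raid"
    cluster: str     # redundant cluster the server belongs to


def pick_canaries(servers):
    """Pick one canary per (role, platform, OS release, hardware) combination.

    A server only qualifies if its cluster has at least one other member,
    so it can be taken out of rotation without affecting the environment.
    Returns (canaries, uncovered_combinations).
    """
    cluster_sizes = defaultdict(int)
    for s in servers:
        cluster_sizes[s.cluster] += 1

    canaries = {}
    uncovered = set()
    for s in sorted(servers, key=lambda s: s.name):
        combo = (s.role, s.platform, s.os_release, s.hardware)
        if combo in canaries:
            continue  # this combination already has a canary
        if cluster_sizes[s.cluster] >= 2:
            canaries[combo] = s
            uncovered.discard(combo)
        else:
            # Only non-redundant servers seen so far for this combination.
            uncovered.add(combo)
    return list(canaries.values()), uncovered
```

Any combination that ends up in `uncovered` is one where an update would first be tested on a server you cannot afford to lose, which is worth knowing before the update goes out.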
It’s common practice to roll out updates gradually, but tools like Landscape and Puppet make it easy to run mass actions across an entire fleet, so you need to be careful that they don’t simply become a way to automate destroying everything, faster. These tools let you group and/or tag nodes so that updates can be applied selectively, starting with the canaries.
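As a rough sketch of the rollout order described above, the following Python function updates the canaries, watches them for a soak period and only then touches the rest of the fleet. It does not call Landscape or Puppet: `apply_update` and `is_healthy` are hypothetical callables you would supply yourself, and the default timings are placeholder values.

```python
import time


def staged_rollout(canaries, fleet, apply_update, is_healthy,
                   soak_seconds=3600, check_interval=60, sleep=time.sleep):
    """Update canaries first; update the rest only if they stay healthy.

    apply_update(server) and is_healthy(server) are caller-supplied
    (hypothetical) hooks. Returns the list of servers that were updated.
    """
    updated = []
    for server in canaries:
        apply_update(server)
        updated.append(server)

    # Monitor the canaries for the whole soak period before going further.
    waited = 0
    while True:
        unhealthy = [s for s in canaries if not is_healthy(s)]
        if unhealthy:
            raise RuntimeError(f"canaries unhealthy, rollout halted: {unhealthy}")
        if waited >= soak_seconds:
            break
        sleep(check_interval)
        waited += check_interval

    # Canaries survived, so roll out to everything else.
    for server in fleet:
        if server not in canaries:
            apply_update(server)
            updated.append(server)
    return updated
```

The important design choice is that a failing canary raises an error and stops the rollout, rather than logging a warning and carrying on to the rest of the fleet.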
This approach suits bulk system updates such as core OS packages, kernels or Microsoft’s Patch Tuesday releases. Big, core application upgrades (such as updating MongoDB from 2.0 to 2.2) require more detailed testing and planning first, and may have to be done all in one go. In those cases, you can still use the canary concept if you have multiple clusters, by spacing out the planned upgrades over a period of time. For example, we have upgraded one of our clusters to MongoDB 2.2 and continue to monitor it before upgrading our core database cluster.
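Spacing out cluster upgrades can be planned with something as small as the sketch below. The cluster names, the criticality scores and the 14-day gap are invented for the example and are not our actual schedule.

```python
from datetime import date, timedelta


def upgrade_schedule(clusters, start, soak_days=14):
    """Plan cluster upgrades, least critical first, soak_days apart.

    clusters is a list of (name, criticality) pairs; a lower criticality
    number means the cluster is upgraded earlier and acts as the canary.
    """
    ordered = sorted(clusters, key=lambda c: c[1])
    return [(name, start + timedelta(days=i * soak_days))
            for i, (name, _) in enumerate(ordered)]


# Hypothetical clusters: the secondary cluster goes first, the core one last.
plan = upgrade_schedule([("core-db", 10), ("secondary-db", 1)],
                        start=date(2012, 9, 3))
for name, day in plan:
    print(day.isoformat(), name)
```

Each gap is the monitoring window for the cluster upgraded before it, so the most important cluster is upgraded only after the others have run the new version for a while.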