Written by David Mytton
Automatic failover is a great way to maximise uptime but it suffers from the possibility of flapping. This is where a failover mechanism is triggered multiple times because sensitivity is too high, or the error condition continually appears and disappears.
For example, in MongoDB a heartbeat is used to determine which nodes within a replica set are alive:
All nodes monitor all other nodes in the set via heartbeats. If the current primary cannot see half of the nodes in the set (including itself), it will fall back to secondary mode. This monitoring is a way to check for network partitions. Otherwise in a network partition, a server might think it is still primary when it is not.
Heartbeat requests are sent out every couple of seconds and can either receive a response, get an error, or time out (after ~20 seconds).
However, there is a condition where a single socket exception can cause a failover to another primary. This isn’t usually a problem because the new primary will just keep that state and there won’t be another failover. However, if you have a priority configured, the original node will resume the primary state after a short delay. If there is a minor networking issue over a period of time this can cause flapping between primary nodes.
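To make the effect of priority concrete, here is a toy model in Python. It is not MongoDB's election logic: the two-node set, the `count_failovers` function and the `prefer_a` flag are all invented for illustration, and the "short delay" before the preferred node takes over again is ignored. It only counts how often the primary role moves when node A suffers an intermittent fault.

```python
def count_failovers(a_reachable, prefer_a):
    """Count primary changes in a toy two-node set (nodes "A" and "B").

    a_reachable: one boolean per tick, True when node A answers its heartbeat.
    prefer_a: True models A having the higher priority, so it reclaims the
    primary role as soon as it is reachable again.
    """
    primary = "A"
    failovers = 0
    for reachable in a_reachable:
        if primary == "A" and not reachable:
            primary = "B"   # a single failed check is enough to fail over
            failovers += 1
        elif primary == "B" and reachable and prefer_a:
            primary = "A"   # priority pulls the role back to A
            failovers += 1
    return failovers


# An intermittent fault: A misses every third heartbeat.
ticks = [True, True, False] * 4

print(count_failovers(ticks, prefer_a=False))  # 1: B simply stays primary
print(count_failovers(ticks, prefer_a=True))   # 7: the role flaps back and forth
```

Without a priority the set settles after one failover. With a priority, every recovery hands the role back to A, so each new blip costs two more changes of primary.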
For our server monitoring service, Server Density, we have found MongoDB replica sets to be extremely good at handling failover, with almost zero visible impact for users and very fast detection of error conditions. Some ways to improve on this, or to build into your own failover mechanism, could include:
- Issue a second check after a period of time (e.g. wait a few seconds then retry) to confirm the error state before triggering the failover.
- Limit the total number of failovers before an alert is triggered for a human to investigate.
- Decrease the sensitivity for every failover within a certain period of time. This is possibly a good candidate for exponential backoff.
For each of these, there is a balance to maintain between ensuring the least amount of downtime by failing over quickly and avoiding unnecessary failovers (which themselves can cause increased load/timeouts/other visible effects for users).
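The three ideas in the list above can be combined in a few lines. The following is a minimal sketch in Python, not something MongoDB or Server Density provides: the `FailoverGuard` class, its parameter names and the default values (a 2 second base delay, 3 failovers per 5 minute window) are assumptions chosen for illustration. `check` stands for any health check of your own that returns True when the primary looks healthy.

```python
import time


class FailoverGuard:
    """Decide whether a failed health check should trigger a failover.

    All names and defaults here are hypothetical, for illustration only.
    """

    def __init__(self, check, base_delay=2.0, max_failovers=3, window=300.0,
                 clock=time.monotonic, sleep=time.sleep):
        self.check = check                  # callable, True when healthy
        self.base_delay = base_delay        # seconds to wait before the recheck
        self.max_failovers = max_failovers  # failovers allowed per window
        self.window = window                # length of the window in seconds
        self.clock = clock
        self.sleep = sleep
        self.recent = []                    # timestamps of recent failovers

    def _prune(self):
        """Forget failovers that happened before the current window."""
        cutoff = self.clock() - self.window
        self.recent = [t for t in self.recent if t > cutoff]

    def confirm_delay(self):
        """Exponential backoff: the wait doubles per recent failover."""
        self._prune()
        return self.base_delay * (2 ** len(self.recent))

    def evaluate(self):
        """Return "ok", "failover" or "alert"."""
        if self.check():
            return "ok"
        # Second check after a delay, to confirm the error state.
        self.sleep(self.confirm_delay())
        if self.check():
            return "ok"
        # Too many failovers recently: stop and ask a human instead.
        self._prune()
        if len(self.recent) >= self.max_failovers:
            return "alert"
        self.recent.append(self.clock())
        return "failover"
```

A monitoring loop would call `evaluate()` on each cycle and act only on "failover" or "alert". The trade-off described above shows up directly in `base_delay` and `window`: larger values suppress more flapping but lengthen the downtime when the failure is real.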