Individual node uptime doesn’t matter
Written by David Mytton
Everyone knows that a backup isn’t complete until it’s been fully restored and verified, and that this has to happen regularly. But I wonder how many people extend that testing to other parts of their infrastructure that might sit in one state for a long time.
For example, uptime stats are often shown off: “look how many years my server has been up!” That’s a great way to demonstrate the stability of the system and its applications, but would that server come back up after a reboot? What you should really be looking at is the overall uptime of the service that server is powering. That’s what matters, not the uptime of an individual node in the cluster.
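To make the distinction concrete, here is a minimal sketch of measuring service uptime rather than node uptime. It assumes a simple model that is not from any particular monitoring tool: the service counts as up whenever at least one node is up, outages are given as (start, end) second offsets, and a single node's outages never overlap each other. The names `service_downtime` and `availability` are made up for illustration.

```python
from typing import Dict, List, Tuple

Interval = Tuple[int, int]  # (start, end) in seconds, end exclusive


def service_downtime(node_outages: Dict[str, List[Interval]], window: Interval) -> int:
    """Seconds within `window` during which every node was down at once.

    Assumption: the service is up as long as at least one node is up.
    """
    start, end = window
    total_nodes = len(node_outages)
    if total_nodes == 0:
        return end - start  # no nodes at all means no service

    # Turn each outage into a "node went down" (+1) and "node came back" (-1) event.
    events = []
    for outages in node_outages.values():
        for s, e in outages:
            s, e = max(s, start), min(e, end)  # clip to the reporting window
            if s < e:
                events.append((s, 1))
                events.append((e, -1))
    events.sort()  # at equal timestamps, recoveries (-1) sort before failures (+1)

    down_nodes = 0
    downtime = 0
    all_down_since = 0
    for t, delta in events:
        if down_nodes == total_nodes:
            downtime += t - all_down_since  # close the current all-nodes-down period
        down_nodes += delta
        if down_nodes == total_nodes:
            all_down_since = t  # every node is now down: service outage begins
    return downtime


def availability(node_outages: Dict[str, List[Interval]], window: Interval) -> float:
    """Fraction of `window` during which the service was up."""
    start, end = window
    return 1 - service_downtime(node_outages, window) / (end - start)


# Two nodes rebooted at different times: no node has a long uptime,
# but the service never went down.
staggered = {"node-a": [(100, 200)], "node-b": [(300, 400)]}
# The same reboots overlapping for 50 seconds: a real service outage.
overlapping = {"node-a": [(100, 200)], "node-b": [(150, 250)]}
```

In this model, regularly rebooting individual nodes costs nothing in service uptime as long as the reboots never take every node down at the same moment.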
Power cycling every device (servers, firewalls, load balancers, network storage, etc.) should be part of the normal operating schedule. You want to be sure that servers will come back on their own and resume their normal role after a reboot. In the event of a power outage, you don’t want to suddenly discover that some of your servers have mysteriously failed (probably for different reasons) when the power is restored. In those cases you’ll likely already be busy fixing corruption and other issues!
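One way to fit power cycling into a normal schedule without hurting the service is to reboot in batches, keeping a minimum number of nodes in service at all times. The following is only a sketch of the planning step under that assumption. The function name `plan_reboot_batches`, the `min_in_service` parameter, and the node names are hypothetical, and the actual rebooting and health checking are left out.

```python
from typing import List, Sequence


def plan_reboot_batches(nodes: Sequence[str], min_in_service: int) -> List[List[str]]:
    """Split `nodes` into reboot batches so that at least `min_in_service`
    nodes stay up while any one batch is being power cycled."""
    if min_in_service < 1:
        raise ValueError("min_in_service must be at least 1")
    batch_size = len(nodes) - min_in_service
    if batch_size < 1:
        raise ValueError("not enough nodes to reboot any while keeping the minimum in service")
    # Consecutive slices of at most batch_size nodes each.
    return [list(nodes[i:i + batch_size]) for i in range(0, len(nodes), batch_size)]


# Hypothetical five-node cluster that needs three nodes up to carry the load.
batches = plan_reboot_batches(["web1", "web2", "web3", "web4", "web5"], min_in_service=3)
```

Each batch should only start once the previous batch has come back and resumed its normal role, which is exactly the behaviour the exercise is meant to verify.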
The same goes for testing failover of critical systems: it’s better to discover problems when you’re specifically looking for them in a known maintenance window than when you actually need failover to work.