The holiday server checklist
CEO & Founder of Server Density.
Published on the 10th December, 2009.
Everyone knows that problems always seem to happen when you are asleep, on holiday or away from your computer. Many people will be taking time off work for the Christmas holidays so, just as you would check water, oil and tyre pressure before a road trip, here’s a short list of things to check on your server(s) so you can relax and enjoy the warmth of a laptop fire without annoying down alerts.
Make sure you have sufficient disk space
If you’re using your servers in a steady and consistent manner, it should be straightforward to work out how much disk space you use per day and project that forward to ensure you won’t run out on New Year’s Eve!
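As a rough illustration of that projection, here is a minimal Python sketch. It assumes growth is linear, and the daily growth figure (500 MiB) and the root path are made-up values you would replace with your own measurements.

```python
import shutil


def days_until_full(free_bytes, daily_growth_bytes):
    """Estimate days until the disk is full, assuming steady linear growth."""
    if daily_growth_bytes <= 0:
        # Usage is flat or shrinking, so the disk never fills at this rate.
        return float("inf")
    return free_bytes / daily_growth_bytes


if __name__ == "__main__":
    usage = shutil.disk_usage("/")  # assumption: the root filesystem is the one to watch
    daily_growth = 500 * 1024 ** 2  # assumption: 500 MiB written per day
    print("Days until full: %.1f" % days_until_full(usage.free, daily_growth))
```

If the result is smaller than the number of days you plan to be away, free up space or add capacity before you leave.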
Check your /tmp partition
Following from the above, check that your /tmp partition has not filled up and has plenty of space free. The OS should manage this for you, but there are situations where it can fill up quickly (e.g. temporary log files).
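One way to check this from a script is sketched below. The 20% threshold is an arbitrary assumption, not a recommendation, and the function names are our own.

```python
import shutil


def free_percent(path):
    """Return the percentage of free space on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total * 100


def has_enough_space(path, min_free_percent=20.0):
    """True if the filesystem has at least min_free_percent free (threshold is an assumption)."""
    return free_percent(path) >= min_free_percent


if __name__ == "__main__":
    if not has_enough_space("/tmp"):
        print("Warning: /tmp has only %.1f%% free" % free_percent("/tmp"))
```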
Check log file sizes
System logs in /var/log should rotate but you should consider clearing some of the older files out if you no longer need them. This will help free up space.
Don’t forget log files that don’t rotate themselves, like Apache access/error logs. These can become very big very quickly.
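To find the biggest offenders, something like the following sketch can help. It walks a directory and lists the largest files. The /var/log path comes from the text above, and the limit of 10 is an arbitrary choice.

```python
import os


def largest_files(root, limit=10):
    """Return up to `limit` (size_in_bytes, path) pairs under root, largest first."""
    sizes = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                sizes.append((os.path.getsize(path), path))
            except OSError:
                # File vanished or is unreadable; skip it.
                continue
    return sorted(sizes, reverse=True)[:limit]


if __name__ == "__main__":
    for size, path in largest_files("/var/log"):
        print("%12d  %s" % (size, path))
```

You may need to run this as root to see everything under /var/log.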
Dry run your backups
You don’t want to be messing with restoring broken backups on Christmas Day – do a dry run of your latest and archived backups to ensure they are working.
Tidy up any “hacked” work
Last night we experienced an issue with our server timezones, which were all reset when an automated OS update ran, because they had been set using a symlink (as recommended by many blogs). Make sure you use the official methods of setting things (i.e. not just symbolic links) so that an update running while you are away doesn’t undo your configuration.
Check your update exclude files
If you use an automated update system like yum, make sure you have correctly defined your excludes. This will ensure that critical parts of your server (e.g. the kernel, Apache or PHP) don’t automatically update themselves and break your systems.
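For yum, excludes are normally set with an exclude= line in the [main] section of /etc/yum.conf. The sketch below reads that line back so you can eyeball it. It assumes your excludes live in that file rather than in a per-repository file, and the package globs in the sample are only illustrative.

```python
import configparser


def yum_excludes(conf_text):
    """Return the list of exclude patterns from the [main] section of yum.conf text."""
    parser = configparser.ConfigParser(strict=False, interpolation=None)
    parser.read_string(conf_text)
    raw = parser.get("main", "exclude", fallback="")
    # yum expects a space separated list; shell-style globs are allowed.
    return raw.split()


if __name__ == "__main__":
    # Illustrative sample, not a recommended configuration.
    sample = "[main]\nexclude=kernel* httpd* php*\n"
    print(yum_excludes(sample))
    # To check a real server (assumed path):
    # with open("/etc/yum.conf") as f:
    #     print(yum_excludes(f.read()))
```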
Check your alerting
If you use something like Pingdom to alert when your sites are down, make sure it is configured properly. Ensure the “check for string” matches what you expect and look at the alert contacts to be sure that the right people get notified. The same goes if you are using our server monitoring tool, Server Density.
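You can sanity check the string yourself before relying on the monitoring service. This is only a sketch of the idea using Python's standard library, not how Pingdom or Server Density implement their checks. The URL and expected string are placeholders.

```python
import urllib.request


def body_matches(body, expected):
    """True if the expected string appears in the page body."""
    return expected in body


def page_contains(url, expected, timeout=10.0):
    """Fetch url and report whether the expected string is present; False on any network error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError:
        # Covers connection failures, timeouts and HTTP error statuses.
        return False
    return body_matches(body, expected)


if __name__ == "__main__":
    # Placeholder values: swap in your own site and a string from its homepage.
    print(page_contains("http://www.example.com/", "Example Domain"))
```

If this returns False for a page you know is up, the string in your monitoring check is probably out of date.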
Run a reboot test
If you have full redundancy/failover, test a server reboot to check that a) the failover works and b) your server(s) come back up from a reboot.
Check your secondary machine
If you’re taking a laptop with you on holiday, make sure you can connect to your servers – do you have the latest VPN client? Do you have your SSH keys installed?
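A quick way to confirm the laptop can at least reach the SSH port on each server is sketched below. It only tests that the port accepts a TCP connection, not that your keys are accepted, and the hostname is a placeholder.

```python
import socket


def port_open(host, port, timeout=5.0):
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # Placeholder hostname; 22 is the default SSH port.
    for host in ["server1.example.com"]:
        print(host, "reachable" if port_open(host, 22) else "NOT reachable")
```

Run it once from home and once over the VPN so you know both routes work before you travel.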
And when all else fails and you get that phone alert at 4am, don’t forget the list of dumb things to check and take your time – everyone else is on holiday too so nobody will actually notice*!!
* This is obviously a joke. I know I hate it when I see my uptime drop below 100%, even if nobody else cares!