A guide to handling incidents, downtime and outages
CEO & Founder of Server Density.
Published on the 16th October, 2014.
Outages and downtime are inevitable. Designing your systems to handle failure is a key part of modern infrastructure architecture, and it makes it possible to survive most problems. However, there will be incidents you didn’t think about, software bugs you didn’t catch and other events that result in downtime for your service.
Microsoft, Amazon and Google spend billions of dollars every quarter and even they still have outages. How much do you spend?
Some companies seem to have problems constantly and suffer for it unnecessarily. Regular outages ultimately become unacceptable, but if you adopt a few key principles and design your systems properly, customers can forgive you for the few service incidents you do have.
Step 1: Planning
If critical alerts result in panic and chaos then you deserve to suffer from the incident! There are a number of things you can do in advance to ensure that when something does go wrong, everyone on your team knows what they should be doing.
- Put in place the right documentation. This should be easily accessible, searchable and up to date. We use Google Docs for this.
- Use proper config management, be it Puppet, Chef, Ansible, SaltStack or some other system, so you can make mass changes to your infrastructure in a controlled manner. It also helps your team understand novel issues because the code that defines the setup is easily accessible.
- Be aware of your whole system. Unexpected failures can come from unusual places. Are you hosted on AWS? What happens if AWS suffers an outage and the tools you need for internal communication, such as Slack or HipChat, go down with it? Are you hosted on Google Cloud? What happens if your Gmail is unavailable during a Google Cloud outage? Are you using a data center within the city you live in? What happens if there’s a weather event and the phone service is knocked out?
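One way to act on that last point is to keep a small inventory of which providers your tools depend on, then look for overlap with production. The sketch below is a minimal illustration only: the function name `shared_failure_domains` and the provider and tool names are hypothetical placeholders, not a real inventory of any service.

```python
def shared_failure_domains(production_providers, tool_providers):
    """Return the tools that depend on a provider production also uses.

    production_providers: iterable of provider names production runs on.
    tool_providers: dict mapping each tool to the providers it depends on.
    """
    production = set(production_providers)
    at_risk = {}
    for tool, providers in tool_providers.items():
        overlap = production & set(providers)
        if overlap:
            # This tool could go down in the same outage as production.
            at_risk[tool] = sorted(overlap)
    return at_risk


# Hypothetical inventory: replace with what you actually know about your tools.
production = ["cloud-a"]
tools = {
    "team-chat": ["cloud-a"],
    "email": ["cloud-b"],
    "status-page": ["cloud-a", "cloud-c"],
}
print(shared_failure_domains(production, tools))
```

Any tool that shows up in the result needs a fallback that does not share the same provider, for example a phone tree or a second chat service.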
Step 2: Be ready to handle the alerts
Some people hate being on call, others love it! Either way, you need a system to handle on-call rotations, escalating issues to other members of the team, planning for reachability and allowing people to go off call after incidents. We use PagerDuty on a weekly rotation through the team. We consider things like who is available, internet connectivity, illness and holidays, and we loop in product engineering so that issues waking people up can be resolved quickly.
More and more outages are caused by software bugs reaching production, and it’s never just a single thing that goes wrong: a cascade of problems culminates in downtime. That is why you need rotations amongst different teams, such as frontend engineering, not just ops.
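To make the idea of a weekly rotation with escalation concrete, here is a minimal sketch. It is not how PagerDuty works internally; the function `on_call_for`, its parameters and the engineer names are assumptions made up for illustration. It picks a primary and a secondary (the escalation contact) for a given day and skips anyone marked unavailable.

```python
from datetime import date


def on_call_for(day, engineers, rotation_start, unavailable=None):
    """Return (primary, secondary) on call for the week containing `day`.

    engineers: ordered list of names in the rotation.
    rotation_start: date on which the first engineer's week began.
    unavailable: dict mapping a name to a set of dates they cannot be reached.
    """
    unavailable = unavailable or {}
    week = (day - rotation_start).days // 7
    n = len(engineers)
    # Walk the rotation from this week's slot, skipping unreachable people.
    reachable = []
    for i in range(n):
        person = engineers[(week + i) % n]
        if day not in unavailable.get(person, set()):
            reachable.append(person)
    if len(reachable) < 2:
        raise ValueError("need at least two reachable engineers")
    return reachable[0], reachable[1]


# Hypothetical team and dates.
team = ["ana", "ben", "cho"]
start = date(2014, 10, 13)
holidays = {"ben": {date(2014, 10, 20)}}
print(on_call_for(date(2014, 10, 20), team, start, holidays))
```

Running one rotation like this per team (ops, frontend, and so on) gives each incident a named person from every area that might be involved.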
Step 3: Deal with it, using checklists
Have a defined process in place, ready to run through whenever the alerts go off. Using a checklist removes unnecessary thinking so you can focus on the real problem, and ensures key actions are taken and not forgotten. Have a channel for communication both internally and externally. There’s nothing worse than being the customer of a service that is down and having no idea whether they’re working on it or not.
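A checklist can be as simple as an ordered list of steps with a record of when each was completed. The sketch below assumes a hypothetical `IncidentChecklist` class and example steps invented for illustration; your own list will differ.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentChecklist:
    steps: list
    completed: dict = field(default_factory=dict)  # step -> completion time

    def complete(self, step, now=None):
        """Mark a step as done and record when it happened."""
        if step not in self.steps:
            raise KeyError(f"unknown step: {step}")
        self.completed[step] = now or datetime.now(timezone.utc)

    def remaining(self):
        """Steps not yet done, in their original order."""
        return [s for s in self.steps if s not in self.completed]

    def next_step(self):
        """The first outstanding step, or None when everything is done."""
        todo = self.remaining()
        return todo[0] if todo else None


# Example steps are illustrative only.
checklist = IncidentChecklist(steps=[
    "Acknowledge the alert",
    "Open the internal incident channel",
    "Post an initial public status update",
    "Investigate and mitigate",
    "Post a resolution update",
])
checklist.complete("Acknowledge the alert")
print(checklist.next_step())
```

The recorded timestamps also give you a ready-made timeline when you come to write up the incident afterwards.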
Step 4: Write up a detailed postmortem
This is the opportunity to win back trust. If you followed the steps above and provided accurate, useful information during the outage so people knew what was going on, the postmortem is your chance to explain what happened, what went wrong and, crucially, what you are going to do to prevent it from happening again. Outages highlight unknown system flaws, and it’s important to tell your users that the hole no longer exists or is in the process of being closed.
@serverdensity Great incident report!
— Or Weinberger (@orweinberger) August 27, 2014
Incidents, downtime and outages video
We hosted an open discussion on best practices for handling incidents, downtime and outages with Charlie Allom (Network Engineer) and Brian Trump (Site Reliability Engineer) from Yelp. Contrasting a small company with a much larger one, we chatted through how we deal with things such as:
- On call – rotations, scheduling, systems and policies
- Preparing for downtime – teams, systems and product architecture
- Checklists and playbooks
- How we actually handle incidents
- Postmortems
Here’s the full video: