Back in February we started centralising and revamping all our ops documentation. I played around with several different tools and ended up picking Google Docs to store all the various pieces of information about Server Density, our server monitoring application.
We make use of Puppet to manage all of our infrastructure and this acts as much of the documentation – what is installed, configuration, management of servers, dealing with failover and deploys – but there is still need for other written docs. The most important is the incident response guide, which is the step by step checklist all our on-call team run through when an alert gets triggered.
Why do you need an incident response guide?
As your team grows, you can’t just rely on one or two people knowing everything about how to deal with incidents in an ad-hoc manner. Systems will become more complex and you’ll want to distribute responsibilities around team members, so not everyone will have the same knowledge. During an incident, it’s important that the right things get done in the right order. There are several things to remember:
- Log everything you do. This is important so that other responders can get up to speed and know what has been done, but is also important to review after the incident is resolved so you can make improvements as part of the postmortem.
- Know how to communicate internally and with end-users. You want to make sure you are as efficient as possible as a team, but also keep your end-users up to date so they know what is happening.
- Know how to contact other team members. If the first responder needs help, you need a quick way to raise other team members.
All this is difficult to remember during the stress of an incident so what you need is an incident response guide. This is a short document that has clear steps that are always followed when an alert is triggered.
What should you have in your incident response guide?
Our incident response guide contains 6 steps which I’ve detailed below, expanded upon to give some insight into the reasoning. In the actual document, they are very short because you don’t want to have complex instructions to follow!
- Log the incident in JIRA. We use JIRA for project management and so it makes sense to log all incidents there. We open the incident ticket as soon as the responder receives the alert and it contains the basic details from the alert. All further steps taken in diagnosing and fixing the problem are logged as comments. This allows us to refer to the incident by a unique ID, it allows other team members to track what is happening and it means we can link the incident to followup bug tasks or improvements as part of the postmortem.
- Acknowledge the alert in PagerDuty. We don’t acknowledge alerts until the incident is logged because we link the acknowledgment with the incident. This helps other team members know that the issue is being investigated rather than someone has accidentally acknowledged the alert and forgotten about it.
- Log into the Ops War Room in Hipchat. We use Hipchat for real time team communication and have a separate “war room” which is used only for discussing ongoing incidents. We use sterile cockpit rules to prevent noise and also pipe in alerts into that room. This allows us to see what is happening, sorted by timestamp. Often we will switch to using a phone call (usually via Skype because Google Hangouts still uses far too much CPU!) if we need to discuss something or coordinate certain actions, because speaking is faster than typing. Even so, we will still log the details in the relevant JIRA incident ticket.
- Search the incident response Google Docs folder and check known issues. We have a list of known issues e.g. debug branches deployed or known problems waiting fixes which sometimes result in on-call alerts. Most of the time though it is something unusual and we have documentation on all possible alert types so you can easily search by error string and find the right document, and the steps for debugging. Where possible we try to avoid triggering on-call alerts to real people where a problem can be fixed using an automated script, so usually these steps are debug steps to help track down where the problem is.
- If the issue is affecting end-users, do a post to our status site. Due to the design of our systems, we very rarely have incidents which affect the use of our product. However, where there is a problem which causes customer impact, we post to our public status page. We try and provide as much detail as possible and post updates as soon as we know more, or at the very least every 30m even if there is nothing new to report. It seems counter-intuitive that publicising your problems would be a good thing, but customers generally respond well to frequent updates so they know when problems are happening. This is no excuse for problems happening too frequently but when they do happen, customers want to know.
- Escalate the issue if you can’t figure it out. If the responder can’t solve the issue then we prefer they bring in help sooner rather than prolong the outage. This is either by escalating the alert to the secondary on-call in PagerDuty or by calling other team members directly.
Replying to customer emails
Another note we have is regarding support tickets that come in reporting the issue. Inevitably some customers are not aware of your public status page and they’ll report any problems directly to you. We use Zendesk to set the first ticket as a “Problem” and direct the customer to our status page. Any further tickets can be set as “Incidents” of that “Problem” so when we solve the issue, we can do a mass reply to all linked tickets. Even though they can get the same info from the status page, it’s good practice to email customers too.
@serverdensity Great incident report!
— Or Weinberger (@orweinberger) August 27, 2014
What do you have in your playbook?
Every company handles incidents differently. We’ve built this process up over the years of experience, learning how others do things and understanding our own feelings when services we use have outages. You can do a lot to prevent outages but you can never eliminate them, so you need to spend as much time planning the process for handling them. What do you have in your incident response processes? Leave a comment!