Incident response for startups
The AWS outage last week affected a lot of people, including the Platform as a Service (PaaS) provider Heroku. What made Heroku’s handling of the situation interesting is their incident response system. This was highlighted in a post on highscalability.com yesterday, which discussed the incident command system that appeared to be in use.
Of particular interest is what happened after the incident was detected and how the problem was escalated throughout the Heroku team. Because the outage has lasted so long, it is also interesting to see how it continues to be managed through to resolution. This is summarised well in the highscalability.com post:
- Monitoring systems immediately alerted Ops to the problem.
- An on-call engineer applied triage logic to the problem and classified it as serious, which caused the on-call Incident Commander to be woken out of restful slumber.
- The IC contacted AWS. They were in constant contact with their AWS representative and worked closely with AWS to solve problems.
- The IC alerted Heroku engineers. A full crew: support, data, and other engineering teams worked around the clock to bring everything back online.
- The Ops team instituted an emergency incident commander rotation of 8 hours per shift, once it became clear the outage was serious, in order to keep a fresh mind in charge of the situation at all times.
This seems like a great system: it was planned in advance, tested, and worked properly when a real incident occurred. However, it makes one assumption, which is team size. There are quite a few people involved:
- An on-call engineer (and presumably a backup in case that engineer did not respond)
- An on-call incident commander (again, probably with a backup in place too)
- Application engineers – support, data and other teams
- Sufficient numbers of additional people to be able to rotate every 8 hours
This is fine for a company that has a large team (and has been acquired by a multi-billion-dollar company) but how can this scale down to a smaller startup?
The response plan has multiple layers which can be broken out to examine who can staff them and what you might be able to outsource until you have a sufficiently large team to cover all aspects internally.
Layer 1: monitoring and notification
Assuming you have monitoring set up properly, you need to ensure notifications get out to the right people. Being on-call in a startup will mean frequent wakeup calls and needing to have internet access at all times, so rotating this amongst your co-workers is going to be important. Sharing a single phone could work, but it’s easy to make mistakes where a human process is involved.
Instead, we use PagerDuty. This already integrates with Server Density, but you can also just have your monitoring system e-mail PagerDuty directly to trigger an alert. The alert gets assigned to the on-call engineer, and PagerDuty lets you define rules about who is on call and when, and how escalations work if that engineer doesn’t acknowledge the problem within a specific time period.
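To make the idea of escalation rules concrete, here is a minimal sketch in Python. It is not PagerDuty’s API or configuration format; the `EscalationStep` class, the `who_to_notify` function, the contact names and the waiting times are all hypothetical, chosen only to show how "escalate if nobody acknowledges within N minutes" can be expressed.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EscalationStep:
    contact: str       # hypothetical label for whoever gets paged at this step
    wait_minutes: int  # how long to wait for an acknowledgement before moving on


def who_to_notify(policy: List[EscalationStep],
                  minutes_unacknowledged: int) -> Optional[str]:
    """Return the contact who should currently be paged.

    Returns None once every step has timed out without an acknowledgement.
    """
    elapsed = 0
    for step in policy:
        elapsed += step.wait_minutes
        if minutes_unacknowledged < elapsed:
            return step.contact
    return None


# Assumed example policy: the names and timings are illustrative only.
policy = [
    EscalationStep("primary on-call", 15),
    EscalationStep("backup on-call", 15),
    EscalationStep("incident commander", 30),
]

for minutes in (0, 15, 30, 60):
    print(minutes, who_to_notify(policy, minutes))
```

A hosted service handles this for you, along with the phone calls and SMS, so a sketch like this is mainly useful for agreeing on the rules with your team before you configure them.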
Layer 2: on-call engineer
Once an incident has been detected, you need someone to respond. This can be you or your co-workers during working hours but you can’t be around 24/7, so you either need several people to handle this role in rotation, or get someone else involved. This role is perfect for an outsourced server support company like Roundhouse Support (customers of Server Density), and there are plenty of others too.
Since a 3rd party will rarely have the same level of knowledge about your infrastructure as your own team, you will want to define some key rules and processes for the outsource company to follow. In Heroku’s case they evidently have a set of rules to check basic fixes and then classify the problem appropriately. This can be very detailed depending on the alert received – try restarting services, checking logs and so on. The key here is to classify the problem and take the next steps:
- Gather diagnostic information and alert appropriate partners. In Heroku’s case this was opening a case with Amazon but it could be opening a support ticket with your hosting provider or submitting a ticket to your database vendor. This is very important because escalating to the incident commander may take a bit of time and you want troubleshooting to be ongoing in the meantime. The incident commander needs to arrive with things already happening and ideally, the situation diagnosed.
- Open an incident on the relevant status website to alert customers. Heroku provided regular updates to show they were still working on the problem. There is a fine balance to strike here: Amazon were criticised for their lack of communication and for giving incorrect ETAs for a fix.
- Escalate the problem to the incident commander.
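The triage steps above could be written down for an outsourced team as a simple runbook. The following Python sketch is one possible way to express it, not a description of Heroku’s actual process: the `Alert` fields, the `triage` function and the wording of each step are assumptions, as is the choice to simply log an incident for later review when a basic fix resolves it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Alert:
    description: str        # what the monitoring system reported
    provider: str           # hypothetical: the third party to contact
    basic_fix_worked: bool  # result of the first-line checks (restart, logs)


def triage(alert: Alert) -> List[str]:
    """Return the ordered list of next steps for the on-call engineer."""
    if alert.basic_fix_worked:
        # Assumption: a resolved alert only needs a follow-up in working hours.
        return ["log the incident for review during working hours"]
    return [
        "gather diagnostic information",
        f"open a support case with {alert.provider}",
        "open an incident on the status website",
        "escalate to the incident commander",
    ]


# Illustrative alert; the values are made up for the example.
alert = Alert("database volume unresponsive", "hosting provider", False)
for step in triage(alert):
    print("-", step)
```

Keeping the steps in a fixed order means troubleshooting and customer communication are already under way by the time the incident commander picks up the escalation.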
Layer 3: incident commander
This is the stage where your actual team can become involved – everything before can be automated and outsourced. Here one person takes over from the outsourced company, confirming the diagnosis and taking over managing communication with any 3rd parties. They also make the decision to involve the engineering team if necessary.
In Heroku’s case they were able to start implementing technical fixes (restoring backups to new instances) which required the engineering team, but it might be that you are waiting on your hosting provider to replace a failed hardware component and you don’t actually need to wake up the engineering team.
Layer 4: shift rotation
This is the only layer that you may be unable to scale if your team is small. We have just hired 2 new engineers to bring our team size to 5, so we now have more scope to allow for shift rotations of the incident commander, but with fewer people your incident commander may also be part of your engineering team.
Regardless, you need to ensure everyone is rested before mistakes get made, even if that means 2 people sleeping whilst 1 maintains a watchful eye over the situation. This only becomes relevant for incidents of long duration, which are thankfully extremely rare. And in many cases you’ll be waiting on a 3rd party (e.g. your hosting provider), whom you can ask to call and wake you up when they have an update.
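As a rough illustration of a shift rotation, the sketch below generates a schedule of 8 hour shifts for a small team. The `build_rotation` function, the names and the start time are all hypothetical; the only detail taken from the Heroku example is the 8 hour shift length.

```python
from datetime import datetime, timedelta
from typing import List, Tuple


def build_rotation(people: List[str], start: datetime,
                   shift_hours: int, shifts: int) -> List[Tuple[datetime, str]]:
    """Return (shift start time, person in charge) pairs, cycling through people."""
    if not people:
        raise ValueError("at least one person is needed")
    return [
        (start + timedelta(hours=shift_hours * i), people[i % len(people)])
        for i in range(shifts)
    ]


# Hypothetical three-person team and an arbitrary start time.
team = ["alice", "bob", "carol"]
schedule = build_rotation(team, datetime(2024, 1, 1, 0, 0),
                          shift_hours=8, shifts=4)
for starts_at, person in schedule:
    print(starts_at.isoformat(), person)
```

With three people on 8 hour shifts, each person is in charge for 8 hours and off for 16, which matches the "2 sleeping, 1 watching" arrangement described above.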
The whole cake
The highscalability.com post sums up the lessons learned very well – plan in advance, test and communicate regularly with your customers. Incidents will happen and it’s how you deal with them that determines whether customers will stick around.
If you have any tricks, tips or ideas about how to handle this in smaller companies, leave a comment!