How and why we use DevOps checklists
CEO & Founder of Server Density.
Published on the 11th August, 2015.
According to the investigation, the pilots forgot to disengage a critical wing adjustment mechanism before takeoff. This was evidence that even veteran pilots could miss key steps or do things in the wrong order. With hundreds of lives at stake, it was necessary to design around this constraint.
The checklist does exactly that. It compensates for the “limits of human memory and attention.”
Indeed, Gawande — a doctor himself — writes how key steps in medical procedures were routinely missed, resulting in infections and preventable fatalities. The adoption of checklists reduced those occurrences, and they are now used in a wide range of healthcare settings.
Checklists for DevOps
Like professionals in healthcare and aviation, sysadmins are often responsible for systems that touch many lives. Here at Server Density we appreciate the complexities of the systems we run. We also recognise the limits of the people who run them: us. That’s why we use checklists for much of what we do.
Checklists are particularly effective in situations where there is:
There is only so much a person can remember, reproduce and execute reliably.
Stress and Fatigue
Incidents may happen at awkward times, like early in the morning when mistakes are more likely. Sysadmins are vulnerable to stress and fatigue like everyone else.
You’d expect a seasoned engineer to intuitively know how to deal with a wide range of contingencies. That is a good thing. Experience and tenure, however, could also encourage people to rely on “gut instincts”, to “wing it” and “shoot from the hip.” In complex situations those attitudes could prove hazardous.
A checklist is a good way to mitigate those problems because it helps us define our response in advance and make it available to everyone. We therefore ensure that every member of our team is taking the right steps in the right order, each and every time.
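To make the "right steps in the right order" idea concrete, here is a minimal sketch in Python. It is an illustration only and not a description of any real tooling: the `Checklist` class and the step names in the usage note are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Checklist:
    """An ordered list of steps that must be completed in sequence."""

    name: str
    steps: List[str]
    done: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the next outstanding step, or None when all are done."""
        if len(self.done) < len(self.steps):
            return self.steps[len(self.done)]
        return None

    def complete(self, step: str) -> None:
        """Mark a step as done, refusing anything attempted out of order."""
        expected = self.next_step()
        if expected is None:
            raise ValueError("checklist already complete")
        if step != expected:
            raise ValueError(
                f"out of order: expected {expected!r}, got {step!r}"
            )
        self.done.append(step)

    def is_complete(self) -> bool:
        """True once every step has been completed."""
        return len(self.done) == len(self.steps)
```

A responder would create something like `Checklist("incident", ["acknowledge alert", "open incident log", "notify team"])` (hypothetical steps) and call `complete()` on each step in turn. Skipping ahead raises an error, which is the point: the order is decided in advance, not at 2:00 in the morning.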
Checklists at Server Density
Here is one of our own checklists. It defines what our on-call first responders do when a critical incident occurs (we also wrote a guide on how we handle incidents, outages and downtime):
As “common sense” and obvious as the steps may be, they carry great importance for the health of our infrastructure. So we spell them out.
As you would expect, we take our uptime metrics seriously. We’ve got some pretty capable folks taking care of our servers. And we use checklists. Very prescriptive ones like the one below. This one details the steps our on-call people follow when faced with a server failover:
We sed/awk/grep all day long. Even so, our checklists assume we are doing it for the very first time. At 2:00 in the morning we might have trouble finding the lights, let alone the Puppet master for our configuration.
Here is another example. We use this checklist when a server we monitor stops sending data:
DevOps checklists are as unique as the teams that use them. Each team has its own recipe for doing things, and as technology stacks evolve, so do the checklists required to run them.
We aim to have a checklist for every scenario: restoring a backup to production, deploying fixes, switching primary data centres, running database consistency checks, and responding to traffic spikes, security breaches, critical alerts, and a long list of other contingencies.
Google Docs is a key part of our on-call playbook, so that’s where we store our checklists for everyone to access and update as needed.
Checklists are not static
Relying on checklists does not mean we are inflexible about how we do things. For us, creating checklists is an excellent opportunity to take a step back and review the entirety of our stack.
DevOps checklists work best when we schedule time to update and improve on them.
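One way to keep that schedule honest is to flag checklists that have not been reviewed for a while. The sketch below assumes you record a last-reviewed date for each checklist. The 90-day interval, the function name and the checklist names in the usage note are all hypothetical, chosen purely for illustration.

```python
from datetime import date, timedelta
from typing import Dict, List

# Assumed review cadence; pick whatever suits your team.
REVIEW_INTERVAL = timedelta(days=90)


def overdue_for_review(
    last_reviewed: Dict[str, date],
    today: date,
    interval: timedelta = REVIEW_INTERVAL,
) -> List[str]:
    """Return names of checklists not reviewed within the interval.

    The result is ordered with the longest-neglected checklist first.
    """
    stale = [
        (reviewed, name)
        for name, reviewed in last_reviewed.items()
        if today - reviewed > interval
    ]
    return [name for _reviewed, name in sorted(stale)]
```

Calling `overdue_for_review({"server failover": date(2015, 7, 1), "backup restore": date(2015, 1, 15)}, today=date(2015, 8, 11))` would flag only the backup restore checklist, since the failover one was reviewed within the last 90 days.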
When it comes to server monitoring, we believe checklists are an important step towards reliable systems. They help our team respond to issues in a consistent and timely manner. This translates to increased uptime and a better capacity to serve our customers.
What about you? Are you using checklists?