How and why we use DevOps checklists

By David Mytton,
CEO & Founder of Server Density.

Published on the 11th August, 2015.

In his book The Checklist Manifesto, Atul Gawande tells the story of the first pre-flight checklist, created by Boeing following the fatal crash of a B-17 in 1935.

According to the investigation, the pilots forgot to disengage a critical wing adjustment mechanism before takeoff. It was evidence that even veteran pilots could miss key steps or do things in the wrong order. With hundreds of lives at stake it was necessary to design around this constraint.

The checklist does exactly that. It compensates for the “limits of human memory and attention.”

Indeed, Gawande, a doctor himself, describes how key steps in medical procedures were routinely missed, resulting in infections and preventable fatalities. The adoption of checklists reduced those occurrences, and they are now used in a wide range of healthcare settings.

Checklists for DevOps

Like doctors and pilots, sysadmins are often responsible for systems that touch many lives. Here at Server Density we appreciate the complexities of the systems we run. We also recognise the limits of the people who run them: us. That’s why we use checklists for much of what we do.

Image (checklist tattoo): Only so much that human memory can remember. Source: http://bit.ly/1Wc7m0p

Checklists are particularly effective in situations where there is:

Complexity

There is only so much a person can reliably remember, reproduce and execute.

Stress and Fatigue

Incidents may happen at awkward times, like early in the morning when mistakes are more likely. Sysadmins are vulnerable to stress and fatigue like everyone else.

Ego

You’d expect a seasoned engineer to intuitively know how to deal with a wide range of contingencies. That is a good thing. Experience and tenure, however, could also encourage people to rely on “gut instincts”, to “wing it” and “shoot from the hip.” In complex situations those attitudes could prove hazardous.

A checklist is a good way to mitigate those problems because it helps us define our response in advance and make it available to everyone. We therefore ensure that every member of our team is taking the right steps in the right order, each and every time.
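To make "the right steps in the right order" concrete, here is a minimal Python sketch of a checklist that refuses to tick off a step before the ones that precede it. The `Checklist` class and the step names are hypothetical, invented purely for illustration; they do not describe Server Density's actual tooling or playbook.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Checklist:
    """An ordered checklist whose steps must be completed in sequence."""

    name: str
    steps: List[str]
    completed: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the next outstanding step, or None when all are done."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete(self, step: str) -> None:
        """Mark a step as done, rejecting anything that is out of order."""
        expected = self.next_step()
        if expected is None:
            raise ValueError("checklist already finished")
        if step != expected:
            raise ValueError(f"out of order: expected {expected!r}, got {step!r}")
        self.completed.append(step)

    def is_done(self) -> bool:
        """Return True once every step has been completed."""
        return len(self.completed) == len(self.steps)


# Hypothetical steps, for illustration only.
incident = Checklist(
    name="Example incident response",
    steps=[
        "Acknowledge the alert",
        "Open an incident ticket",
        "Post a status update",
    ],
)
incident.complete("Acknowledge the alert")
print(incident.next_step())
```

Running this prints the second step, "Open an incident ticket". Trying to complete "Post a status update" at that point raises a `ValueError`, which is the property a written checklist is meant to give a tired responder.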

Checklists at Server Density

Here is one of our own checklists. It defines what our on-call first responders do when a critical incident occurs (we also wrote a guide on how we handle incidents, outages and downtime):

Incident Response Checklist

As “common sense” and obvious as the steps may be, they carry great importance for the health of our infrastructure. So we spell them out.

As you would expect, we take our uptime metrics seriously. We’ve got some pretty capable folks taking care of our servers. And we use checklists. Very prescriptive ones like the one below. This one details the steps our on-call people follow when faced with a server failover:

Load Balancer Checklist

We sed/awk/grep all day long. Our checklists assume we do it for the very first time. At 2:00 in the morning we might have trouble finding the lights, let alone the Puppet master for our configuration.

Here is another example. We use this checklist when a server we monitor stops sending data:

No Data Checklist

DevOps checklists are as unique as the teams that use them. Each team has its own recipe for doing things and, as technology stacks evolve, so do the checklists required to run them.

We aim to have a checklist for every scenario: from restoring a backup to production, deploying fixes, switching primary data centres, and database consistency checks, to responding to traffic spikes, security breaches, critical alerts, and a long list of other contingencies.

Google Docs is a key part of our on-call playbook, so that’s where we store our checklists for everyone to access and update as needed.

Checklists are not static

Relying on checklists does not mean we are inflexible about how we do things. For us, creating checklists is an excellent opportunity to take a step back and review the entirety of our stack.

DevOps checklists work best when we schedule time to update and improve on them.
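One way to keep that scheduled review honest is to flag checklists that have gone too long without one. The Python sketch below is only an illustration: the `stale_checklists` helper, the 90-day threshold and the review dates are all assumptions, and only the checklist names come from the examples above.

```python
from datetime import date, timedelta
from typing import Dict, List


def stale_checklists(
    last_reviewed: Dict[str, date],
    today: date,
    max_age_days: int = 90,  # assumed review interval, not a stated policy
) -> List[str]:
    """Return the names of checklists not reviewed within max_age_days, sorted by name."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(
        name for name, reviewed in last_reviewed.items() if reviewed < cutoff
    )


# Hypothetical review dates, for illustration only.
reviews = {
    "Incident Response": date(2015, 7, 1),
    "Load Balancer": date(2015, 2, 1),
    "No Data": date(2015, 3, 15),
}
print(stale_checklists(reviews, today=date(2015, 8, 11)))
```

With these made-up dates the script lists "Load Balancer" and "No Data" as overdue for review. The same check could run against wherever the review dates are actually recorded.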

Summary

When it comes to server monitoring, we believe checklists are an important step towards reliable systems. They help our team respond to issues in a consistent and timely manner. This translates to increased uptime and a better capacity to serve our customers.

What about you? Are you using checklists?

  • bulgarion

    “We sed/awk/grep all day long. Our checklists assume we do it for the very first time.”
    This is some serious advice, and I completely second it. I’m currently rewriting a large chunk of technical documentation on this very principle, and I’m sure that whoever reads it will be able to understand it, no matter their technical level.

    • Max Zahariadis

      Thanks bulgarion. Do let us know how the rewrite process turns out. Are you using the checklist format?

  • Pradip Shah

    Thanks for this – I have been a fan of checklists and “memory is the biggest enemy” in our devops team, but sometimes I wondered if I was out of touch with reality.

    The process we have actually goes one step further. For all production servers (we offer a production support service for our eCommerce customers), we have an Excel sheet that acts as an audit trail. Each step is documented as having been executed. We have only occasionally needed the audit part, but the fact that the Excel sheet has to be updated ensures the checklist is looked at.

    • This is a good idea. We maintain an audit trail when responding to incidents by copying actions into a JIRA ticket. That way we have a log of who did what and when, but tying it directly into the checklist is a good idea.

  • I would like to see checklists get more love. DevOps focuses on cultural change and systems thinking as needed changes without providing workable tactics. That’s why I wrote a book sharing my story of migrating a 450 million user website to the cloud. Checklists worked well to create reliable systems while other approaches failed.

  • dennyzhang.com

    Yes, we do need to verify whatever we have changed.

    Besides manual verification, will you build up a test library or common scripts? That way we can automate the verification process itself as much as possible.

    Chef Audit mode can fix part of the problem. I’m not very comfortable with that, so we use serverspec to build up a test library.

    What’s your experience? Thanks!
