HumanOps: Making Operations Human


By David Mytton,
CEO & Founder of Server Density.

Published on the 12th May, 2016.

What is the number one sysadmin skill?

The ability to problem solve, right? We’re not talking about sudoku and crosswords here. Errors and delays can cost millions. With scale comes complexity, and an exponential increase in things that could go south. In production. At four in the morning.

And here lies the challenge. Sysadmins are not superhumans. They are susceptible to stress and fatigue just like everybody else.

We know that prolonged stress is detrimental to health. We also know that fatigue impairs our ability to solve even basic problems. A diminished problem-solving capacity may not pose a problem in jobs governed by the traditional metrics of productivity, i.e. output per hour. But for jobs that require ideas and innovative solutions, productivity is a rather poor measure of success.

It’s hard to shoehorn some of the most important things we do in life into the category of “being productive.”

Kevin Kelly, The Post Productive Economy

Sysadmin teams consist of problem solving humans. And the pressing question is, how can those teams reach and sustain their potential?

We, too, have pondered this for years.

As more engineers joined the Server Density team, we’ve been reassessing how we handle production incidents, how we escalate issues, how we distribute our on-call workload, how we collaborate, and how we learn from failure. All those efforts were aimed at the same goal: to nurture the problem-solving capacity of the humans behind systems.

How do we minimise interruptions? How do we safeguard downtime and renewal? How do we minimise stress and fatigue? How do we build software that is more in line with how the human brain works?

Enter HumanOps

HumanOps is a collection of principles that address those questions. It shifts our focus away from systems and towards humans. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.

At Server Density we’ve observed a strong correlation between human and system metrics. Reduced stress leads to fewer errors and escalations. Reduction in incidents and alerts leads to better sleep and reduced stress. Better sleep leads to better time-to-resolution metrics.

What’s the average number of interruptions and wake-ups our engineers experience per month? How many late shifts and weekend calls do they get?

As software makers, we have significant opportunity and responsibility here. How do you spot issues before they cause downtime? How do you reduce incidents and mitigate stress? How do you present this data in a more intuitive way?

Here is a wireframe for an upcoming Server Density feature called alert history. Notice the Cost column? It measures the cost of incidents in actual human hours.

HumanOps - alert history
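To make the idea of a Cost column concrete, here is a minimal sketch of how the human cost of alerts could be tallied. It is not Server Density's implementation: the `Alert` fields, the `responders` count, and the 09:00 to 18:00 working-hours window are all assumptions made for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Alert:
    # Hypothetical record shape, not an actual Server Density schema.
    name: str
    opened: datetime
    resolved: datetime
    responders: int  # assumed: number of people pulled into the incident


def human_hours(alert: Alert) -> float:
    """Person-hours spent: incident duration multiplied by responders."""
    duration = alert.resolved - alert.opened
    return duration / timedelta(hours=1) * alert.responders


def is_out_of_hours(alert: Alert, start_hour: int = 9, end_hour: int = 18) -> bool:
    """True if the alert opened at a weekend or outside working hours."""
    weekend = alert.opened.weekday() >= 5
    in_office_hours = start_hour <= alert.opened.hour < end_hour
    return weekend or not in_office_hours


# Made-up sample data for the sketch.
alerts = [
    Alert("disk usage", datetime(2016, 5, 10, 4, 0), datetime(2016, 5, 10, 5, 30), 2),
    Alert("high load", datetime(2016, 5, 11, 14, 0), datetime(2016, 5, 11, 14, 45), 1),
]

total_cost = sum(human_hours(a) for a in alerts)    # person-hours across all alerts
wake_ups = sum(is_out_of_hours(a) for a in alerts)  # alerts that interrupted rest

print(f"Human cost: {total_cost:.2f} hours, out-of-hours alerts: {wake_ups}")
```

Tracking the two numbers separately is a deliberate choice in this sketch: an alert that costs few hours but lands at four in the morning still matters for the questions about wake-ups and weekend calls.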

Below is a preview of an upcoming feature for iOS, called sparklines. Sparklines condense full-blown charts into smaller inline expressions that illustrate trends. Sparklines are a perfect match for the iPhone because they offer visual cues about “what’s happening?”, allowing sysadmins to quickly decide whether to go home, or whether they can finish dinner before reaching for their laptop.

Human Ops iOS
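The way a sparkline condenses a chart can be sketched in a few lines. The example below is an illustration only, not the iOS feature itself: it scales a series of readings onto eight Unicode block characters, and the `sparkline` function name and the sample CPU values are assumptions.

```python
BARS = "▁▂▃▄▅▆▇█"  # eight block characters, lowest to highest


def sparkline(values):
    """Render a sequence of numbers as a one-line text sparkline."""
    if not values:
        return ""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A flat series has no trend to show; draw it at the baseline.
        return BARS[0] * len(values)
    scale = len(BARS) - 1
    # Map each value from [lo, hi] onto an index into BARS.
    return "".join(BARS[round((v - lo) / span * scale)] for v in values)


# Made-up CPU load readings for the sketch.
cpu_load = [12, 15, 14, 30, 55, 80, 78, 40, 22, 18]
print(sparkline(cpu_load))
```

Because each series is scaled to its own minimum and maximum, the line shows the shape of a trend rather than absolute values, which is the trade-off that lets it fit on a phone screen.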

We will expand on this and many more HumanOps features in the near future. The important thing to remember is that HumanOps features build bridges between systems and humans, and present information in a way that is easy for humans to pick up at a glance.

Humans on call

Another key area of focus for HumanOps is on-call work.

The anxiety associated with being available out of hours stems from the lack of control. It doesn’t matter if the phone rings or not. Being on-call and not being called is, in fact, more stressful than a “busy” shift. It is this non-stop vigilance, having to keep checking for possible “threats”, that is unhealthy.

How do you restore the feeling of control? How do you measure and track the human cost of out-of-hours incidents and escalations? All those considerations fall squarely under the HumanOps agenda.

We want to hear from you

HumanOps is a collection of questions, principles and ideas aimed at improving the life of sysadmins.

A challenge like this could never be tackled by one engineer, team, or company on their own. So we couldn’t be more excited about having Spotify, Barclays, Yelp, and M&S join HumanOps. And even more teams are contributing their insights here.

If you happen to be in London on May 19th we’d love to see you at our very first HumanOps meetup, with more worldwide events coming soon.

  • Cody Hatfield

    The ideas posited here are excellent. It kind of helps put the impact of worker fatigue into perspective, especially its relationship with system maintenance and “productivity.” But it is not just for sys admins I might add.

    • Max Zahariadis

      Good point Cody, and thanks for reading. Any specific examples you have in mind?
