Life on Call: Productivity and Wellbeing around the Clock
They hand you a portable gadget that makes Gordon Gekko’s cellphone look hip. You must carry it 24/7. Every hour, day and night, you have to switch it on and listen for a series of three-digit codes. If you hear your number, you race to the nearest telephone to find out where you’re needed. How does that sound?
Launched in 1950 by physicians in New York, the beeper marked a turning point: work no longer quite ended when you left the office.
In the early noughties, you needed north of a million dollars before your startup wrote its first line of code.
All that money was used to buy things that we now get for free. That’s because much of our infrastructure has since become a commodity we don’t have to worry about.
Doesn’t that mean fewer moving parts to fix? And less need for on-call work? Probably not. Near-zero infrastructure costs mean near-absent barriers to entry. That means more competition. Less friction also means greater speed. Some call it streamlined, agile, elastic. Others call it frantic. The point is, features are often tested in production these days.
On-call work has always been about reacting and mitigating. A physician will not invent a cure while on-call; they can only treat symptoms. The same goes for DevOps teams. An engineer will not fix code while on-call. They will do something more blunt, like restart a server. And to restart a server they don’t walk to some back-of-house server room. Instead, they start an SSH session. If that doesn’t work, they raise a ticket. All from the (dis)comfort of their living room sofa.
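To make that concrete, here is a minimal sketch of what such a blunt fix might look like: restarting a service over SSH with Python’s paramiko library. The host, user, and service names are hypothetical, and this is an illustration of the idea rather than our actual tooling.

```python
# Minimal sketch of a blunt on-call fix: restart a service over SSH.
# Host, user, and service names below are hypothetical placeholders.
# Requires paramiko (pip install paramiko) and key-based auth already set up.
import paramiko

def restart_service(host: str, user: str, service: str) -> str:
    client = paramiko.SSHClient()
    client.load_system_host_keys()  # assumes the host key is already known
    client.connect(host, username=user)
    try:
        # Run the restart and capture both output streams for the engineer.
        _, stdout, stderr = client.exec_command(f"sudo service {service} restart")
        return stdout.read().decode() + stderr.read().decode()
    finally:
        client.close()

if __name__ == "__main__":
    print(restart_service("web-01.example.com", "oncall", "nginx"))
```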
DevOps and sysadmin teams have not exactly cornered the market for on-call work. Far from it. One in five EU employees actually works on-call.
Even if you think your work does not require on-call duties, think again. When was the last time you checked your work email? The occasional notification may sound harmless. It’s not unheard of, however, for emails to arrive past midnight, followed by text messages asking why they were not answered.
The anxiety of on-call work stems from the perceived lack of control. It doesn’t matter whether the phone rings or not. Being on-call and not being called is, in fact, more stressful than a “busy” shift, according to this article. It is this non-stop vigilance, having to keep checking for possible “threats”, that is unhealthy.
We are not here to demonize on-call. As engineers in an industry that requires this type of work, we just think it pays to be well informed of potential pitfalls.
Goodwill is Currency
When the alert strikes at stupid o’clock, chances are you’ll be fixing someone else’s problem. It’s broken. It’s not your fault. And yet here you are, in a dark living room, squatting on a hard futon like Gollum, cleaning up somebody else’s mess.
On-call work is a barometer of goodwill in a team. There is nothing revelatory about this: teamwork is essential.
What happens when you (or a family member) is not feeling well? What if you need to take two hours off? Who is going to cover for you if everyone is “unreachable” and “offline”?
The absence of goodwill makes on-call duty exponentially harder.
Focus on Quality
Better quality means lower incident rates. Over time, it nurtures confidence in our own systems: we stop expecting things to break as often. The fear of impending incidents subsides, and our on-call shifts get less frightful.
To encourage code resilience, it makes sense to expose everyone, including devs and designers, to production issues. In light of that, here at Server Density we run regular war games with everyone who participates in our on-call rotation. So when it comes to solving production alerts, our ops and dev teams share a similar knowledge base.
We also write and use simple checklists that spell things out for us. Every single step. As if we’ve never done this before. At 2:00 in the morning we might have trouble finding the lights, let alone the Puppet master for our configuration.
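One way to make a checklist like that hard to skip is to turn it into a tiny script that prints each step and waits for a keypress before moving on. The steps below are invented for illustration; a real checklist would spell out your team’s actual procedure.

```python
# A checklist as a tiny script: print each step, wait for confirmation.
# The steps are illustrative placeholders, not a real runbook.
STEPS = [
    "Acknowledge the alert in PagerDuty.",
    "Open the dashboard for the affected service.",
    "Check recent deploys and configuration changes.",
    "Apply the documented mitigation (e.g. restart the service).",
    "Confirm the alert has cleared, then note down what happened.",
]

def run_checklist(steps):
    for number, step in enumerate(steps, start=1):
        input(f"Step {number}: {step}  [press Enter when done] ")
    print("Checklist complete.")

if __name__ == "__main__":
    run_checklist(STEPS)
```

Even at 2:00 in the morning, the script decides what comes next, so you don’t have to.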
Our “Life On Call” Setup
Each engineer at Server Density picks their own gear. In general we use a phone to get the alerts and a tablet or laptop to resolve them.
For alerting we use PagerDuty. For multi-factor authentication we run Duo, and for collaboration we have HipChat. We also keep a Twitter client around to notify us when PagerDuty itself is not available (it doesn’t happen often).
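To illustrate the fallback idea (our own setup simply relies on that Twitter client), here is a minimal sketch of a watchdog that polls PagerDuty for reachability and pings a secondary channel after a few consecutive failures. The fallback webhook URL is a hypothetical placeholder for whatever backup channel a team actually uses.

```python
# Minimal watchdog sketch: check that PagerDuty is reachable and, after a few
# consecutive failures, raise the alarm through a secondary channel.
# FALLBACK_WEBHOOK is hypothetical; substitute your own backup channel.
import time
import requests

PAGERDUTY_URL = "https://api.pagerduty.com"      # reachability check only
FALLBACK_WEBHOOK = "https://example.com/notify"  # hypothetical backup channel
FAILURE_THRESHOLD = 3
CHECK_INTERVAL_SECONDS = 60

def is_reachable(url):
    try:
        # Any HTTP response (even a 401) proves the service is reachable.
        requests.get(url, timeout=5)
        return True
    except requests.RequestException:
        return False

def main():
    failures = 0
    while True:
        if is_reachable(PAGERDUTY_URL):
            failures = 0
        else:
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                requests.post(FALLBACK_WEBHOOK, json={
                    "text": "PagerDuty looks unreachable from here."
                })
                failures = 0
        time.sleep(CHECK_INTERVAL_SECONDS)

if __name__ == "__main__":
    main()
```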
Upon receiving an alert, we switch to a larger display in order to investigate and resolve it. 80% of our incidents can be dealt with using a tablet; all we need is an SSH client and a full browser. The tablet form factor is easier on the back and can be taken to more places than a laptop.
An overnight alert is like an auditory punch in the face. At least you’ve signed up for this, right? What about your partner? What have they done to deserve this?
To avoid straining relationships, it pays to be proactive. Where do you plan to be when on-call? Will you have reception? If an alert strikes, will you have 4G (preferably Wi-Fi) in order to resolve it? What about family obligations? Will you be that parent at the school event who sits in a corner hiding behind a laptop for two hours?
At best, working on-call is nothing to write home about. At worst, well, it kind of sucks.
Since it’s part of what we do, though, it pays to be well informed and prepared. Focusing on code resilience, nurturing teamwork, and setting the right expectations with colleagues and family are some of the ways we try to take the edge off it.
What about you? How do you approach on-call? What methodologies do you have in place?