
DevOps On-Call: How we Handle our Schedules

On Call Schedules

By David Mytton,
CEO & Founder of Server Density.

Published on the 17th September, 2015.

I was on call 24/7/365.

That’s right. When we first launched our server monitoring service in 2009, it was just me that knew how the systems worked. I was on-call all the time.

You know the drill. Code with one hand. Respond to production alerts with the other.

In time, and as more engineers joined the team, we became a bit more deliberate about how we handle production incidents. This post outlines the thinking behind our current escalation process and some details on how we distribute our on-call workload between team members.

DevOps On-Call

Developing our product and responding to production alerts when the product falters are two distinct yet very much intertwined activities. And so they should be.

If we were to insulate our development efforts from how things perform in production, our priorities would become disconnected. Code resilience is at least as important as building new features. In that respect, it makes great sense to expose our developers and designers to production issues. As proponents of DevOps, we think this is a good idea.

Wait, what?

But isn’t this counterproductive? The state of mind involved in writing code couldn’t be more different from that of responding to production alerts.

When an engineer (or anyone, for that matter) is working on a complex task, the worst thing you can do is expose them to random alerts. It takes more than 15 minutes, on average, to regain intense focus after being interrupted.

How we do it

It takes consideration and significant planning to get this productivity balance right. In fact, it’s an ongoing journey. Especially in small and growing teams like ours.

With that in mind, here is our current process for dealing with production alerts:

First Tier

During work hours, all alerts go to an operations engineer. This provides much-needed quiet time for our product team to do what they do best.

Outside work hours, alerts can go to anyone (ops and product alike). We rotate between team members every seven days. At our current size, each engineer gets one week on call and eight weeks off. Everyone gets a fair crack of the whip.
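
For illustration only (this isn't our actual tooling), a weekly rotation like this boils down to a simple modulo over the team list. The names and start date below are placeholders; one week on and eight weeks off implies a nine-person cycle.

    from datetime import date

    # Placeholder roster: one week on call, eight weeks off implies nine engineers.
    ENGINEERS = ["alice", "bob", "carol", "dave", "erin",
                 "frank", "grace", "heidi", "ivan"]
    ROTATION_START = date(2015, 9, 14)  # hypothetical Monday the rotation began

    def on_call_for(day: date) -> str:
        """Return who holds the out-of-hours pager for the week containing `day`."""
        weeks_elapsed = (day - ROTATION_START).days // 7
        return ENGINEERS[weeks_elapsed % len(ENGINEERS)]

    print(on_call_for(date(2015, 9, 17)))  # -> alice (week 0 of the cycle)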

Second Tier

An escalated issue will most probably involve production circumstances unrelated to code. For that reason, second-level on-call duty rotates between operations engineers only, as they have a deeper knowledge of our hardware and network infrastructure.

Our PagerDuty Setup

For escalations and scheduling we use PagerDuty. If an engineer doesn’t respond within the time limit (15 minutes of increasingly frequent notifications via SMS and phone), the alert escalates so there is always someone else available to take the call.
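
PagerDuty handles all of this for us, but as a rough, simplified sketch (the function names and retry interval are illustrative, not our exact PagerDuty configuration), the escalation flow looks something like this:

    import time

    ESCALATION_TIMEOUT = 15 * 60   # seconds before the alert moves to the second tier
    RETRY_INTERVAL = 3 * 60        # illustrative gap between repeat notifications

    def notify(engineer: str, alert: str) -> None:
        # Stand-in for an SMS or phone call.
        print(f"notifying {engineer}: {alert}")

    def page(alert: str, first_tier: str, second_tier: str, acknowledged) -> str:
        """Keep notifying the first-tier responder; escalate after 15 minutes without an ack."""
        deadline = time.time() + ESCALATION_TIMEOUT
        while time.time() < deadline:
            notify(first_tier, alert)
            if acknowledged():
                return first_tier
            time.sleep(RETRY_INTERVAL)
        notify(second_tier, alert)  # no response in time: someone else takes the call
        return second_tier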

Our ops engineers are responsible for any manual schedule overrides. If someone is ill, on holiday, or travelling, we ask for volunteers to rearrange the on-call slots.


After an out-of-hours alert, the responder gets the following 24 hours off from on-call. This helps with the social and health implications of being woken up multiple nights in a row.
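
Continuing the hypothetical rotation sketch above (it reuses the on_call_for helper defined there), both kinds of override can be modelled roughly as explicit exceptions that win over the regular schedule. Again, PagerDuty does this for us; the dates and names below are made up.

    from datetime import date

    # Hypothetical overrides: day -> engineer exceptions that take precedence over
    # the weekly rotation, e.g. covering for illness or giving last night's
    # responder the following 24 hours off.
    OVERRIDES = {
        date(2015, 9, 18): "bob",  # placeholder: alice was woken overnight, bob covers today
    }

    def effective_on_call(day: date) -> str:
        """Overrides win; otherwise fall back to the regular weekly rotation."""
        return OVERRIDES.get(day, on_call_for(day))  # on_call_for: see earlier sketch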

Keeping Everyone in the Loop

We hold weekly meetings with operations and product leads. The intention is to keep everyone on top of the overall product health, and to help our product team prioritise their development efforts.

While I don’t have on-call duties anymore (client meetings and frequent travel just don’t mix with being on call), I can still monitor our alerts on the Server Density mobile app (Android and iOS), which has a prominent place on my home screen together with apps from the other monitoring tools we use.

What about you? How do you handle your devops on-call schedules?

  • sandipb

    That is interesting. Your off-hour primary oncall is cycled between ops *and* dev? While the secondary oncall is operations only? Unless your product guys have all the operational context, doesn’t that cause most of the off-hour alerts to get escalated to operations?

    In my team, there are separate oncall schedules for ops and dev. The ops oncall refers to the dev oncall only when the outage requires code-level knowledge to remediate.

    • We run war games with everyone who participates in the on-call rotation so that everyone knows how to solve as many different scenarios as possible. That means when it comes to solving alerts, ops and dev have a very similar knowledge base. Checklists also help with this: see https://blog.serverdensity.com/how-and-why-we-use-devops-checklists/

      Our ops team deals with the day-to-day of running our infrastructure, e.g. maintenance, setting up new systems, and providing debugging support, whereas the dev day-to-day is product development. But when it comes to on-call and responding to alerts, unless it’s a very complex issue, we want anyone on the team to be able to deal with it.

    • Max Zahariadis

      That’s an interesting approach, sandipb. Do you have meetings between ops and dev, and how do you compare notes between the two? Also, do you see any overhead in maintaining multiple schedules?
