DevOps On-Call: How we Handle our Schedules
CEO & Founder of Server Density.
Published on the 17th September, 2015.
I was on call 24/7/365.
That’s right. When we first launched our server monitoring service in 2009, it was just me that knew how the systems worked. I was on-call all the time.
You know the drill. Code with one hand. Respond to production alerts with the other.
In time, and as more engineers joined the team, we became a bit more deliberate about how we handle production incidents. This post outlines the thinking behind our current escalation process and some details on how we distribute our on-call workload between team members.
Developing our product and responding to production alerts when the product falters, are two distinct yet very much intertwined activities. And so they should be.
If we were to insulate our development efforts from how things perform in production, our priorities would get disconnected. Code resilience is at least as important as building new features. In that respect, it makes great sense to expose our developers and designers to production issues. As proponents of DevOps we think this is a good idea.
But isn’t this counterproductive? The state of mind involved in writing code couldn’t be more different from that of responding to production alerts.
When an engineer (and anyone for that matter) is working on a complex task the worst thing you can do is expose them to random alerts. It takes more than 15 minutes, on average, to regain intense focus after being interrupted.
How we do it
It takes consideration and significant planning to get this productivity balance right. In fact, it’s an ongoing journey. Especially in small and growing teams like ours.
With that in mind, here is our current process for dealing with production alerts:
During work hours, all alerts go to an operations engineer. This provides a much needed quiet time for our product team to do what they do best.
Outside work hours alerts could could go to anyone (ops or product alike). We rotate between team members every seven days. At our current size, each engineer gets one week on call and eight weeks off. Everyone gets a fair crack of the whip.
An escalated issue will most probably involve production circumstances unrelated to code. For that reason, second level on-call duty rotates between operation engineers only, as they have a deeper knowledge of our hardware and network infrastructure.
Our PagerDuty Setup
For escalations and scheduling we use PagerDuty. If an engineer doesn’t respond within the time limit (15 minutes of increasingly frequent notifications via SMS and phone) there will always be someone else available to take the call.
Our ops engineers are responsible for dealing with any manual schedule overrides. If someone is ill, on holiday or is traveling then we ask for volunteers to rearrange the on-call slots.
After an out-of-hours alert the responder gets the following 24 hours off from on-call. This helps with the social/health implications of being woken up multiple nights in a row.
Keeping Everyone in the Loop
We hold weekly meetings with operations and product leads. The intention is to keep everyone on top of the overall product health, and to help our product team prioritise their development efforts.
While I don’t have on-call duties anymore (client meetings and frequent travel while on call just doesn’t make sense any more) I can still monitor our alerts on the Server Density mobile app (Android and iOS) which has a prominent place on my home screen together with apps from the other monitoring tools we use.
What about you? How do you handle your devops on-call schedules?