How Spotify and GOV.UK handle on call, and more
CEO & Founder of Server Density.
Published on the 14th June, 2016.
How do you measure and track the human cost of out-of-hours incidents? How do you keep your systems running around the clock without affecting the health of the teams behind those systems?
On May 19th, in our quest to address those questions, we sponsored the very first HumanOps Meetup here in London.
It was a great debate.
Francesc Zacarias, Site Reliability Engineer at Spotify, and Bob Walker, Head of Web Operations at GDS GOV.UK, spoke about their on-call approaches and how they evolved over time.
Here is a brief summary:
Spotify On Call
According to Francesc, Spotify Engineering is a cross-functional organisation: each engineering team includes members from disparate functions, which means each team can own the service it runs in its entirety.
Spotify is growing fast. From 100 services running on 1,300 servers in 2011, they now have 1,400 services on 10,000 servers.
In the past, the Spotify Ops team was responsible for hundreds of services. Given how small their team was (a handful of engineers) and how quickly new services were appearing, their Ops team was turning into a bottleneck for the entire organisation.
While every member of the Ops team was an expert in their own specific area, there was no sharing between Ops engineers, or across the rest of the engineering organisation.
You were paged on a service you didn’t know existed because someone deployed and forgot to tell you.
Francesc Zacarias, Spotify Engineering
With only a handful of people on call for the entire company, the Ops team were getting close to burnout. So Spotify decided to adopt a different strategy.
Redistribution of Ownership
Under the new Spotify structure, developers now own their services. In true devops fashion, building something is no longer separate from running it. Developers control the entire lifecycle, including operational tasks like backup, monitoring and, of course, on call rotation.
This change required a significant cultural shift. Several folks were sceptical about this change, while others braced themselves for unmitigated disaster.
Plenty of times I reviewed changes that if we hadn’t stopped, would have caused major outages.
Francesc Zacarias, Spotify Engineering
In most instances, however, it was a case of “trust but verify.” Everyone had to trust their colleagues; otherwise the new structure wouldn’t take off.
Now both teams move faster.
Developers are not blocked by operations because they handle all incidents pertaining to their own services. They are more aware of the pitfalls of running code in production because they are the ones handling production incidents (waking up to alerts, et cetera).
They are also incentivised to put sufficient measures in place. Things like monitoring (metrics and alerts), logging, maintenance (updating and repairing) and scalability are now key considerations behind every line of code they write.
In the event of an incident that touches multiple teams, the issue is manually escalated to the Incident Manager On Call, aka IMOC (other companies call this role “Incident Commander”). The IMOC engineer is then responsible for: i) key decisions, ii) communication between teams, and iii) authoring status updates.
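That routing rule can be sketched in a few lines. This is a minimal illustration only; the incident and team names are invented, and Spotify's actual tooling was not described in the talk:

```python
from dataclasses import dataclass, field

@dataclass
class Incident:
    description: str
    affected_teams: set = field(default_factory=set)

def responder_for(incident):
    """Route a single-team incident to that team's own on-call engineer;
    escalate anything spanning multiple teams to the IMOC."""
    if len(incident.affected_teams) == 1:
        return f"on-call:{next(iter(incident.affected_teams))}"
    return "IMOC"
```

For example, an incident tagged with only the (hypothetical) `playlists` team routes to that team's pager, while one tagged with both `playlists` and `search` escalates to the IMOC.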
IMOC remains in the loop until the incident is resolved and a blameless post mortem is authored and published.
By the way, Spotify has adopted what they refer to as a “follow the sun” on-call rotation. At the end of a 12-hour shift, the Stockholm team hands the pager over to their New York colleagues.
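The handover logic behind follow-the-sun is simple enough to sketch. The exact 12-hour boundaries below are an assumption for illustration, not Spotify's published schedule:

```python
def team_on_call(utc_hour):
    """Return which site holds the pager for a given UTC hour,
    assuming (hypothetically) Stockholm covers 06:00-18:00 UTC
    and New York covers the remaining 12 hours."""
    return "Stockholm" if 6 <= utc_hour < 18 else "New York"
```

The point of the pattern is that each site is only ever paged during its own daytime, so nobody is woken at 3 a.m. for routine incidents.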
GOV.UK On Call
GOV.UK is the UK government’s digital portal. Bob Walker, Head of Web Operations, spoke about their recent efforts to reduce the number of incidents that lead to alerts.
After extensive rationalisation, they’ve now reached a stage where only six types of incident can alert (i.e. wake someone up) out of hours. The rest can wait until the next morning.
Their on-call strategy is split into two lines.
Primary support is on call during work hours: two members of staff deal with alerts, incidents and any urgent requests. The rotation comprises 28 full-time employees. Most of them start on primary support until they have upskilled enough to graduate to second-line support. Second-line support is nine engineers strong, and they are on call out of hours.
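One simple way to schedule a two-person rotation over a pool like that is a rolling pairing. The policy below is an assumption for illustration; the talk didn't describe GOV.UK's actual rota tooling:

```python
def primary_pair(engineers, week):
    """Pick the two primary-support engineers for a given week,
    stepping through the pool two at a time so that everyone
    serves equally often before the cycle repeats."""
    n = len(engineers)
    first = (2 * week) % n
    return engineers[first], engineers[(first + 1) % n]
```

With a pool of 28, each engineer is on primary support one week in every fourteen.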
GOV.UK mirrors their website across disparate geographical locations and operates a managed CDN in front of them. As a result, even if parts of their infrastructure fail, most of the website should remain available.
Once issues are resolved, GOV.UK carries out incident reviews (their own flavour of post mortems). Reiterating the importance of blameless post mortems, Bob said: “you can blame procedures and code, but not humans.”
Government Digital Service: It’s OK to say what’s OK
By the way, every Wednesday at 11:00 they test their paging system. The purpose of this exercise is not only to test the monitoring system but also to ensure people have configured their phones to receive alerts!
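A scheduled pager test like that is straightforward to automate with cron. The script path and flags below are placeholders, not GOV.UK's actual setup:

```shell
# Hypothetical crontab entry: fire a test page every Wednesday at 11:00.
# Fields: minute hour day-of-month month day-of-week (3 = Wednesday)
0 11 * * 3 /usr/local/bin/send-test-page --message "Weekly pager test"
```

Running it at a fixed, well-known time means a missing test page is itself a signal that something in the alerting pipeline is broken.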
One of the highlights of the recent HumanOps event was on-call work and how different companies approach it from their own unique perspectives.
There seem to be two overarching factors guiding on call strategy: i) the nature of the service offered and ii) organizational culture.
Do you run on microservices and have different teams owning different services? You could consider the Spotify approach. On the other hand, if you can simplify your service and convert most assets into static content on a CDN, then the GOV.UK strategy might make more sense.
While no one size fits all, successful ops teams seem to have the following things in common:
- They empower their people and foster a culture of trust.
- They tear down silos and cross-pollinate knowledge across teams.
- Their culture shapes their tools, not the other way around.
- They increase on-call coverage and reduce on-call assigned time.