Spotify Engineering: Making Ops Human
CEO & Founder of Server Density.
Published on the 20th October, 2016.
As it turns out, Server Density is not the only team out there getting excited about HumanOps. We recently wrote about Portia Tung from Barclays and all the exciting things she’s been working on.
Today we’d like to shift our gaze to Spotify and Francesc Zacarias, one of their lead site availability engineers.
What follows is the key take-aways from his HumanOps talk. You can watch the entire video (scroll down) or download it in PDF format and read at your own time (see below article).
Spotify Engineering goes HumanOps
According to Francesc, Spotify Engineering is a cross-functional organisation. What this means is that each engineering team includes members from disparate functions. What this also means is that each team is able to fully own the service they run in its entirety.
Spotify is growing fast. From 100 services running on 1,300 servers in 2011, they now have 1400 services on 10,000 servers.
In the past, the Spotify Ops team was responsible for hundreds of services. Given how small their team was (a handful of engineers) and how quickly new services were appearing, their Ops team was turning into a bottleneck for the entire organisation.
While every member of the Ops team was an expert in their own specific area, there was no sharing between Ops engineers, or across the rest of the engineering organisation.
You were paged on a service you didn’t know existed because someone deployed and forgot to tell you.
Francesc Zacarias, Spotify Engineering
Under the new Spotify structure, developers now own their services. In true devops fashion, building something is no longer separate from running it. Developers control the entire lifecycle including operational tasks like backup, monitoring and, of course, on call rotation.
This change required a significant cultural shift. Several folks were sceptical about this change while others braced themselves for unmitigated disaster.
In most instances however it was a case of “trust but verify.” Everyone had to trust their colleagues, otherwise the new structure wouldn’t take off.
Now both teams move faster.
Operations are no longer blocking developers as the latter handle all incidents pertaining to their own services. They are more aware of the pitfalls of running code in production because they are the ones handling production incidents (waking up to alerts, et cetera).
Want to find out more? Check out the Spotify Labs engineering blog. And if you want to take the Spotify talk with you to read at your own pace, just use the download link below.