You can’t just have a sysadmin on call
Reading through the Azure cloud outage root cause analysis from the other week made me think about the problem they encountered, what was needed to fix it and the timeline of events. It looked something like this:
- 4:00PM PST, Feb 28 – Bug triggered by rollover to Feb 29 UTC
- 6:38PM PST, Feb 28 – Bug diagnosed
- 10:00PM PST, Feb 28 – Test and rollout plan ready
- 11:20PM PST, Feb 28 – Code ready to test
- 1:50AM PST, Feb 29 – Testing completed
- 2:11AM PST, Feb 29 – Rollout fix to one production cluster
- 5:23AM PST, Feb 29 – Full rollout completed
- (There was a secondary outage but this was limited to a small number of clusters so isn’t relevant for this post)
The root cause was a software bug:
When the GA creates the transfer certificate, it gives it a one year validity range. It uses midnight UTC of the current day as the valid-from date and one year from that date as the valid-to date. The leap day bug is that the GA calculated the valid-to date by simply taking the current date and adding one to its year. That meant that any GA that tried to create a transfer certificate on leap day set a valid-to date of February 29, 2013, an invalid date that caused the certificate creation to fail.
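To make the bug concrete, here is a minimal Python sketch of the same class of mistake. It is not the actual GA code, which the analysis does not show. The function names `naive_valid_to` and `safe_valid_to` are hypothetical, and falling back to February 28 is just one possible policy (March 1 would be another).

```python
from datetime import date


def naive_valid_to(valid_from: date) -> date:
    """Add one to the year, as the analysis describes (hypothetical name).

    Raises ValueError when valid_from is Feb 29, because the following
    year has no Feb 29.
    """
    return valid_from.replace(year=valid_from.year + 1)


def safe_valid_to(valid_from: date) -> date:
    """Same calculation, but with an explicit leap day fallback.

    Assumption: clamping to Feb 28 is acceptable for the certificate.
    """
    try:
        return valid_from.replace(year=valid_from.year + 1)
    except ValueError:
        # Only Feb 29 can fail here; use Feb 28 of the following year.
        return valid_from.replace(year=valid_from.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    try:
        naive_valid_to(leap_day)
    except ValueError as exc:
        print("naive calculation failed:", exc)
    print("safe calculation:", safe_valid_to(leap_day))
```

The naive version works on every other day of the year, which is why this kind of bug can sit unnoticed until a leap day arrives.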
The key thing to note here is how long diagnosis took, and that code had to be written to fix the problem. The bug was triggered at 4:00PM PST, the code was not ready to test until 11:20PM (more than seven hours later), and the fix was not tested and fully deployed until 5:23AM the following day, over 13 hours after the outage began.
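For anyone who wants to check the elapsed times, here is a small sketch that computes them from the timeline above. The year 2012 is inferred from the February 29, 2013 valid-to date in the quoted analysis, and the names `TIMELINE` and `elapsed_since_start` are my own.

```python
from datetime import datetime

# Timestamps copied from the timeline above (all PST; year inferred as 2012).
TIMELINE = [
    ("Bug triggered", datetime(2012, 2, 28, 16, 0)),
    ("Bug diagnosed", datetime(2012, 2, 28, 18, 38)),
    ("Test and rollout plan ready", datetime(2012, 2, 28, 22, 0)),
    ("Code ready to test", datetime(2012, 2, 28, 23, 20)),
    ("Testing completed", datetime(2012, 2, 29, 1, 50)),
    ("Fix rolled out to one production cluster", datetime(2012, 2, 29, 2, 11)),
    ("Full rollout completed", datetime(2012, 2, 29, 5, 23)),
]


def elapsed_since_start(timeline):
    """Return (label, timedelta since the first event) for each event."""
    start = timeline[0][1]
    return [(label, when - start) for label, when in timeline]


for label, delta in elapsed_since_start(TIMELINE):
    minutes = int(delta.total_seconds() // 60)
    print(f"{minutes // 60:2d}h {minutes % 60:02d}m  {label}")
```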
If hardware fails, or a component fails in a known way with a known fix, a sysadmin can realistically be expected to fix the problem on their own (or within a team of ops engineers). However, as systems become more complex, an outage may well be caused by a software bug, which will require a response from product engineers. In a startup, just a few people will be responsible for many things (which presents a different problem: what if they are away or uncontactable?), but as companies and teams grow, you still need to ensure you have sufficient coverage for all your critical components.
The job of a sysadmin is to set up and maintain systems and to be the initial responder to outages. Actually resolving the problem may require involvement from engineering. Does your incident response plan include on-call product engineers?