Introducing Alert Costs2 Comments
When an engineer (and anyone for that matter) is working on a complex task the worst thing you can do is expose them to random alerts. It takes an average of 23 minutes and 15 seconds to regain intense focus after being interrupted.
The more noise your system generates the more inefficient and expensive it is. Problem is, given how busy most sysadmins are with various projects and daily issues, we rarely get the opportunity to pause and analyse what eats away at our time and attention the most.
Technology and infrastructure doesn’t feel fatigue, but its human operators do. Human fatigue is something qualitative, cumulative, and often imperceptible to others. When it comes to communicating about it, we’re held back by our inability to quantify it. And here lies the problem. If something cannot be measured, it’s harder to focus and improve upon. So we wanted to come up with an indicator of human fatigue.
Our ultimate goal is to raise awareness of its cost; of that human toll that we’d otherwise forget.
Introducing Alert Costs
Alert costs measures the impact of alerts (and incidents) in human hours. Armed with this knowledge, a sysadmin can look for ways to mitigate those types of alerts that create the most noise. It could be an alert is triggering more frequently than it should. Or it could be that when it does trigger, it takes a significant amount of time to resolve. This added clarity will allow sysadmins to reduce interruptions, mitigate alert fatigue, and improve everyone’s on-call shift.
How did we estimate the cost of an alert:
1.Amount of time an alert is open: This is the simple bit. If the alert triggered once and took 30 minutes to resolve, then the cost here will be 30 minutes. If it triggered 10 times, staying open for 30 minutes each time, then the overall cost will 30×10=300 mins.
But there is much more to it. Humans are not machines. You can’t redirect their attention between tasks like a mechanical switch. Context switching does not come free for humans, especially for tasks that require focus and problem solving, which is precisely what sysadmins do. It takes a significant amount of time for humans to regain intense focus on their task once they’ve been interrupted. This is not exact science and figures differ from person to person and time of day. Context Switch Penalty: We add 23 minutes for every time the alert triggered.
2.If an alert triggered 25 times and stayed open for 60 minutes, the overall cost for that particular alert would therefore amount to 635 minutes.
If you expand on the alert you can see a list of events, one for each time the alert triggered. This allows you to dive in the detail and examine the circumstances of each particular event. Based on this information you can then fine-tune the thresholds for this particular alert so you can reduce the amount of alerts we get by increasing the wait time, for example.
What is that 20% of alert types that trigger 80% of the time? How long do those alerts last and what does this mean for our productivity? You can then have interesting conversations with your team.
Alert Costs can reveal patterns that you may otherwise miss. For example, if a specific group of alerts triggers on a specific time each week, it could be due to a specific script. Extracting such insights is now much easier.
The alert costs table is entirely sortable. You can look and check the configuration to determine what fired most recently. Alternatively you can focus on duration, to see which alert was open for the longest time.
We wanted to make it immediately obvious what’s important. Our focus was on clarity and being concise. Which alerts are over the threshold for the longest time? Which alert manifests into events most often? Can we adjust its configuration and reduce the probability of waking our engineers up?
In the future, we want to do things like offer different (higher) cost values for night alerts versus day alerts. Which part of our tech stack is taking the greatest toll on our engineers? How do we improve on-call quality?
We will add sparklines and offer even more ways to slice in order to deduce (and reduce) the impact on your team.
HumanOps is a collection of principles that advance our focus away from systems, and towards humans. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.
Alert Costs is one such feature. The aim of alert costs is to find a balance on how we assess the performance of our systems and improve the wellbeing of our people.
If you haven’t done so already, create an account with Server Density using the form below, take alert costs for a spin and stay tuned for even more HumanOps features in the not too distant future.