5 Ways to Reduce Alerts for Better Productivity
Pedro Pessoa, Operations Engineer at Server Density.
Published on 3rd August 2016.
We’ve long been proponents of HumanOps—the idea that a top performing Ops team is a healthy, well rested Ops team. With that conviction in mind, we recently launched Alert Costs to help teams estimate the human impact of alert interruptions.
The rationale for Alert Costs was straightforward. When an engineer (or anyone, for that matter) is working on a complex task, the worst thing you can do is interrupt them with alerts. It takes more than 20 minutes to regain intense focus after being interrupted. Here is an example alert cost breakdown.
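As a rough sketch of that arithmetic (the numbers here are illustrative, not measurements from our platform), the cost of a batch of interruptions can be estimated like this:

```python
# Hypothetical alert cost estimate: each interruption costs the time spent
# handling the alert plus the ~20 minutes needed to regain deep focus.
REFOCUS_MINUTES = 20

def alert_cost_minutes(interruptions, handling_minutes=5):
    """Total minutes of focused work lost to a batch of alert interruptions."""
    return interruptions * (handling_minutes + REFOCUS_MINUTES)

# An engineer paged 8 times in a day loses over 3 hours of focused work.
print(alert_cost_minutes(8))  # 200 minutes
```

Even with conservative per-alert handling times, the refocus overhead dominates, which is why reducing the *count* of alerts matters so much.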
The cost is even greater when alerts arrive out of hours. Quality of sleep correlates with health and productivity, while sleep deprivation leads to stress, irritability, and cognitive impairment. Each and every interruption is bad news for our sleep quality and how productive we are the following day.
There really isn’t a “good” time to receive alerts.
The more noise your system generates the more inefficient and expensive it is. And given how hectic the typical sysadmin day is, we rarely get to pause and take stock of what eats away at our time, attention, and energy.
To counter that, we schedule regular alert audits: dedicated time to figure out which alerts to keep, tweak, or get rid of.
So here are 5 ways we use those audits to reduce alerts here at Server Density.
1. Remove duplicate alerts
An alert audit might surface duplicate alerts. Not necessarily identical alerts, but redundant ones nonetheless. Some of them might be leftovers from temporary system changes that have long been reverted without removing their associated alert.
There is no quick fix or smart solution here. We need to manually go down the list and check we’re not alerted multiple times for the same condition.
For example, we recently discovered an old alert checking if the value was less than 1 request for a duration of 5 minutes. Concurrently, another alert was checking if the value was equal to zero. We wouldn’t have noticed this overlap without Alert Costs.
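An audit like this still needs human judgement, but surfacing candidate duplicates can be scripted. A minimal sketch (the alert records and fields here are hypothetical, not Server Density's API) groups configured alerts by device and metric and flags any metric with more than one condition:

```python
from collections import defaultdict

# Hypothetical alert configurations, modelled on the overlap described above:
# one alert for "< 1 request over 5 minutes", another for "== 0".
alerts = [
    {"device": "web1", "metric": "requests", "condition": "< 1 for 5m"},
    {"device": "web1", "metric": "requests", "condition": "== 0"},
    {"device": "web1", "metric": "disk",     "condition": "> 80%"},
]

def duplicate_candidates(alerts):
    """Group alerts by (device, metric); multiple conditions on one metric
    are candidates for manual review, not automatic deletion."""
    groups = defaultdict(list)
    for a in alerts:
        groups[(a["device"], a["metric"])].append(a["condition"])
    return {k: v for k, v in groups.items() if len(v) > 1}

print(duplicate_candidates(alerts))
# {('web1', 'requests'): ['< 1 for 5m', '== 0']}
```

The output is a short list to review by hand, which keeps the final keep-or-remove decision with the engineer.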
Alert Costs highlighted the cost of this alert: it wasn’t generating any incidents, but it was open in the UI for a considerable amount of time. So we decided to check all alerts for that device. As it turns out, we’d configured this particular alert a long time ago and had subsequently forgotten about it.
2. Tweak wait times and thresholds
A couple of definitions to start with:
Threshold: The metric limit that has to be crossed for the alert condition to be true. It might be okay for a certain value to rise for a couple of minutes. But if it persists for longer, i.e. when it’s no longer a spike but a trend, you might have a problem on your hands.
Wait time: The time a condition has to be true before you get an alert.
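Together, the two settings mean an alert only fires when a breach is sustained. Here is a minimal sketch of that behaviour (the class and its parameters are illustrative, not our product's API):

```python
class Alert:
    """Fires only when a metric stays past its threshold for the full wait time."""

    def __init__(self, threshold, wait_seconds, interval_seconds=60):
        self.threshold = threshold
        # Number of consecutive samples the condition must hold before firing.
        self.required = wait_seconds // interval_seconds
        self.breaches = 0

    def sample(self, value):
        """Feed one metric sample; return True when the alert should fire."""
        if value > self.threshold:
            self.breaches += 1
        else:
            self.breaches = 0  # a single healthy sample resets the wait clock
        return self.breaches >= self.required

# A brief spike doesn't fire a 5-minute alert; a sustained trend does.
alert = Alert(threshold=100, wait_seconds=300)
samples = [150, 150, 50, 150, 150, 150, 150, 150]
print([alert.sample(v) for v in samples])
# [False, False, False, False, False, False, False, True]
```

Note how the healthy sample in the middle resets the count: only the final run of five consecutive breaches triggers the alert.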
Not long ago we set up some backend payload processing servers and defined an alert for when their swap size grew beyond 150MB. A few months later, while reviewing our Alert Costs, we noticed an elevated cost for swap usage.
Demand for our service had increased and, as a result, usage for this server had also increased. This led to a slightly larger swap which, in turn, activated the alert threshold more often than before.
So we decided to raise the alert threshold to 200MB, which is higher than the peak value of 170MB and certainly within acceptable limits for a busy server.
3. Reduce manual intervention and automate with code
What we don’t want is alerts that trigger often and require manual intervention: logging into a server, running a set of commands, and so on. Ideally, we shouldn’t be doing anything by hand more than once. Figure out how something works, then automate it. Better yet, tackle the root problem that instigates the alert (see 5).
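As one example of the pattern, a recurring "disk filling up" alert that a human would normally resolve by logging in and deleting old logs can be handled by a scheduled job instead. This is a sketch under assumed names (`/var/log/myapp` and the 80% limit are hypothetical):

```python
import shutil
import subprocess

LOG_DIR = "/var/log/myapp"  # hypothetical application log directory
USAGE_LIMIT = 0.80          # clean up once usage passes 80%

def disk_usage_fraction(path="/"):
    """Fraction of the filesystem at `path` currently in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

def cleanup_if_needed():
    """Run from cron or an alert webhook instead of paging a human."""
    if disk_usage_fraction() > USAGE_LIMIT:
        # Delete rotated logs rather than waking someone up to do it.
        subprocess.run(["find", LOG_DIR, "-name", "*.log.1", "-delete"],
                       check=False)
        return True
    return False
```

Once the fix is codified, the alert itself can often be downgraded to a non-urgent notification that the cleanup ran.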
4. Use soft and hard limits
When you set alerts for certain metrics, it helps to set both a soft and a hard limit. For example, if disk usage goes over 80%, send a non-critical alert; only alert critically when usage goes over 95%. This way you reduce the overall cost of the alert.
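The two-tier idea can be expressed in a few lines. In this sketch (thresholds and severity names are illustrative), the soft limit routes to a non-urgent channel while only the hard limit pages someone:

```python
SOFT_LIMIT = 0.80  # warning: post to chat, no page
HARD_LIMIT = 0.95  # critical: page the on-call engineer

def classify(disk_usage):
    """Map a disk-usage fraction to an alert severity."""
    if disk_usage >= HARD_LIMIT:
        return "critical"
    if disk_usage >= SOFT_LIMIT:
        return "warning"
    return "ok"

print(classify(0.85))  # warning
print(classify(0.97))  # critical
```

Warnings can then be reviewed in working hours, and only genuine emergencies interrupt anyone's sleep.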
5. Fix the underlying problem
The ultimate goal of alerts is to raise awareness of underlying problems in our systems. Addressing the root cause is a high-leverage activity; superficial production fixes are cheaper now but more expensive in the long term.
Noisy systems are expensive systems. Alerts interrupt our workflow and disrupt our sleep.
The aim of this article, of course, is not to demonise alerts. We rely on them to know when something critical is about to fail. And yet, the fewer alerts we receive, the more energy we can pour into the truly urgent ones that need our attention.
The above list is how we reduce alerts here at Server Density. What about you? How do you figure out which alerts you can live without, and keep interruptions to a minimum?