5 Ways to Reduce Alerts for Better Productivity

Alert-Limiting

By Adrian Goll, Operations Engineer at Server Density.
Published on the 3rd August, 2016.

We’ve long been proponents of HumanOps—the idea that a top-performing Ops team is a healthy, well-rested Ops team. With that conviction in mind, we recently launched Alert Costs to help teams estimate the human impact of alert interruptions.

The rationale for Alert Costs was straightforward. When an engineer (or anyone, for that matter) is working on a complex task, the worst thing you can do is interrupt them with alerts. It takes more than 20 minutes to regain intense focus after an interruption. Here is an example alert cost breakdown.

[Figure: example alert cost breakdown]
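To make the idea concrete, here is a minimal sketch of how such a cost could be estimated from a list of alert timestamps. The 20-minute refocus figure comes from the paragraph above; the working-hours window, the out-of-hours multiplier, and the sample timestamps are illustrative assumptions, not how Alert Costs actually computes its numbers.

```python
from datetime import datetime

REFOCUS_MINUTES = 20      # time lost regaining deep focus after each interruption
NIGHT_MULTIPLIER = 2.0    # assumed extra weight for alerts that disturb sleep

def is_out_of_hours(ts):
    """Rough working-hours check: weekdays 08:00-19:00 count as 'in hours'."""
    return ts.weekday() >= 5 or ts.hour < 8 or ts.hour >= 19

def alert_cost_minutes(received_times):
    """Estimate the human cost of a batch of alerts from their timestamps."""
    total = 0.0
    for ts in received_times:
        cost = REFOCUS_MINUTES
        if is_out_of_hours(ts):
            cost *= NIGHT_MULTIPLIER   # a woken engineer pays more than lost focus
        total += cost
    return total

alerts = [datetime(2016, 8, 1, 14, 5), datetime(2016, 8, 2, 3, 40)]
print(f"Estimated cost: {alert_cost_minutes(alerts):.0f} minutes")
```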

The cost is even greater when alerts arrive out of hours. Quality of sleep correlates with health and productivity, while sleep deprivation leads to stress, irritability, and cognitive impairment. Each and every interruption is bad news for our sleep quality and how productive we are the following day.

There really isn’t a “good” time to receive alerts.

The more noise your system generates, the more inefficient and expensive it is. And given how hectic the typical sysadmin day is, we rarely get to pause and take stock of what eats away at our time, attention, and energy.

To counter that, we schedule regular alert audits, i.e. we set aside time to figure out which alerts to keep, which to tweak, and which to get rid of.

So here are 5 ways we use those audits to reduce alerts here at Server Density.

1. Remove duplicate alerts

An alert audit might surface duplicate alerts. Not necessarily identical alerts, but redundant ones nonetheless. Some of them might be leftovers from temporary system changes that have long been reverted without removing their associated alert.

There is no quick fix or smart solution here. We need to manually go down the list and check we’re not alerted multiple times for the same condition.

For example, we recently discovered an old alert that checked whether the value was less than 1 request for a duration of 5 minutes. Concurrently, another alert checked whether the value was equal to zero. We wouldn’t have noticed this overlap without Alert Costs.

[Figures: the two overlapping alert configurations]

Alert Costs highlighted the cost of this alert (it wasn’t generating any incidents, but it was open on the UI for a considerable amount of time).

So we decided to check all alerts for that device. As it turns out we’d configured this particular alert a long time ago and subsequently forgot about it.
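There’s no substitute for that manual pass, but a small script can make overlaps easier to spot. The sketch below assumes a hypothetical list of alert definitions (not the actual Server Density API) and simply groups them by device and metric, so that near-duplicate conditions, like the “less than 1 request” and “equal to zero” pair above, end up printed side by side.

```python
from collections import defaultdict

# Hypothetical alert definitions gathered from a monitoring config for review.
alerts = [
    {"device": "payload-1", "metric": "requests", "condition": "< 1",   "wait": "5m"},
    {"device": "payload-1", "metric": "requests", "condition": "== 0",  "wait": "5m"},
    {"device": "payload-1", "metric": "swap_mb",  "condition": "> 150", "wait": "5m"},
]

# Group alerts per device/metric pair so redundant conditions sit next to each other.
grouped = defaultdict(list)
for alert in alerts:
    grouped[(alert["device"], alert["metric"])].append(alert)

for (device, metric), configured in grouped.items():
    if len(configured) > 1:
        print(f"{device}/{metric} has {len(configured)} alerts - check for overlap:")
        for a in configured:
            print(f"  {a['condition']} for {a['wait']}")
```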

2. Tweak wait times and thresholds

A couple of definitions to start with:

Threshold: The metric limit that has to be crossed for the condition to be true. It might be okay for a certain value to rise for a couple of minutes. But if it persists for longer, i.e. when it’s no longer a spike but a trend, you might have a problem on your hands.

Wait time: The time a condition has to be true before you get an alert.
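To illustrate how the two interact, here is a minimal sketch of an evaluator that only fires once the threshold has been breached continuously for the whole wait period, so short spikes are ignored. The function, its parameters, and the sample data are illustrative, not the actual Server Density implementation.

```python
THRESHOLD_MB = 150        # alert if swap usage crosses this value...
WAIT_SECONDS = 5 * 60     # ...and stays above it for a full five minutes

def should_alert(samples, threshold=THRESHOLD_MB, wait=WAIT_SECONDS):
    """samples: list of (unix_timestamp, value) pairs, oldest first."""
    breach_started = None
    for ts, value in samples:
        if value > threshold:
            if breach_started is None:
                breach_started = ts        # start of a potential breach
            if ts - breach_started >= wait:
                return True                # a trend, not a spike: alert
        else:
            breach_started = None          # value recovered, reset the clock
    return False

# A three-minute spike stays quiet; a sustained breach fires.
spike = [(t, 160 if 60 <= t <= 240 else 100) for t in range(0, 600, 60)]
sustained = [(t, 160) for t in range(0, 600, 60)]
print(should_alert(spike), should_alert(sustained))   # False True
```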

Not long ago we set up some backend payload processing servers and defined an alert for when their swap size grew beyond 150MB. A few months later, while reviewing our Alert Costs, we noticed an elevated cost for swap usage.

[Figure: elevated Alert Cost for swap usage]

Demand for our service had increased and, as a result, usage for this server had also increased. This led to a slightly larger swap which, in turn, activated the alert threshold more often than before.

So we decided to raise the alert threshold to 200MB, which is higher than the peak value of 170MB and certainly within acceptable limits for a busy server.

[Figure: swap alert threshold raised to 200MB]

3. Reduce manual intervention and automate with code

What we don’t want is alerts that trigger too often and require manual intervention: tasks like logging into a server and running a set of commands. Ideally, we don’t want to be doing things more than once. Figure out how something works, and then automate it. Better yet, tackle the root problem that instigates the alert (see 5).
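As an example of what that automation can look like, the sketch below replaces a hypothetical “log in and clear old temp files” runbook step with a scheduled job that does the cleanup itself and only escalates when the cleanup isn’t enough. The paths, thresholds, and escalation step are assumptions for illustration, not a real part of our setup.

```python
import os
import shutil
import time

TMP_DIR = "/var/tmp/payload-cache"   # hypothetical directory that keeps filling up
MAX_AGE_SECONDS = 24 * 3600          # anything older than a day can go
ESCALATE_ABOVE = 0.90                # only involve a human if the disk is still >90% full

def cleanup_old_files(path, max_age):
    """Delete stale files: this is the step an engineer used to run by hand."""
    now = time.time()
    for name in os.listdir(path):
        full = os.path.join(path, name)
        if os.path.isfile(full) and now - os.path.getmtime(full) > max_age:
            os.remove(full)

def disk_usage_fraction(path):
    usage = shutil.disk_usage(path)
    return usage.used / usage.total

if __name__ == "__main__":
    cleanup_old_files(TMP_DIR, MAX_AGE_SECONDS)
    if disk_usage_fraction(TMP_DIR) > ESCALATE_ABOVE:
        # Only the exceptional case reaches a person; the routine case never alerts.
        print("Cleanup was not enough - escalating to on-call")
```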

4. Use soft and hard limits

When you set alerts for certain metrics, it helps to set both a soft and a hard limit. For example, if disk usage goes over 80%, send a non-critical alert. Only alert critically when usage goes over 95%. This way we reduce the overall cost of the alert.

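Here is a minimal sketch of that two-tier setup, using the 80% and 95% disk figures above; the severity names and what happens at each level are illustrative assumptions.

```python
def classify_disk_usage(percent_used):
    """Map a disk usage reading to an alert severity, or None for no alert."""
    if percent_used >= 95:
        return "critical"   # hard limit: page whoever is on call
    if percent_used >= 80:
        return "warning"    # soft limit: non-urgent, deal with it during working hours
    return None             # below both limits: stay quiet

for reading in (72, 84, 97):
    print(f"{reading}% -> {classify_disk_usage(reading)}")
```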

5. Fix the underlying problem

The ultimate goal of alerts is to raise awareness of underlying code problems. Addressing the root cause is a high-leverage activity compared to superficial production fixes, which cost more in the long term.

Summary

Noisy systems are expensive systems. Alerts interrupt our workflow and disrupt our sleep.

The aim of this article, of course, is not to demonise alerts. We rely on them to know when something critical is about to fail. And yet, the fewer alerts we receive, the more energy we can pour into the truly urgent ones that need our attention.

The above list is how we reduce alerts here at Server Density. What about you? How do you figure out what alerts you can live without and keep interruptions at a minimum?

Free eBook: 4 Steps to Successful DevOps

This eBook will show you how we i) hacked our on-call rotation to increase code resilience, ii) broke our infrastructure, on purpose, to debug quicker and increase uptime, and iii) borrowed practices from the healthcare and aviation industries to reduce complexity, stress, and fatigue. And speaking of stress and fatigue, we’ve devoted an entire chapter to how we placed humans at the centre of Ops, in order to increase their productivity and boost the uptime of the systems they manage. What are you waiting for? Download your free copy now.


  • Sir Vantes

    Server Density looks very nice.

    As a System Center Operations Manager Admin, my role was often to tailor alerts to be ‘actionable’, something that could be immediately seen as either needing attention or could be addressed later.

We crafted alerts and emails that named the server or service in the title, adjusted levels to reflect that 10% free on a 100GB drive is more important than 10% free on a 1TB drive, and generally did everything to make SCOM informative with the minimum of noise.

    Where I’m initially impressed with Server Density is the automated parsing of alerts to provide Admins a simple dashboard to see if issues are mounting to a point where intervention is required.

That alone puts your product above SCOM in usability.

    I will definitely be digging further into your Suite.

    Thanks

    • Max Zahariadis

      Thanks for reading, and great to hear!
