Redesigning Server Monitoring: Configuring Alerts
Richard Powell, Frontend Engineer at Server Density.
Published on the 16th April, 2015.
Server monitoring tools require interfaces for configuring alerts. But how many of these are truly intuitive? Alerts are an integral part of monitoring servers and successful trial customers tend to configure more alerts. Analytical data gathered through Preact, one of our awesome customers, proves this to be true:
That’s why we’ve just spent 2 months redesigning our alert configuration UI.
To succeed at this redesign we would need lots of user feedback and data. Fortunately through passive and active processes we gather feedback every day:
- Support tickets tell us what users are struggling with.
- Usability tests show us why users are struggling.
- Examining customer accounts tells us how users are using the features we offer.
- Analytics defines what actions are essential to customer success and where we need to reduce friction.
- Using Server Density on a daily basis reveals where we have usability problems or where features are lacking.
This new UI is the result of much research, thought, usability testing and iteration. In this post I’m going to share how our design process has led to a much more intuitive way to configure alerts. I’ll cover the problems we solved, how we solved them, what usability testing showed and how we improved our initial concepts.
Re-thinking wait and repeat.
Support tickets, usability testing and data from customer accounts showed that users weren’t using wait and repeat times correctly. Some users did not understand the iconography, others didn’t know what wait or repeat times meant.
In an alert configuration the wait time answers the question: Do you want to be notified immediately, or should the condition persist for a few minutes before you care? The repeat time answers the question: Do you want to be notified just once, or every X minutes until you fix the problem? These are essential parts of server monitoring software as they reduce alert notification spam and fatigue.
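As a rough sketch of how wait and repeat times interact, the logic could look something like the function below. The type and function names are hypothetical and times are simplified to plain minute counts; this illustrates the two questions above rather than how Server Density actually implements them.

```typescript
// Hypothetical types: none of these names come from the Server Density codebase.
interface AlertTiming {
  waitMinutes: number;   // how long the condition must persist before the first notification
  repeatMinutes: number; // interval between reminders; 0 means "notify just once"
}

// Decide whether a notification is due at `nowMinutes`.
// `conditionSinceMinutes` is when the condition first became true.
// `lastNotifiedMinutes` is when we last notified, or null if we never have.
function shouldNotify(
  timing: AlertTiming,
  conditionSinceMinutes: number,
  lastNotifiedMinutes: number | null,
  nowMinutes: number
): boolean {
  const persistedFor = nowMinutes - conditionSinceMinutes;
  if (persistedFor < timing.waitMinutes) {
    return false; // the condition has not persisted long enough yet
  }
  if (lastNotifiedMinutes === null) {
    return true; // first notification once the wait has elapsed
  }
  if (timing.repeatMinutes <= 0) {
    return false; // notify once only
  }
  return nowMinutes - lastNotifiedMinutes >= timing.repeatMinutes;
}
```

With a 2 minute wait and a 10 minute repeat, a condition that starts at minute 0 would first notify at minute 2 and then remind at minute 12.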
These concepts require quite a bit of server monitoring experience. As such, we decided the UI should explain these concepts well. We would use a full sentence, with keywords becoming calls to action:
This had several advantages:
- The required server monitoring knowledge is explained in full, making it easier to understand.
- No debatable iconography!
- There is more space available to explain edge cases, such as no data alerts.
Usability testing wait & repeat.
Tests showed that the new wait and repeat UI was intuitive and usable. The most useful test here was with a relatively inexperienced user. He suggested that he would have completed the test quicker if the UI was configured with a default wait. The screenshot below shows no wait time on the left vs a 2 minute wait time on the right:
The setting on the right shows more, but is it a good default? Customer data showed that default wait times would reduce alert spam, which is important for server monitoring software. New alert configs are now created with a default wait time of 2 minutes.
We also removed the ability to configure wait times using seconds. Customer data showed that it is not important to configure a wait time of 90 seconds vs 60 or 120 seconds. We removed this option which made the UI quicker to understand.
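A minimal sketch of these two decisions, assuming a hypothetical `resolveWaitMinutes` helper: the 2 minute default comes from the paragraph above, while the names and the rounding rule are assumptions made for illustration.

```typescript
// The 2 minute default is from the post; the names here are made up.
const DEFAULT_WAIT_MINUTES = 2;

interface NewAlertOptions {
  waitMinutes?: number;
}

// Returns the wait time for a new alert config, always in whole minutes.
function resolveWaitMinutes(options: NewAlertOptions = {}): number {
  // `??` keeps an explicit 0 ("notify immediately") instead of replacing it.
  const requested = options.waitMinutes ?? DEFAULT_WAIT_MINUTES;
  // Assumed rule: round to the nearest whole minute and never go negative.
  return Math.max(0, Math.round(requested));
}
```

Using `??` rather than `||` matters here: a user who deliberately chooses no wait should not be silently moved back to the default.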
Making it easier to find metrics.
Server Density can monitor many metrics out of the box and you can write custom plugins. Unfortunately usability testing showed that it was difficult to find metrics nested 3 levels deep.
The old UI made heavy use of a generic dropdown:
This had served us well, but it had usability problems:
- It takes a while to manually find an item nested 3 levels deep.
- Hovering or clicking incorrectly is unforgiving. You have to start again!
- The dropdown takes up quite a lot of space.
- It does not support keyboard controls.
In January we released the latest value widget which implemented a new “accordion dropdown”:
We decided to use this as it solved most of the problems with finding metrics. Usability testing had already shown that this component was easy to use and powerful. We also knew we could make it better through further testing.
Improving keyboard controls.
Developers love keyboard controls, so we spent a lot of time improving them. Feedback from 8 different usability tests led to over 20 separate improvements. These improvements created a UI that responds intuitively to keypresses, allowing alert configs to be created quickly. This wasn’t easy though, so I’ll share the key points that usability testing revealed:
- Use native elements where possible.
- Copy functionality and extend behaviour from native elements to match user expectations.
- Select elements respond to the space bar as well as the down, up, enter and escape keys, and they regain focus when you close them.
- Don’t miss a single interaction: The more keyboard controls you implement the more jarring it becomes when you miss one.
- Focus styles are incredibly important and are very inconsistent between browsers.
- One keypress is infinitely more usable than two.
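One way to avoid missing an interaction is to keep the key handling in a single pure function that can be unit tested key by key. The sketch below assumes a hypothetical custom dropdown with `open`, `highlighted` and `selected` state. It copies the general behaviour of a native select but is not our actual component.

```typescript
// Hypothetical state for a custom dropdown; not Server Density's real component.
interface SelectState {
  open: boolean;
  highlighted: number; // index of the option under the keyboard cursor
  selected: number;    // index of the committed option
}

// Pure reducer: takes the current state and a KeyboardEvent.key value and
// returns the next state. It never touches the DOM, so every key can be tested.
function handleKey(state: SelectState, key: string, optionCount: number): SelectState {
  const last = optionCount - 1;
  switch (key) {
    case " ":
    case "Enter":
      // Space or Enter opens a closed select; on an open one it commits the highlight.
      return state.open
        ? { open: false, highlighted: state.highlighted, selected: state.highlighted }
        : { ...state, open: true, highlighted: state.selected };
    case "ArrowDown":
      return { ...state, open: true, highlighted: Math.min(last, state.highlighted + 1) };
    case "ArrowUp":
      return { ...state, open: true, highlighted: Math.max(0, state.highlighted - 1) };
    case "Escape":
      // Close without committing; the highlight snaps back to the selection.
      return { ...state, open: false, highlighted: state.selected };
    default:
      return state; // unknown keys leave the state untouched
  }
}
```

Returning focus to the trigger element when the dropdown closes would live in the DOM layer that calls this function, for example by calling `focus()` on the trigger whenever `open` changes from true to false.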
Search is a powerful tool in the hands of an experienced user, but an inexperienced user won’t always use it correctly. To help inexperienced users in future we may start recording searches to see if we are matching user expectations. We may also explore a technology like Algolia (one of our customers) to build in tolerance for spelling mistakes or alternative naming conventions. Configuring alerts semi-automatically is another option.
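To give a feel for what tolerance for spelling mistakes could mean, here is a small sketch using Levenshtein edit distance. This is a stand-in technique chosen for illustration, not how Algolia works and not something we have built; `searchMetrics`, `maxTypos` and the metric names are all hypothetical.

```typescript
// Classic Levenshtein edit distance, computed with a single rolling row.
function editDistance(a: string, b: string): number {
  const row: number[] = [];
  for (let j = 0; j <= b.length; j++) row.push(j);
  for (let i = 1; i <= a.length; i++) {
    let diagonal = row[0]; // distance for the prefixes a[0..i-1) and b[0..j-1)
    row[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const above = row[j]; // value from the previous row, same column
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      row[j] = Math.min(above + 1, row[j - 1] + 1, diagonal + cost);
      diagonal = above;
    }
  }
  return row[b.length];
}

// Return metric names that contain the query, or that have a word
// within `maxTypos` edits of it.
function searchMetrics(query: string, metrics: string[], maxTypos = 1): string[] {
  const q = query.trim().toLowerCase();
  if (q === "") return metrics;
  return metrics.filter((name) => {
    const n = name.toLowerCase();
    if (n.indexOf(q) !== -1) return true; // plain substring match
    // Compare word by word so a misspelt "memmory" can still find "Memory usage".
    return n.split(/\s+/).some((word) => editDistance(q, word) <= maxTypos);
  });
}
```

Recording real searches first, as suggested above, would show whether one typo of tolerance is enough or whether alternative names (synonyms) matter more than spelling.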
We are thrilled with the results of the keyboard controls. It now takes just a few seconds to configure multiple alerts. Developers really like the effort we’ve put into this feature because it matches how they spend most of their time interacting with their computer.
Improving usability on small screens.
Server monitoring interfaces show quite a lot of information. Alert configurations can involve a process name, an HTML string, one field, three fields, four fields or five. When these fields get cramped usability suffers.
We needed to save space. One option we explored was to remove all visual tools like padding, margin, borders and shadows leaving just text which saved a great deal of space:
But there were usability concerns. Would the users know how to interact with this design? We would test this later.
We also spent time looking at customer accounts and talking to developers. This revealed that:
- It’s not important to know if an alert config is valid, only if it’s invalid due to a user error.
- It’s important to know that at least one action will be performed.
- It’s less important to know which actions will be performed.
- The wait, repeat and action settings can be hidden inside a single dropdown.
- It’s not important to see that an alert is turned on if it has validation errors.
All the designs we proposed saved space using these findings. Even so, we had to address the fact that alert configs should be able to wrap over multiple lines. The challenge was to do this without sacrificing readability.
The part of an alert config that is least likely to cause wrapping is the sentence constructed by the metric, the comparison and the value. For example disk usage is greater than 90%. Our old implementation interrupted this sentence with the subject: disk usage for /dev is greater than 90%. By moving the subject for /dev to the end of the sentence, wrapping would be less disruptive. We then indented the first element of a new line so that it did not blend with the alert config below:
To save more space, we only show the bare minimum whilst the user configures the alert. Talking to developers showed that they first expect to see the sentence “metric comparison value”:
Once the user configures the metric the rest of the alert configuration appears. The user can hover over the validation errors for more information:
Once the user finishes this process, the validation errors are replaced by the on/off toggle:
This saved space but it also made the experience of configuring alerts simpler as fields are progressively revealed. This allows users to focus on what’s important.
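The sentence ordering and the progressive reveal can be sketched as two small functions. The `AlertDraft` shape and the control names are hypothetical, made up to illustrate the behaviour described above rather than taken from our code.

```typescript
// Hypothetical shape of an alert config while it is being edited.
interface AlertDraft {
  metric?: string;     // e.g. "disk usage"
  comparison?: string; // e.g. "is greater than"
  value?: string;      // e.g. "90%"
  subject?: string;    // e.g. "/dev"
  errors: string[];    // validation errors caused by user input
}

// Build the sentence with the subject moved to the end, so that
// "metric comparison value" is never interrupted when the line wraps.
function describeAlert(draft: AlertDraft): string {
  const parts = [draft.metric, draft.comparison, draft.value];
  if (draft.subject) parts.push(`for ${draft.subject}`);
  return parts.filter((p) => p !== undefined && p !== "").join(" ");
}

type Control =
  | "metric" | "comparison" | "value"
  | "subject" | "settings" | "errors" | "toggle";

// Decide which controls to render for the current state of the draft.
function visibleControls(draft: AlertDraft): Control[] {
  // Start with the bare minimum: the "metric comparison value" sentence.
  const controls: Control[] = ["metric", "comparison", "value"];
  if (!draft.metric) return controls;
  // Once a metric is chosen, the rest of the configuration appears.
  // "settings" stands for the single dropdown holding wait, repeat and actions.
  controls.push("subject", "settings");
  // Validation errors are shown until resolved, then replaced by the on/off toggle.
  controls.push(draft.errors.length > 0 ? "errors" : "toggle");
  return controls;
}
```

For example, a draft with the metric "disk usage", the comparison "is greater than", the value "90%" and the subject "/dev" is described as "disk usage is greater than 90% for /dev".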
Usability testing the space savings.
In the end we decided against the text based approach:
With this approach, we were removing too many elements the user needs for interaction. Is this a dropdown or a text input? Where does this element end and that element start? We had tried solving these problems using fonts, font-weights and colours but the more traditional approach was clearly more usable:
We weren’t willing to sacrifice usability. It’s a very important part of monitoring servers and it’s key for us.
Another decision we took during user tests was to add loading and saved states. We had assumed that developers familiar with AJAX driven applications would not miss them. This was not true. Developers wanted reassurance that their changes were being saved, so we added loading and saved states.
Saving space was a difficult problem to solve. The data that devops teams pipe into server monitoring software is flexible and can be quite lengthy. In the end though I’m glad we spent so much time figuring this out. It made a big difference to the overall usability of the UI.
Making notification actions obvious.
Hypothesis one was that the old recipients icon was causing confusion:
The icon suggests “user” or “person” but not integrations. Each icon we proposed could be interpreted in multiple ways so we decided to use text instead.
Hypothesis two was that the old list was not intuitive because it contained both users and integrations. Would a developer intuitively place a user in the same list as a HipChat Channel? Perhaps not. At the very least support tickets showed one long list caused problems when 2 different integrations had the same name. We would solve this by showing multiple lists with headings, which the wireframe below shows:
We implemented this approach, and would user test it to be sure.
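The multiple lists with headings could be produced by a simple grouping step like the sketch below. The `Recipient` model and the `kind` values are hypothetical, used only to show how two integrations with the same name end up under different headings.

```typescript
// Hypothetical recipient model; the "kind" values are illustrative only.
interface Recipient {
  name: string;
  kind: string; // e.g. "Users", "HipChat", "Webhooks"
}

interface RecipientGroup {
  heading: string;
  names: string[];
}

// Split one flat list of recipients into one list per kind,
// keeping headings in the order they are first seen.
function groupRecipients(recipients: Recipient[]): RecipientGroup[] {
  // Object.create(null) avoids clashes with inherited keys such as "constructor".
  const byHeading: { [heading: string]: RecipientGroup } = Object.create(null);
  const groups: RecipientGroup[] = [];
  for (const r of recipients) {
    let group = byHeading[r.kind];
    if (group === undefined) {
      group = { heading: r.kind, names: [] };
      byHeading[r.kind] = group;
      groups.push(group);
    }
    group.names.push(r.name);
  }
  return groups;
}
```

Because the heading carries the integration type, a HipChat room and a webhook that are both called "ops" no longer look identical in the list.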
Realising you’ve made a mistake.
During development we suspected that our proposed approach was flawed. But we wanted proof before committing to a second redesign. A few weeks later feedback from usability testing gave us proof:
- It was difficult to find a specific user. We assumed users would use the search but many preferred to scroll.
- Users didn’t always realise they could scroll and it was frustrating to scroll when they did.
- The list was too long to tell which actions had been selected.
- It was too easy to close the dropdown accidentally.
- Users did not have enough context to say what would happen when an alert opened.
This is the wireframe of an approach we thought would fix these problems:
When we user tested this, the results were overwhelmingly positive. I’m thrilled that usability testing proved our initial designs to be flawed. How many server monitoring interfaces have these kinds of problems that are never fixed?
Pausing an alert.
Support tickets showed that a few users did not understand how to pause an alert. This is an important feature for server monitoring tools because it prevents alert notification spam.
Hypothesis one was that users did not think of “pausing” or “playing” alerts. Perhaps different terminology was needed:
Hypothesis two was that using a single play/pause button was confusing. Historically a Hi-Fi or cassette player would have separate buttons for play and pause and they would stay down when pressed. In our interface the button would change to a play icon. This functionality is ideally suited to a toggle because the alert is either on or off:
To test this approach we asked developers to describe everything they saw in the new UI. We also set tasks that avoided words like pause, play, stop, on, off. The results were pretty conclusive: users immediately understood the new approach. A key victory in the journey towards usable server monitoring software!
Finding open alerts.
Our own experience showed us that users have 2 reasons for going to the alert configuration screen:
- To add new alerts, which you generally only do once.
- To edit an alert, which you generally only do with an open alert.
The first reason was already satisfied, but not the second. The solution appeared simple: we would add an open marker to the alert config if it was open. A red bar on the left side of the alert config would match nicely with the rest of our UI:
Not so simple after all.
To my surprise not a single developer with server monitoring experience was able to say that the red bar indicated the alert was open. Some people thought it meant the alert was invalid and some thought it meant the alert was turned off. One possible fix was to replace the bar with the notification centre icon:
But this didn’t feel right. The icon works well for communicating “notification centre”, but not for “this alert is open”. Since billions of users worldwide know how the Facebook notification icon works, we would take that approach:
Existing users were initially confused because the open alert icon occupies the same space as a now defunct icon. However new or inexperienced users showed no such confusion. Usability tests showed that existing users would quickly adapt so we did not make any changes.
A usability win for server monitoring.
We think the new alert configuration UI is a massive step forward for usability in server monitoring. It’s also a solid foundation for many advanced features that will make server monitoring easier for devops teams worldwide.
As well as providing some interesting tidbits on iconography, sensible defaults, keyboard controls and overall usability I hope this post proves the value of a solid design process. Research, usability testing and iteration are an integral part of design. It’s these things that help make interfaces intuitive and usable.