How to design useful monitoring graphs and visualizations
Daniele De Matteis, Designer at Server Density.
Published on the 4th October, 2013.
When we attended Monitorama EU 2013 in Berlin, we had a lot of fun, spoke to some great people and listened to lots of great talks. Maybe you even managed to grab one of our limited edition ‘Just monitor it!’ tees, or one of our awesome notebooks or stickers.
We also took to the stage ourselves to speak about the new Server Density graphs and our process for visualising monitoring in general, which is what we’ll go through in this post.
Graphs are crucial
There’s a lot of talk today about automated detection and prediction of anomalies being the most sought-after direction for monitoring. These techniques could free us from manually managing a multitude of items 24/7 with just a few pairs of human eyes and hands. We will no doubt see them shaping a good deal of our workflows in the future, surprising us with new views on monitoring and more comfortable practices.
But even though you can automate pattern recognition, automatically analyse trend deviation, or implement any other kind of detection and prediction, there are still a couple of catches: it is more than likely that the systems will not be prepared for all possible scenarios, and it is also likely that the algorithms will break on edge cases. In such situations you still need to see through and understand what’s happening, decide for yourself what the issue is, and fix it.
The importance of graphing lies in the fact that it’s your window onto the monitored system. It allows you to see the monitored object via a representation of it, built on the metric sampling. Ultimately, it is the most meaningful way we know to inspect a machine’s behaviour.
So graphs are important, which leads us to a couple of legitimate questions to answer:
- What is good graphing?
- How do you achieve it?
1) Good graphing is a representation of the system state that gives a meaningful description, maximises understanding and insight, lets you inspect the system and detect issues easily and pleasantly, and motivates you to take action and solve problems.
2) It is a very complex design process made up of thousands of big and small decisions (and we don’t even know all of them!). It is still possible to isolate a few key principles that have been fundamental in building such a representation. At least in our case.
These principles are rooted in data visualisation literature and have sometimes emerged from user feedback or internal talks.
A few principles
We’re talking about consistency between the monitored object and its representation: the display should correctly represent the physical characteristics of the object. Coherence in the data structure must therefore be carried through the whole chain, from backend to frontend.
Problem: Chart needs to make sense, asap!
A coloured line going up and down is sometimes not the appropriate graph type, and it can give a false or incomplete representation of the metric set.
Solution: Ad-hoc chart type definitions
For each of the metric groups we picked the chart type that correctly represents the reality of the represented object. See the screenshot below:
The top one is a memory usage graph from Server Density v2. It plots series as areas and stacks them, building a whole from the individual parts. The one under it is the same graph from v1, with the same time window and the same metrics, but the series are simply overlaid.
In the first one understanding is instinctive because the object and representation match. You can clearly see an increase in physical memory usage (dark blue area), whereas the other graph demands an effort to understand what’s going on: you need to map the visuals to the object by yourself.
The goal is to expose the relationship between the series, thus making the chart more meaningful. In the case above a stacked area chart was chosen because the memory metrics all share the same unit (MB) and sum up to a whole.
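As a rough illustration of why stacking works for memory, here is a minimal sketch of how the stacked bands could be computed from raw series. The function name, the metric names and the sample values are all hypothetical; this is not the Server Density implementation.

```python
def stack_series(series):
    """Compute (lower, upper) bounds per series for a stacked area chart.

    `series` maps a metric name to a list of values. All series must share
    the same unit and the same sampling timestamps, otherwise summing them
    makes no sense.
    """
    if not series:
        return {}
    length = len(next(iter(series.values())))
    baseline = [0.0] * length
    stacked = {}
    for name, values in series.items():
        if len(values) != length:
            raise ValueError(f"series {name!r} has a different length")
        # Each band sits on top of the running total of the bands below it.
        upper = [base + value for base, value in zip(baseline, values)]
        stacked[name] = (baseline, upper)
        baseline = upper
    return stacked


# Hypothetical memory metrics, all in MB, sampled at the same three moments.
memory = {
    "physical_used": [512, 768, 1024],
    "cached": [256, 256, 128],
    "free": [1280, 1024, 896],
}
stacked = stack_series(memory)
total = stacked["free"][1]  # top edge of the last band = the whole
```

Because the parts sum to the total installed memory, the top edge of the last band stays flat, and growth in one band visibly eats into the others.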
We went through the same kind of reasoning for all of the main metric groups of a server and will provide defaults for each of them in our device graphs.
Sense of place is vital for understanding what you’re looking at. Efficiently displaying hierarchies and labelling is key to this, as is showing all the necessary information.
Problem: Where am I, what am I looking at?
Avoiding user disorientation should be a top concern when designing. It is in fact easy to get lost when there aren’t enough visual cues or they are poorly managed (lack of content, bad naming, odd positioning etc.). The user journey then becomes a nightmare of constantly rebuilding or remembering the context.
Solution: Display data hierarchy
First of all you want a clean hierarchy and naming in the data. In our case we structured the data coming from the backend into a specific format which allows for both flexibility and solid categorisation of metrics. A simplified version of it:
metrics group > n nested levels > metric > unit: datapoints
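One way to picture the simplified format above is as a nested mapping that can be walked to produce a label for every metric. The sketch below is an assumption for illustration only: the key names (`unit`, `datapoints`), the sample groups and the `iter_metrics` helper are hypothetical and not the actual backend format.

```python
def iter_metrics(node, path=()):
    """Yield (path, unit, datapoints) for every metric in a nested group tree.

    A mapping holding both "unit" and "datapoints" is treated as a metric;
    any other mapping is treated as one more nesting level.
    """
    for key, value in node.items():
        if isinstance(value, dict) and "unit" in value and "datapoints" in value:
            yield path + (key,), value["unit"], value["datapoints"]
        elif isinstance(value, dict):
            yield from iter_metrics(value, path + (key,))


# Hypothetical payload: group > nested levels > metric > unit: datapoints.
metrics = {
    "Disk": {
        "/dev/sda1": {
            "used": {"unit": "GB", "datapoints": [(1380844800, 41.2), (1380844860, 41.3)]},
        },
    },
    "Memory": {
        "physical used": {"unit": "MB", "datapoints": [(1380844800, 512)]},
    },
}

# Build breadcrumb-style labels that keep the full hierarchy visible.
labels = [f"{' > '.join(path)} ({unit})" for path, unit, _ in iter_metrics(metrics)]
```

Keeping the full path with each metric means the UI never has to guess where a series belongs.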
Then you can display all this with the UI in a clean and meaningful way:
Solution: Display as much ‘as possible’
The more information there is, the more context you offer. Beware though, as comprehension is directly related to the amount of information to process. So ‘as possible’ here refers to not compromising clarity. Just stop when the display starts to be fatiguing and understanding suffers. Remember that UIs are reactive and their state can change because of user interaction; we’ll talk more about control later.
We think about clarity as the principle for structuring content.
Avoid clutter and irrelevant or non-functional ink as much as possible, and optimise the use of space. A few relevant cases:
Problem: Too many line series even for my 27″!
Having to display a large number of series is always problematic, mainly because we deal with a limited space (the screen) to fit many items, all deserving of vertical resolution. This leads to infinite scrolls and loss of contextual information. So it is a trade-off between how many things you look at and how detailed they are. Overlaying (multiline charts) is not an option in these cases, because it results in chaos.
Solution: Welcome Horizon graphs
Our solution to this problem will be the use of horizon graphs, as a means for displaying a large number of series in a limited vertical space.
The main feature of horizon graphs is that they reduce the use of space while preserving resolution. This is achieved by splitting the y-axis into a few bands of uniform value ranges and then collapsing these bands on top of one another and layering them. There’s more to horizon graphs like mirroring negative values, but this is enough for our case.
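The band splitting described above can be sketched in a few lines. This is a minimal illustration that assumes non-negative values and a known maximum; the `horizon_bands` function and its parameters are hypothetical, and the mirroring of negative values is left out, as in the text.

```python
def horizon_bands(values, num_bands, max_value):
    """Split each value into the portion that falls inside each band.

    Returns `num_bands` lists. Band k holds, for every sample, how much of
    the value lies between k * band_height and (k + 1) * band_height. When
    drawn, all bands share the same vertical strip and are layered on top
    of one another, typically with darker colours for higher bands.
    """
    if num_bands < 1 or max_value <= 0:
        raise ValueError("num_bands must be >= 1 and max_value must be > 0")
    band_height = max_value / num_bands
    bands = []
    for k in range(num_bands):
        lower = k * band_height
        # Clip each value to the slice of the y-axis covered by this band.
        bands.append([min(max(v - lower, 0.0), band_height) for v in values])
    return bands


# Four samples on a 0-100 scale, folded into 4 bands of height 25.
bands = horizon_bands([10, 45, 80, 100], num_bands=4, max_value=100)
```

The chart then needs only a quarter of the original height, while summing the bands for any sample gives back the original value, so no resolution is lost.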
We could have gone with sparklines, letting go of vertical resolution, but we went instead for increasing the density in the display, compressing information, and optimising the use of space. The advantages of this are:
- We can display a lot of series in a small space.
- It is possible to look at events across all metrics with just a glance and easily spot high or low value trends, which are represented by high or low colour presence.
Being new, Horizon graphs will require some getting used to for users, but we’ve decided to bet on them and put them in our roadmap.
Problem: There is too much on this page…
There are other cases when users are loaded with so much information that they can’t possibly process all of it in one go.
This is often the case with over-decorative designs or simply cluttered displays. Since short-term memory is limited, too much information slows down processing and hurts both experience and usefulness.
Solution: When in doubt, less ink!
Don’t put something on the page unless it serves a purpose and is useful and relevant given the specific use scenario. A full-featured graph with multiple axes, full grids, legend, etc. is perfectly fine for standalone processing, but when the focus is on comparing you need to reduce the amount of information displayed, simply because much of it is irrelevant in that context.
In short, you should display as much information as your story needs, and create perspective to make the display telling.
The organisation of the layout has a great influence on a user’s interpretation and understanding. Alignment and positioning have to be used as features that increase the usefulness of the display.
Problem: Here’s a spike, so what?
A metric’s spike by itself could be useful but it doesn’t tell us much about what’s happening on a wider scale across the whole machine or system. It is just a part of the picture, missing other related events recorded in other metrics.
Solution: Expose system events
Showing relationships is the solution to this: putting graphs in perspective and revealing the connectedness of events across metrics. In other words, we put back together the object that we split apart in the first place using our agent.
We laid out elements carefully and reserved a wide column to line up the graphs vertically. The aim was to expose the shared timeframe across the graphs, all the different X axes representing the same sequence of moments. Such a layout allows for immediate vertical scanning of events, making it extremely easy to look between the graphs and tell if a spike or trough happened at the same time, before, after, etc. This can instantly reveal a relationship between different metric events.
Sharing the time window reveals patterns across graphs and times, uncovering correlation, causation, dependence, and ultimately system events.
Another nifty feature for adding perspective is showing simultaneous data point highlights on all the graphs. Ideally this comes after spotting events through vertical scanning, when you might want to get in-depth information about what happened across multiple graphs at a given moment in time.
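A simple way to think about simultaneous highlights: given the moment under the cursor, look up the closest data point in every graph. The sketch below assumes each graph holds a time-sorted list of `(timestamp, value)` pairs; the function names and sample data are hypothetical.

```python
from bisect import bisect_left


def nearest_point(datapoints, timestamp):
    """Return the (timestamp, value) pair closest in time to `timestamp`.

    `datapoints` must be sorted by timestamp. Returns None for an empty list.
    """
    if not datapoints:
        return None
    times = [t for t, _ in datapoints]
    i = bisect_left(times, timestamp)
    if i == 0:
        return datapoints[0]
    if i == len(datapoints):
        return datapoints[-1]
    before, after = datapoints[i - 1], datapoints[i]
    # Pick whichever neighbour is closer; ties go to the earlier point.
    return before if timestamp - before[0] <= after[0] - timestamp else after


def highlight_all(graphs, timestamp):
    """Find the point to highlight on every graph for one shared moment."""
    return {name: nearest_point(points, timestamp) for name, points in graphs.items()}


# Two hypothetical graphs sampled at slightly different moments (seconds).
graphs = {
    "cpu": [(0, 12.0), (60, 95.0), (120, 20.0)],
    "load": [(0, 0.4), (55, 3.1), (125, 0.6)],
}
highlights = highlight_all(graphs, 58)
```

Here hovering at second 58 highlights the CPU spike at second 60 and the load spike at second 55 together, which is exactly the kind of relationship the aligned layout is meant to expose.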
Design to gratify users by providing a pleasant UI. The assumption is that good UX engages and activates users, creating an enthusiastic state of mind which encourages action and a willingness to investigate and analyse.
This is a fundamental one, which is often overlooked because there’s ‘more pressing stuff to deal with’, which isn’t necessarily true. Very often, in fact, the key aspect of a useful monitoring tool is to create understanding and trigger action in order to actually put your infrastructure back on track.
So building motivation and a positive state of mind are core features when judging a tool’s effectiveness, because they impact on the final goal.
Problem: Chart is boring, imma go back to sleep…
An unappealing environment gets in the way of a good workflow. The last thing you want is to be woken up in the night and have to deal with a boring, frustrating UI, or dull scribblings on a screen. Also, we only live once so the happier the better!
Solution: Wait, let’s make it worth it!
The solution is to make it worth it. But how exactly do you increase appeal? Our take was to direct all of the visual aspects that impact on how the UI is perceived and consumed:
form, weights and composition, colour choice, layout balance.
- Balanced spacing to create intelligible blocks of content.
- Form consistency in shapes and typography weight in the graph views and in relation to the app frame.
- Directing flow in two directions: top > down, with the left column for metadata and the right column for scanning graphs; left > right, to inspect each graph as a separate row of content.
- Building contrast to highlight important sections and pieces of info in the view.
- Colouring was a tricky one. The challenge was to use efficient colouring while making it pleasant and respectful of the section identity. By efficient we mean showing data sets with qualitative differences while maximising contrast and accessibility.
We built a colour engine that picks colours progressively according to a few rules. To stay consistent with the app section identity, the engine starts from the section’s base hue and then picks analogous hues with a perceivable difference. To put the colours further apart, each chosen hue is then modified in lightness and saturation. This way each colour differs from its neighbours in all three dimensions of the HLS colour space.
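To make the idea concrete, here is a minimal sketch of a palette picker along those lines, using Python’s standard `colorsys` module. The step size and the lightness and saturation levels are arbitrary assumptions chosen for illustration, and the function names are hypothetical; the real engine’s rules are not described in enough detail to reproduce.

```python
import colorsys


def pick_colours(base_hue, count, hue_step=0.06,
                 lightness=(0.40, 0.60), saturation=(0.85, 0.55)):
    """Pick `count` colours as (hue, lightness, saturation) tuples in [0, 1].

    Hues advance in small steps from `base_hue`, so they stay analogous to
    the section colour. Lightness and saturation alternate on every step,
    so each colour differs from its neighbours in all three dimensions.
    """
    colours = []
    for i in range(count):
        hue = (base_hue + i * hue_step) % 1.0
        colours.append((hue, lightness[i % 2], saturation[i % 2]))
    return colours


def to_hex(hls):
    """Convert an (h, l, s) tuple to a CSS-style hex colour string."""
    r, g, b = colorsys.hls_to_rgb(*hls)
    return "#{:02x}{:02x}{:02x}".format(round(r * 255), round(g * 255), round(b * 255))


# Hypothetical section base hue (a blue) and a five-series graph.
palette = pick_colours(base_hue=0.58, count=5)
hex_palette = [to_hex(colour) for colour in palette]
```

Alternating only two levels keeps the sketch short; a real engine would likely also check contrast against the background and against non-adjacent series.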
Design for interaction: distribute content in time as well as space, be aware of possible user actions, and provide alternatives.
Problem: Where’s the rest of this?
With complex displays, showing everything is often a bad idea, plus it is not always clear what everything is. In order to preserve clarity, appeal, and focus perspective, information HAS to be filtered. Still, and naturally, the user needs more.
Solution: 1 click away, but now you look for it ;)
Why not just leave them the choice then? We’re lucky because we work on interactive and dynamic displays: the UI, its states and the information displayed can all react to user-initiated changes and update. So the ideal scenario is that the user is fed a manageable amount of information, finds their way through it and goes in depth. That’s when interaction kicks in and where the next level of information should be, one or a few clicks away.
To create this scenario it is important to build content in depth and distribute control on the view elements.
Wrap up time.
Consistency: Maintain consistency between the real object, the data structure and its representation.
Perspective: Put layout elements in perspective and help uncover stories behind individual events.
Context: Provide a sense of place to facilitate user understanding. Display hierarchies, naming and relevant info.
Clarity: Ensure the clarity of the display, avoid clutter and irrelevant content.
Appeal: Provide a pleasant UX to gratify and motivate users to get to their goals.
Control: Design for interaction and build in depth, then give control to users.
If you want to see more, there is a video of our talk:
and the slides are available as well:
A few final thoughts
We used these principles as guidelines rather than applying them blindly. Nicely though, at the end of the design process many of the key solutions match those principles entirely, and the pieces of the layout have fallen into place as if they were meant to from the very beginning. It is safe to say that all of these principles have proven to be very important to us when producing graphs that make the most out of your metrics.
Last month we released the first version of our new graphs for Server Density, but we still have many improvements to make. Design is an iterative process, but now that we’ve figured out the basic principles discussed above, we can continue making improvements to things like defaults, dealing with large numbers of series and anything else that comes up from user feedback.
Ultimately we hope that what we discussed here is useful for you to see the importance of good design not as decoration but as a feature.