This week, we announced the release of our Ops Dashboard, allowing users to create custom dashboards to visualise their devices and services.
Naturally, we use Server Density daily to monitor and manage the machines and services that keep us going. As a result, we’ve been making extensive use of the dashboard ourselves to visualise and react to issues, and it’s quickly become one of our favourite features. This post gives an overview of how we use it day to day, in case you are curious about our setup.
First, here’s our Ops Dashboard:
We run this on a wall-mounted 42″ TV in our office, so it is always available and can be read at a glance. This lets people in the office pick up on issues quickly, often before an alert is triggered, which helps us improve our response time.
The first row contains three service checks:
These three checks cover our three biggest customer-facing areas: the v2 login page for the account we use, the v1 API, and the v1 login page for the account we have there. This gives us a good overview of speed degradation and lets us see quickly if any services appear to be down from a customer perspective.
Next up is a row giving us an overview of issues that might be affecting us:
This gives us a view of the open alerts on our account. We’ve included the overall figure, but it fluctuates for a number of reasons and usually includes a few alerts for non-critical things we should be looking at (e.g. we have alerts for when our disk space reaches 90%, 95% and 99%, so we can pick up on issues before they start affecting the service). We therefore also include alert counts for a number of critical groups that are important for us to keep an eye on.
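To make the tiered disk space alerts concrete, here is a minimal Python sketch of the idea. It is not Server Density's alerting API: the function name and the severity labels ("notice", "warning", "critical") are assumptions made for illustration, and only the 90%, 95% and 99% thresholds come from our setup.

```python
from typing import Optional

# Thresholds mirror the 90% / 95% / 99% tiers described above.
# The severity labels are hypothetical, chosen only for this example.
DISK_THRESHOLDS = [
    (99.0, "critical"),
    (95.0, "warning"),
    (90.0, "notice"),
]


def disk_alert_level(used_percent: float) -> Optional[str]:
    """Return the label of the highest threshold crossed, or None."""
    # Thresholds are listed from highest to lowest, so the first match
    # is the most severe tier that applies.
    for threshold, level in DISK_THRESHOLDS:
        if used_percent >= threshold:
            return level
    return None
```

The lower tiers are what keep a handful of non-critical alerts open on the overall figure, which is why the per-group counts sit next to it.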
The alerts widget is new this month, and allows you to view the number of open alerts on an account, group and device/service level.
Our infrastructure runs on Softlayer, so we monitor their status using our cloud status widget, which gives us an overview of the Softlayer RSS feed. This widget supports the major cloud/hosting companies, including AWS, Rackspace, Joyent, Google App Engine, and Digital Ocean, to name just a few.
Now we move on to a number of graphs, the first three rows of which help us to monitor processing queues and the load on the MongoDB clusters that run v2:
This gives us a nice grouping of related metrics, as issues with our queues tend to cause or be caused by issues with MongoDB.
Finally, we have a grouping of load and memory graphs for our four v2 clusters:
Using a vertical layout (I’ve split the above screenshots apart for legibility) allows us to see trends across clusters over time, so a spike in load on one cluster can be compared instantly with the load on the other clusters to spot any emerging patterns.
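The comparison we do by eye on the wall can be approximated in a few lines of Python. This is only a rough sketch under assumptions of our own: the cluster names, the `factor` of 2 and the use of the median as a baseline are all hypothetical, and nothing here reflects how the dashboard itself works.

```python
from statistics import median
from typing import Dict, List


def isolated_spikes(latest_load: Dict[str, float], factor: float = 2.0) -> List[str]:
    """Return clusters whose load is well above the median of all clusters.

    A cluster is flagged when its load exceeds `factor` times the median,
    so a rise shared by every cluster is not reported as an isolated spike.
    """
    baseline = median(latest_load.values())
    if baseline <= 0:
        return []
    return sorted(
        name for name, load in latest_load.items() if load > factor * baseline
    )
```

A spike that shows up on one cluster only points at that cluster, whereas load rising everywhere at once suggests a shared cause such as traffic or a common dependency.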
All these graphs use the graph builder feature, so each one is built from a number of different metrics across multiple devices and/or services, which makes them very powerful and information-dense.
So that’s it. Now that you know how we’re using our dashboard, has it inspired you to add a few new widgets to yours? We’d love to know how dashboards fit into your current workflow. If they haven’t made it in yet, you can always try our dashboards; they’re free for 15 days.