Monitoring meta-Monitoring

By David Mytton,
CEO & Founder of Server Density.

Published on the 17th March, 2016.

Our rallying cry to server monitoring users out there is deceptively simple.

Spend more time with your customers. Spend more time building your business. Spend more time with family. Leave server monitoring to those who do it for a living.

Why is that deceptively simple?

We’ve been through that journey and know how it feels. Delegation doesn’t come easy for us engineers. Why? Because we’d rather do it ourselves. If we delegate the job, it might not get done in the perfect way we envisaged.

So we know we’re asking a lot from customers. We’re asking them to stop monitoring their servers themselves. To stop hosting their own monitoring servers. We know that’s quite a mountain to climb. That’s why we treasure and value every single signup we get.

We do our utmost to honor the trust of thousands of customers every day. But as the cliché goes, with great power comes great responsibility.

How do we honor our “guaranteed alert delivery” promise? How do we make sure our infrastructure doesn’t fail? What tools do we use, in order to notice any hiccups before our customers ever do? Who watches the watchmen?

Monitoring Server Density with Server Density

Let’s address the elephant in the room. Are we creating a single point of failure when we monitor our tools using our tools?

The answer is an unequivocal no.

The premise of server monitoring is that we spot, diagnose, and fix issues well in advance, before they turn into outages. Internal aberrations notwithstanding, our service is resilient enough to keep working uninterrupted. And it does.

By using Server Density to monitor Server Density, we are placing a vote of confidence in our product. That’s not to say we shun third-party options; that would be foolish. We can always fall back on New Relic and Pingdom if needed. But we only turn to those in rare circumstances, i.e. when our own service is unresponsive.

As evidenced in our uptime metrics, that scenario doesn’t play out that often.

How we monitor Server Density

Server Density comprises a set of services. We monitor each one individually, with a set of alerts and dashboards.

Dashboards let us spot any metric that is outside the range of what we define as ordinary. We use the 30 day view to evaluate things like execution times and service error rates. When there is a spike, a dip, or anything funny, there are two ways we get informed.

Proactive Dashboards

Every Monday morning our Operations Lead, Pedro, and his team review all dashboards in preparation for a weekly ops call. In that call, we assess the magnitude and severity of any deviations and, depending on the nature of the issue, the Ops team may parachute things right into our development cycle.

Reviewing dashboards is not a once-a-week process. Far from it. Pedro and his team, even David, our CEO, have dedicated tabs with Server Density dashboards open all the time. Up-to-date graphs are always within arm’s reach.

Reactive Alerts

If a particular metric crosses our predefined thresholds, we receive alerts on our devices. One of the first things we then check is our dashboards, as they allow us to triangulate the issue.
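To make the idea of a predefined threshold concrete, here is a minimal sketch of how such a check could be evaluated. It is an illustration only, not how Server Density implements alerting: the `ThresholdRule` class, the `evaluate` function, the metric name, and the "several consecutive samples" requirement are all assumptions made for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ThresholdRule:
    """Hypothetical alert rule: fire when a metric stays beyond a limit."""
    metric: str
    limit: float
    above: bool = True     # True: alert when value > limit; False: when value < limit
    consecutive: int = 3   # breaching samples in a row required before alerting


def evaluate(rule: ThresholdRule, samples: List[float]) -> Optional[str]:
    """Return an alert message if the most recent samples all breach the limit."""
    recent = samples[-rule.consecutive:]
    if len(recent) < rule.consecutive:
        return None  # not enough data yet to decide
    if rule.above:
        breached = all(value > rule.limit for value in recent)
    else:
        breached = all(value < rule.limit for value in recent)
    if not breached:
        return None
    direction = "above" if rule.above else "below"
    return f"{rule.metric} {direction} {rule.limit} for {rule.consecutive} samples"


# Example: alert when a (made-up) replication count stays under 3 for two samples.
rule = ThresholdRule("partition_replication_count", limit=3.0, above=False, consecutive=2)
alert = evaluate(rule, [3.0, 3.0, 2.0, 2.0])
```

Requiring several breaching samples in a row is one common way to avoid paging someone for a single noisy data point; the right number depends on how often the metric is collected.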

The following story illustrates how we use alerts and dashboards to discover, triangulate, and mitigate any issues in our infrastructure.

The curious case of . . . elevated disk I/O

At some point in late January we started getting alerts for a low partition replication count and a drop in the cluster member count. Taken in isolation, those alerts did not convey much about the root cause.

To triangulate this, we pored over several dashboards until we discovered an irregularity in the Zookeeper / Kafka cluster disk utilisation.

[Figure: Zookeeper / Kafka cluster disk utilisation dashboard]

For some mysterious reason, our message handling service was experiencing higher utilisation rates. But why did that happen? Why the jump in I/O activity? Was it additional traffic?

To answer that, we turned our attention to another Server Density performance dashboard: inventory count monitoring.

[Figure: inventory count monitoring dashboard]

Our service has been experiencing constant growth over the years. The operative word is constant: there was nothing sudden or recent that could explain the jump.

In the absence of other solid leads, we decided to review all recent hardware activity on that particular cluster. The Zookeeper / Kafka cluster consists of virtual machines housed in our primary data center in Washington DC.

After several discussions with SoftLayer, we began to piece things together.

To match our ever-increasing traffic (as seen on the earlier dashboard), in late January SoftLayer transitioned us to higher capacity disks. A side effect of that move was slightly lower disk throughput.

SoftLayer’s guidance is that database instances requiring high throughput should be hosted on dedicated instances and disks. That is to say, the increased I/O activity was here to stay (unless we took the dedicated route, an option we were not interested in).

Higher disk I/O did not pose an immediate risk to our uptime. On the other hand, what we were not comfortable with was hosting the postback processing service on a single cluster. A recent outage on that same cluster (caused by a controlled shutdown) convinced us of the need for even more redundancy for postback processing.

The outcome of this investigation crystallized in our decision to extend our N+1 architecture (one more of everything) to this function. So we added multiple independent clusters across multiple DCs.
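The N+1 idea can be expressed as a simple check: if any one cluster disappears, can the remaining ones still carry the peak load? The sketch below is a rough illustration under stated assumptions, not our actual capacity tooling. The function name, the cluster names, and the capacity figures are all hypothetical.

```python
from typing import Dict


def survives_single_failure(capacity: Dict[str, float], peak_load: float) -> bool:
    """Return True if the remaining clusters can carry peak_load after losing any one."""
    if len(capacity) < 2:
        return False  # a single cluster can never be N+1
    total = sum(capacity.values())
    # Losing the largest cluster is the worst case, so checking it covers all others.
    return total - max(capacity.values()) >= peak_load


# Hypothetical capacities in messages per second, one entry per independent cluster.
clusters = {"dc-a": 100.0, "dc-b": 100.0, "dc-c": 100.0}
ok = survives_single_failure(clusters, peak_load=150.0)
```

With a single cluster the check always fails, which is exactly the situation described above before the extra clusters were added.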

We didn’t reduce the disk I/O, but we now understand it better. When we get those alerts, we now know what to look for. And we are far more comfortable with the level of redundancy we have in place.

Summary

Needless to say, all this investigation happened behind the scenes. And that’s the beauty of server monitoring. It’s proactive.

It is that proactive nature of Server Density that allows us to use Server Density to monitor Server Density. Frankly, we can’t think of a better way to spot and triage any imperfections in our infrastructure. We value the trust our customers place in us, so we use the best tool for the job.

What about you?

Do you monitor your monitor? What tools and processes do you have in place? Let us know in the comments.
