Server Alerts: a Step by Step Look Behind the Scenes


By David Mytton,
CEO & Founder of Server Density.

Published on the 19th November, 2015.

Update: We hosted a live Hangout on Air with some members of the Server Density engineering and operations teams, in which we discussed the infrastructure described in this blog post. The video is embedded at the bottom of this post.

Alert processing is a key part of the server monitoring journey.

From triggering webhooks to sending emails and SMS, this is where all the magic happens. In this article we will explore what takes place in those crucial few milliseconds, from agent data landing on our servers to the final alert appearing on your device.

But before we dive into the inner workings of alerting, we’d be remiss if we didn’t touch on the underlying technology that makes it all possible.

We’re all about Python

At its core, Server Density is a Python company.

Having an experienced Python team is only a part of it. For every performance problem we’ve faced in our infrastructure over the years, we’ve always managed to find a Python way to solve it.

There is so much to like about Python.

We love its syntax and the way it forces us to write readable code. By following the PEP-8 spec we ensure the code is easy to read and maintain. We also appreciate Python’s mature unit testing tools, which pay for themselves many times over in our day-to-day work. And while we don’t expect 100% test coverage, we strive to get as close to it as we can. Python offers simple and scalable tooling for that.

Another key feature is simplicity. From prototyping testing scripts to proof of concept APIs, Python provides numerous small wins and speeds up our workflow. Testing new ideas and trying new approaches is much quicker with Python, compared to other languages.

Last but not least, Python comes with “batteries included”. The vast number of available modules, covering just about everything you can imagine, makes Python a truly compelling platform for us.

Our server alerts stack

Our stack is not 100% Python. That said, all our backend development is Python based. The sections below walk through the technologies we use at each step.

Now, let’s take a behind-the-scenes look at the alert processing workflow.

1. Entering the Server Density Network

The agent only ever sends data over HTTPS, which means no special protocols or firewall rules are needed. It also means the data is encrypted in transit.

It all starts when the JSON payload (a bundle of key metrics the user has chosen to monitor) enters the Cloudflare network. It is then proxied to Server Density and travels via accelerated transit to our Softlayer POP. Using an anycast routed global IP, the payload then hits our Nginx load balancers. Those load balancers are the only point of entry to the entire Server Density network.
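To make the transport concrete, here is roughly what an agent check-in looks like from the outside: a JSON payload posted over plain HTTPS. The endpoint URL, agent key and metric names below are invented for the example and are not the actual agent implementation.

```python
# Illustrative only: a JSON payload posted over HTTPS, the way the agent
# reports in. URL, agentKey and metric names are made up for this sketch.
import json
import urllib.request

payload = {
    "agentKey": "0123456789abcdef",  # hypothetical key
    "os": "linux",
    "loadAvrg": 0.42,
    "memPhysUsed": 2048,
}

request = urllib.request.Request(
    "https://example.serverdensity.io/postback",  # illustrative endpoint
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(request) as response:
    print(response.status)  # 200 if the payload was accepted
```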

2. Asynchronous goodness

Once routed by the load balancers, the payload enters a Tornado cluster (4 bare-metal servers, each running one Tornado instance per core across its 8 cores) for processing. As part of this cluster, we use the kafka-python library to implement the producer. This Tornado app is responsible for the following (a simplified sketch follows the list):

  • Data validation.
  • Statistics collection.
  • Basic data transformation.
  • Queuing payloads to Kafka to prepare them for step 3 below.
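Here is a minimal sketch of what such a handler could look like, using Tornado and the kafka-python producer API. The topic name, port and validation rules are assumptions for illustration, not our production code.

```python
# Sketch of a Tornado handler that validates an agent payload and queues
# it to Kafka via kafka-python. Topic, port and checks are illustrative.
import json

import tornado.ioloop
import tornado.web
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

class PostbackHandler(tornado.web.RequestHandler):
    def post(self):
        try:
            payload = json.loads(self.request.body)
        except ValueError:
            raise tornado.web.HTTPError(400, "invalid JSON")

        if "agentKey" not in payload:  # basic data validation
            raise tornado.web.HTTPError(400, "missing agentKey")

        # queue the payload for the Storm topology (step 3 below)
        producer.send("payloads", json.dumps(payload).encode("utf-8"))
        self.write({"status": "ok"})

def make_app():
    return tornado.web.Application([(r"/postback", PostbackHandler)])

if __name__ == "__main__":
    make_app().listen(8080)
    tornado.ioloop.IOLoop.current().start()
```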

3. Payload processing

Our payload processing starts with a cluster of servers running Apache Storm. This cluster runs a single topology (a graph of spouts and bolts connected by stream groupings), which is where all the key work happens.

While Apache Storm is a Java-based solution, all our code is in Python. To do this, we use the multi-lang feature offered by Apache Storm. It allows us to use special Java-based spouts and bolts which execute Python scripts containing all our code. Those are long-running processes which communicate over stdin and stdout following the multi-lang protocol defined by Apache Storm.
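On the Python side, a multi-lang bolt is a small script built around the storm.py helper that ships with Apache Storm. The sketch below is only an illustration of the shape of such a bolt; our real bolts do considerably more work.

```python
# Minimal multi-lang bolt using the storm.py helper that ships with
# Apache Storm. The tuple layout (payload as the first value) is assumed.
import storm

class ExampleBolt(storm.BasicBolt):
    def process(self, tup):
        payload = tup.values[0]   # incoming payload from the upstream spout/bolt
        # ... transform or enrich the payload here ...
        storm.emit([payload])     # hand it on to the next bolt in the topology

ExampleBolt().run()
```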

Cluster coordination is handled by Zookeeper, so the output of one process can automatically end up at a process running on another node.

At Server Density we have split the processing effort into isolated steps, each implemented as an Apache Storm bolt. This way we are able to parallelise work as much as possible. It also lets us keep our current internal SLA of 150ms for a full payload processing cycle.

4. Kafka consumer

Here we use the standard KafkaSpout component from Apache Storm. It’s the only part of the topology that is not Python based. It connects to our Kafka cluster and injects the next payload into our Apache Storm topology, ready to be processed.

5. Enriching our payloads

The payload also needs some data from our database. This information is used to figure out some crucial things, like what alerts to trigger. Specialized bolts gather this information from our databases, attach it to the payload and emit it, so it can be used later in other bolts.

At this point we also verify that the payload is for an active account and an active device. If it’s a new device, we check the quota of the account to decide whether we need to discard it (because we cannot handle new devices on that account), or carry on processing (and increase the account’s quota usage).

We also verify that the provided agentKey is valid for the account it was intended for. If not, we discard the payload.
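The checks in this step boil down to a few simple rules. The sketch below is a hypothetical rendering of that logic; the field and attribute names are not our real schema.

```python
# Hypothetical illustration of the enrichment-stage checks. Field names,
# attributes and return values are assumptions, not the real implementation.
def should_process(payload, account, device):
    if not account.active or not device.active:
        return False  # discard: inactive account or device
    if device.is_new and account.quota_used >= account.quota:
        return False  # discard: account cannot handle new devices
    if payload.get("agentKey") != device.agent_key:
        return False  # discard: agentKey does not match the account
    return True       # carry on processing
```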

6. Storing everything in metrics

Each payload needs to be split into smaller pieces and normalized before it can be stored in our metrics cluster. We group the metrics and generate a JSON snapshot every minute; each snapshot is retained for five days. We also store metrics in an aggregated format once every hour. That’s the permanent format we keep in our time series database.
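As a rough illustration of the per-minute grouping, the sketch below buckets raw metric points into minute snapshots. The input format and the simple averaging are assumptions, not our actual storage code.

```python
# Illustrative only: bucket raw metric points into per-minute snapshots.
from collections import defaultdict
from statistics import mean

def minute_snapshots(points):
    """points: iterable of (unix_timestamp, metric_name, value) tuples."""
    buckets = defaultdict(lambda: defaultdict(list))
    for ts, name, value in points:
        buckets[ts - ts % 60][name].append(value)  # group by minute boundary
    return {
        minute: {name: mean(values) for name, values in metrics.items()}
        for minute, metrics in buckets.items()
    }
```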

7. Alert processing

In this step we match the values in the payload against the alert thresholds defined for the given device. If a wait time is set, the alert starts a counter and checks subsequent payloads to see whether it has expired.

When the counter expires (or if there was no wait value to begin with), we go ahead and emit all the necessary data to the notification bolt. That way, alerts can be dispatched to users based on the preferences for that particular alert.
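Here is a hedged sketch of that wait-time logic, with hypothetical names: a threshold breach only fires once it has persisted for the configured wait period.

```python
# Hypothetical sketch of threshold matching with a wait period. Names,
# the in-memory dict and the ">" comparison are assumptions for clarity.
import time

pending = {}  # alert_id -> timestamp of the first breaching payload

def should_fire(alert_id, value, threshold, wait_seconds, now=None):
    now = now if now is not None else time.time()
    if value <= threshold:
        pending.pop(alert_id, None)  # back to normal: reset the counter
        return False
    first_seen = pending.setdefault(alert_id, now)
    if now - first_seen >= wait_seconds:  # breached for long enough: fire
        pending.pop(alert_id, None)
        return True
    return False  # still within the wait window
```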

8. Notifications

Once we’ve decided that a particular payload (or the absence of one) has triggered an alert, one of our bolts calculates which notifications need to be triggered. Then we send one HTTP request per notification to our notifications server, another Tornado cluster (we will expand on its inner workings in a future post, so stay tuned).
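To picture the fan-out, here is a hypothetical sketch of one HTTP request per notification going to the notifications service; the URL and payload shape are invented for the example.

```python
# Illustrative fan-out: one HTTP request per notification. The endpoint
# URL and the payload fields are assumptions, not the real service.
import requests

def dispatch(alert, recipients):
    for recipient in recipients:
        requests.post(
            "https://notifications.example.internal/send",  # hypothetical
            json={
                "alertId": alert["id"],
                "channel": recipient["channel"],  # e.g. email, SMS, webhook
                "address": recipient["address"],
            },
            timeout=5,
        )
```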

Summary

Everything happens in an instant. Agents installed on 80,000 servers around the world send billions of metrics to our servers. We rely on Python (and other technologies) to keep the trains running, and so far we haven’t been disappointed.

We hope this post has provided some clarity on the various moving parts behind our alerting logic. We’d love to hear from you. How do you use Python in your mission critical apps?

Tech chat: processing billions of events a day with Kafka, Zookeeper and Storm


  • Mohit Khanna

    Hi @David,

    Excellent read.

    This gives some confidence to our dev team too, as we have a very similar technology stack for our purpose – though instead of Python we use NodeJS (for REST API endpoints) and Java for our Storm processing. We also use Fluentd extensively to stream data into Kafka.

    A few more inputs on Kafka might help us here:

    1. How has your experience with Kafka in production been?
    2. What version of Kafka are you at?
    3. What version of Storm do you use?
    4. Have you ever faced administration issues with maintaining Kafka brokers?
    5. Do you have any need for backing up the entire dataset maintained in Kafka – for say redo/recovery operations?
    6. Any good resources on maintaining Kafka in production that you could point us to?

    Keep up with the blogs – we enjoy reading them! :-)

    • Hi Mohit Khanna!

      We plan to write a specific post about Kafka as part of this series describing our infrastructure. However, I can share this info with you now, no need to wait for that post :).

      1. How has your experience with Kafka in production been?

      So far we haven’t had any major issues with it, other than the initial work of understanding a new service and tuning it properly for our needs.

      2. What version of Kafka are you at?

      0.8.2. We try to use the latest version to get the latest bug fixes.

      3. What version of Storm do you use?

      0.9.4. The same comment as before applies; we are preparing the upgrade to 0.10.

      4. Have you ever faced administration issues with maintaining Kafka brokers?

      When we started using them in production we had to rebuild the cluster a couple of times. The way our system is set up, that can be done quickly and the impact on our service is limited; however, that was mostly while we were testing the system before moving it into production.

      The biggest issue we are working on right now is improving our high availability between data centers without losing any payloads stored in Kafka. Our current setup may cause a couple of minutes of data loss while everything is routed to the other system.

      5. Do you have any need for backing up the entire dataset maintained in Kafka – for say redo/recovery operations?

      Not really. Our use case is real-time processing; we take around 150ms for a full payload processing cycle, and the nature of our data doesn’t allow for delays in the flow, or the alerts would not be delivered in time. We have ways to store a problematic payload in a temporary database for inspection if needed, but that doesn’t involve Kafka.

      6. Any good resources on maintaining Kafka in production that you could point us to?

      Our setup is not complex at all, so the only documentation we used is what’s available on the Kafka website (kafka.apache.org) and its mailing list (users@kafka.apache.org).

      We use https://github.com/whisklabs/puppet-kafka to handle its deployment, but we plan to switch to a .deb based installation.

      I hope this helps :)

    • We’re running a tech chat on this next week: https://plus.google.com/events/c1jv1lkaoh7akrjaecr3q1u9fj8

      (And for those reading this after the event, it has been recorded to our YouTube channel. Just follow the above link!)

      • Mohit Khanna

        We will be attending for sure!

  • Anbu

    Excellent article @dmytton:disqus. I started following Server Density for the MongoDB articles and became a regular reader because of the sheer quality of articles like this one.

    Keep up with the good work!

    • Max Zahariadis

      Thanks for reading Anbu, we most definitely will!

  • dennyzhang.com

    Thanks. I’m very interested in the metrics part.

    Could you share a bit more about your tech stack? Will you also integrate that with Slack seamlessly?
