Server Alerts: a Step by Step Look Behind the Scenes9 Comments
Update: We hosted a live Hangout on Air with some members of Server Density engineering and operations teams, in which we discussed about the infrastructure described in this blog post. We’ve made the video available, which can be found embedded at the bottom of this blog post.
Alert processing is a key part of the server monitoring journey.
From triggering webhooks to sending emails and SMS, this is where all the magic happens. In this article we will explore what takes place in those crucial few milliseconds, from agent data landing on our servers to the final alert appearing on your device.
But before we dive into the inner workings of alerting, we’d go amiss if we didn’t touch on the underlying technology that makes it all possible.
We’re all about Python
At its core, Server Density is a Python company.
Having an experienced Python team is only a part of it. For every performance problem we’ve faced in our infrastructure over the years, we’ve always managed to find a Python way to solve it.
There is so much to like about Python.
We love its syntax and the way it forces us to write readable code. By following the PEP-8 spec we ensure the code is easy to read and maintain. We also appreciate Python’s industry-leading unit testing capabilities, as they offer invaluable gains to our coding efforts. And while we don’t expect 100% testing coverage, we strive to be as close as we can. Python offers some simple and scalable functionalities towards that.
Another key feature is simplicity. From prototyping testing scripts to proof of concept APIs, Python provides numerous small wins and speeds up our workflow. Testing new ideas and trying new approaches is much quicker with Python, compared to other languages.
Last but not least, Python comes with “batteries included”. The vast amount of available modules (they do everything you can imagine), make Python a truly compelling platform for us.
Our server alerts stack
As you can see, our stack is not 100% Python. That said, all our backend developments are Python based. Here is a comprehensive list of the technologies we use:
- Apache Kafka to queue payloads received.
- Apache Storm as the real time processing queue.
- Zookeeper as the cluster communication system.
- MongoDB as the database.
- Python to handle the processing logic.
- PyMongo to access the Mongo database.
- Kazoo to access Zookeeper.
- Kafka-python to implement the Kafka producer.
- Tornado as the webserver to receive the payloads from the sd-agent.
- Python-statsd to track the performance of the whole system.
Now, let’s take a behind-the-scenes look at the alerts processing workflow.
1.Entering the Server Density Network
The agent only ever sends data over HTTPS which means no special protocols or firewall rules are used. It also means the data is encrypted in transit.
It all starts when the JSON payload (a bundle of key metrics the user has chosen to monitor) enters the Cloudflare network. It is then proxied to Server Density and travels via accelerated transit to our Softlayer POP. Using an anycast routed global IP, the payload then hits our Nginx load balancers. Those load balancers are the only point of entry to the entire Server Density network.
2. Asynchronous goodness
Once routed by the load balancers, the payload enters into a Tornado cluster (4 bare-metal servers comprising 1 tornado instance for each of its 8 cores) for processing. We use the kafka-python library to implement the producer, as part of this cluster. This Tornado app is responsible for:
- Data validation.
- Statistics collection.
- Basic data transformation.
- Queuing payloads to kafka to prepare them for step 3 below.
3. Payload processing
Our payload processing starts with a cluster of servers running Apache Storm. This cluster is running one single topology (a graph of spouts and bolts that are connected with stream groupings), which is where all the key stuff happens.
While Apache Storm is a Java based solution, all our code is using Python. To do this, we use the multi-lang feature offered by Apache Storm. This allows us to use some special Java based Spouts and Bolts which execute Python scripts with all our code. Those are long running processes which communicate over stdout and stdin following the multi-lang protocol defined by Apache Storm.
The cluster communication is done using Zookeeper (the coordination transport) so the output of one process may automatically end up on the process of another node.
At Server Density we have split up the processing effort into isolated steps, each implemented as an Apache bolt. This way we are able to parallelise work as much as possible. It also lets us keep our current internal SLA of 150ms for a full payload process cycle.
4. Kafka consumer
Here we use the standard KafkaSpout component from Apache Storm. It’s the only part of the topology that is not using a Python based implementation. What it does is connect to our Kafka cluster and inject the next payload into our Apache Storm topology, ready to be processed.
5. Enriching our payloads
The payload also needs some data from our database. This information is used to figure out some crucial things, like what alerts to trigger. Specialized bolts gather this information from our databases, attach it to the payload and emit it, so it can be used later in other bolts.
At this point we also verify that the payload is for an active account and an active device. If it’s a new device, we check the quota of the account to decide whether we need to discard it (because we cannot handle new devices on that account), or carry on processing (and increase the account’s quota usage).
We also verify that the provided agentKey is valid for the account it was intended for. If not, we discard the payload.
6. Storing everything in metrics
Each payload needs to be split up in smaller pieces and normalized in order to be stored in our metrics cluster. We group the metrics and generate a JSON snapshot every minute. That snapshot lasts five days. We also store metrics in an aggregated data format once every hour. That’s the permanent format we keep in our time series database.
7. Alert processing
In this step we match the values of the payload against the alert thresholds defined for any given device. If there is a wait time set, the alert starts the counter and waits for subsequent payloads to check for its expiration.
When the counter expires (or if there was no wait value to begin with), we go ahead and emit all the necessary data to the notification bolt. That way, alerts can be dispatched to users based on the preferences for that particular alert.
Once we’ve decided that a particular payload (or absence of it) has triggered an alert, one of our bolts will calculate which notifications need to be triggered. Then we’ll send one http request per notification to our notifications server, another tornado cluster (we will expand on the inner workings of this in a future post. Stay tuned).
Everything happens in an instant. Agents installed on 80,000 servers around the world send billions of metrics to our servers. We rely on Python (and other technologies) to keep the trains running, and so far we haven’t been disappointed.
We hope this post has provided some clarity on the various moving parts behind our alerting logic. We’d love to hear from you. How do you use Python in your mission critical apps?
Tech chat: processing billions of events a day with Kafka, Zookeeper and Storm