The exciting adventures of an alert notification
Written by David Mytton
A year ago, the alert and notification systems for our server monitoring service, Server Density, were very simple. They were based on batch cron jobs which processed all the items in a database table every minute.
Since then, we have grown significantly and that approach would no longer work. We now have a very robust alert and notification backend which can easily be scaled just by adding new servers. It’s quite interesting from a technical standpoint, so this is the exciting story of the adventure an alert notification takes through our systems to your inbox.
1: Agent sends a postback
Our monitoring agent reports back every 60 seconds. The stats payload is sent over HTTP (or HTTPS) as a JSON object and is immediately inserted into the database to display the latest data on the dashboard, through the monitoring API and on our graphs. The data is also stored in a postbacks capped collection inside MongoDB. A separate process transfers these JSON payloads from the postbacks collection into our RabbitMQ alertdetection queue. The web server does not queue directly to RabbitMQ because the various PHP AMQP libraries we tried caused too much load on the web server.
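The hand-off described above can be sketched roughly as follows. This is a minimal illustration using only in-memory stand-ins, not our actual code: a bounded deque plays the role of the capped postbacks collection and a queue.Queue plays the role of the RabbitMQ alertdetection queue. The function names, the sequence counter and the payload fields (server, load) are assumptions made for the example.

```python
import itertools
import json
import queue
from collections import deque

# Stand-in for the capped "postbacks" collection: a bounded buffer that
# discards its oldest entries once full (size chosen arbitrarily here).
postbacks = deque(maxlen=1000)

# Stand-in for the RabbitMQ "alertdetection" queue.
alertdetection = queue.Queue()

# Hypothetical monotonically increasing id so the transfer process can
# remember how far it has read.
_sequence = itertools.count()


def store_postback(raw_json):
    """Web server side: a cheap local write, no AMQP connection needed."""
    postbacks.append((next(_sequence), raw_json))


def transfer_pending(last_forwarded):
    """Separate process: forward every payload newer than last_forwarded."""
    for seq, raw_json in list(postbacks):
        if seq > last_forwarded:
            alertdetection.put(raw_json)
            last_forwarded = seq
    return last_forwarded


store_postback(json.dumps({"server": "web1", "load": 4.2}))
store_postback(json.dumps({"server": "web2", "load": 0.3}))
last_forwarded = transfer_pending(-1)
```

Keeping the queue insert out of the web request means a slow or failing broker connection cannot hold up the agent's postback.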
2: Is there an alert condition?
We have multiple RabbitMQ consumers listening to the queue waiting for new items. One of these sees there’s a new alertdetection item and pulls it down. The message pulled from the queue contains the same raw JSON payload. The data is then parsed and compared to all configured alerts to see if there is an alert condition match. In this case, load is a bit too high and so triggers an alert.
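To make the comparison step concrete, here is a small sketch of how a consumer might check one payload against a list of configured alerts. The alert structure (field, op, threshold) and the payload field names are hypothetical, chosen only to illustrate the idea.

```python
import json
import operator

# Map the comparison named in an alert's config to a Python function.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}

# Hypothetical alert configuration for one server.
ALERTS = [
    {"id": "high-load", "field": "load", "op": ">", "threshold": 2.0},
    {"id": "low-memory", "field": "mem_free_mb", "op": "<", "threshold": 100},
]


def matching_alerts(raw_json, alerts):
    """Parse the raw payload and return the ids of alerts whose condition matches."""
    payload = json.loads(raw_json)
    matches = []
    for alert in alerts:
        value = payload.get(alert["field"])
        if value is None:
            continue  # this payload does not report the field at all
        if COMPARATORS[alert["op"]](value, alert["threshold"]):
            matches.append(alert["id"])
    return matches


message = json.dumps({"server": "web1", "load": 4.2, "mem_free_mb": 512})
triggered = matching_alerts(message, ALERTS)
```

Here the load of 4.2 exceeds the threshold of 2.0, so only the high-load alert matches.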
3: An alert is triggered
Alerts can have a delay so we need to check the configuration to see if we should alert right away. In this case, we do so. The alert is set to be sent via e-mail and iPhone push notification so 2 queue items are entered into the emailalerts RabbitMQ queues.
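A rough sketch of the delay check and fan-out might look like this. The queues are in-memory stand-ins for the RabbitMQ notification queues, keyed by channel; the delay_seconds and channels fields and the trigger function are assumptions made for the example.

```python
import queue

# In-memory stand-ins for the notification queues, one per channel.
# The channel keys are hypothetical labels, not real queue names.
NOTIFICATION_QUEUES = {
    "email": queue.Queue(),
    "iphone": queue.Queue(),
}


def trigger(alert, first_matched_at, now):
    """Queue one item per channel once the alert's delay has elapsed.

    Returns the number of queue items created (0 while still waiting).
    """
    if now - first_matched_at < alert["delay_seconds"]:
        return 0  # condition has not persisted long enough yet
    for channel in alert["channels"]:
        NOTIFICATION_QUEUES[channel].put(
            {"alert_id": alert["id"], "channel": channel}
        )
    return len(alert["channels"])


# No delay configured: both notifications are queued immediately.
alert = {"id": "high-load", "delay_seconds": 0, "channels": ["email", "iphone"]}
queued = trigger(alert, first_matched_at=1000, now=1000)
```

An alert with a 300 second delay that first matched 60 seconds ago would return 0 and queue nothing.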
4: Notifications are sent
Different consumers listening to the notification queues pick up the new queue items entered. The iPhone alert payload is built and sent to the Apple Push Notification service whilst the e-mail message is also constructed by a separate process, and then the Postmark API is called. The e-mail data is sent to Postmark to be queued and delivered.
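As an illustration of what those consumers build, the sketch below constructs the two payloads without sending anything. The shapes follow the publicly documented Apple Push Notification payload (an aps dictionary) and Postmark's JSON e-mail endpoint as we understand them, so check the current documentation before relying on the details. The function names and example values are hypothetical.

```python
import json


def build_apns_payload(message):
    """Build the JSON body for an iPhone push notification."""
    return json.dumps({"aps": {"alert": message, "sound": "default"}})


def build_postmark_request(sender, recipient, subject, body, server_token):
    """Return (url, headers, json_body) for a Postmark send-email call."""
    headers = {
        "Accept": "application/json",
        "Content-Type": "application/json",
        "X-Postmark-Server-Token": server_token,
    }
    json_body = json.dumps(
        {"From": sender, "To": recipient, "Subject": subject, "TextBody": body}
    )
    return "https://api.postmarkapp.com/email", headers, json_body


push = build_apns_payload("web1: load is 4.2")
url, headers, email = build_postmark_request(
    "alerts@example.com",
    "you@example.com",
    "Alert: high load on web1",
    "Load is 4.2, above the configured threshold.",
    "your-server-token",
)
```

Building the payload separately from sending it keeps each consumer easy to test without touching the network.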
5: The problem hasn’t been fixed
The alert is configured to alert every 5 minutes until the alert condition disappears. Every time the stats postback comes in, we run the comparisons and check already triggered alerts to see if there’s anything we need to do. 5 minutes later we see that the alert is still open and new notifications are triggered.
6: All is well
Shortly afterwards, a postback comes in with the alert condition fixed – load is back down again. We mark the alert fixed and send notifications to tell the user that all is well again.
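Steps 5 and 6 amount to a small state machine that runs on every postback. The sketch below is one way to express it, assuming a 5 minute repeat interval and ignoring the initial delay from step 3; the state dictionary and action names are made up for the example.

```python
REPEAT_SECONDS = 5 * 60  # re-notify every 5 minutes while the alert is open


def handle_postback(state, condition_met, now):
    """Update one alert's state for a new postback and return the action taken."""
    if condition_met:
        if not state["open"]:
            # First time we see the condition: open the alert and notify.
            state["open"] = True
            state["last_notified_at"] = now
            return "notify_open"
        if now - state["last_notified_at"] >= REPEAT_SECONDS:
            # Still open and the repeat interval has passed.
            state["last_notified_at"] = now
            return "notify_repeat"
        return "none"
    if state["open"]:
        # Condition has cleared: close the alert and tell the user.
        state["open"] = False
        return "notify_fixed"
    return "none"


# Simulate postbacks arriving every 60 seconds: load stays high for
# seven postbacks, then drops back down on the eighth.
state = {"open": False, "last_notified_at": None}
readings = [True] * 7 + [False]
actions = [
    handle_postback(state, met, minute * 60)
    for minute, met in enumerate(readings)
]
```

In this run the alert opens on the first postback, repeats once at the 5 minute mark and is marked fixed on the last postback.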
But what if we stop receiving data?
If your server stops reporting back then there’s no payload to trigger the alert process. As such, we run a separate set of consumers which constantly check to see if we have data from your server and if we’ve stopped receiving postbacks, we’ll trigger the no data alerts after the time period defined in the alert configuration.
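The no data check can be sketched as a comparison between the time of each server's last postback and its configured time period. The data structures and function name below are assumptions for illustration; a real consumer would read these values from the database.

```python
def servers_without_data(last_seen, no_data_after, now):
    """Return the servers whose last postback is older than their threshold.

    last_seen:     server name -> timestamp of the most recent postback
    no_data_after: server name -> seconds of silence before alerting
    """
    silent = []
    for server, seen_at in last_seen.items():
        if now - seen_at >= no_data_after[server]:
            silent.append(server)
    return sorted(silent)


now = 1030
last_seen = {"web1": 1000, "web2": 400}
no_data_after = {"web1": 300, "web2": 300}
missing = servers_without_data(last_seen, no_data_after, now)
```

Here web1 reported 30 seconds ago and is fine, while web2 has been silent for 630 seconds and would trigger a no data alert.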
It takes seconds
The whole journey, from the postback payload coming in to the notifications being delivered, takes only seconds because we can easily scale out the number of consumers running and processing queue items. Every action is logged and these are exposed in the alert log within the Server Density UI so you can see the times between events.
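Scaling out works because consumers compete for items on a shared queue, so adding consumers adds throughput without any coordination between them. The sketch below shows the pattern with threads and an in-memory queue standing in for RabbitMQ; unlike real consumers, which block waiting for new items, these workers exit once the queue is drained. All names are hypothetical.

```python
import queue
import threading


def run_consumers(items, worker_count):
    """Process every item exactly once using worker_count competing consumers."""
    work = queue.Queue()
    for item in items:
        work.put(item)

    processed = []
    lock = threading.Lock()

    def consume():
        while True:
            try:
                item = work.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            with lock:
                processed.append(item)

    threads = [threading.Thread(target=consume) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return processed


results = run_consumers(range(100), worker_count=4)
```

Each item is taken by exactly one worker, so the order of results may vary between runs but nothing is processed twice.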
We are always working on improving this and one of the items on our roadmap is to combine the 2nd step so that the alert triggering bypasses the database and can get inserted into the queue immediately. Unfortunately, the various PHP AMQP libraries available aren’t robust enough to handle that many inserts (connection pooling is the main thing missing), so we’re investigating other queuing systems and methods of handling the high number of inserts.