Using Celery for queuing requests
CEO & Founder of Server Density.
Published on the 25th April, 2013.
For Server Density v2 we’ve taken a lot of time to redesign how we handle our website monitoring, particularly in relation to how we avoid false positives. We’ve also added much more control and visibility into how our queuing system works because the current one is almost like a black box because ZeroMQ is not great at being transparent with regards to what’s actually happening.
Every minute our monitoring nodes issue requests to a public URL/IP, usually a website. They question whether whatever you’re monitoring is running, provide you with response time statistics and gather any other useful information. Oh, and if it’s down – we’ll let you know (even if it is 2am).
Just because your website is up in London, it doesn’t necessarily mean that it’ll perform the same to visitors in Australia. So we monitor performance from locations all over the world. This provides you with an overview of how your service is performing – regardless of visitor location.
To deliver you all of this information we decided to create a director-actor (server-client) infrastructure. The ‘director’ within our data centre, and the ‘actor’ being a machine placed somewhere in a distant location like Sydney (~16,000 km from London). The ‘director’ propagates checks for each of the ‘actors’ locations, which perform the requested check and return the result to the ‘director’.
To facilitate the director-actor (server-client) architecture, we chose a message broker software to act as a transport layer between both entities. For this we’ve used the following tool set: Python, Celery, RabbitMQ and Gevent.
We’ve previously created our own basic queuing system using MongoDB but given Celery supports MongoDB as a backend, we started using that. We know how to scale Mongo in production and it has excellent replication/failover capabilities, but we hit some problems which forced us to use RabbitMQ instead.
If you consider HTTP checks for the POST method, it may be that some users send sensitive data, or use some sort of authorisation for the targeted URL (or both). The monitoring nodes are hosted with various vendors around the world and communicate with our central directors over the internet rather than our secure, internal network. This means we need to ensure communication is secure, which either means a VPN or SSL. We didn’t really want to deal with managing a VPN of a large number of nodes around the world so decided to investigate SSL transports instead.
Celery is smart and robust tool which is really easy to extend. But being in the stage of prototyping means that we do things fast and adjust later, this way we decided to go with the RabbitMQ broker with had full support for SSL encryption inside Celery backends, also MongoDB as a backend was still considered an a experimental option. Have a look inside the Celery docs to decide which broker will work best for you.
We do intend to switch to using MongoDB for the Celery backend eventually, and have submitted a patch to add SSL support to Celery. RabbitMQ is working well but it’s another component in our architecture and given our existing experience with MongoDB, we prefer to reduce the number of 3rd party tools where possible. It also has better failover support.
It was better to solve this problem by looking into what we didn’t want to do as opposed to what we did. We knew we didn’t want to do raw socket connection, write our own broker or worry too much about scalability at this stage. This is the solution that we chose as a result. Admittedly, the tech used may well change in the future – but it’s solving the problem well right now, and we’re still experimenting with it.
The diagram above presents a simplified structure of the current architecture layout for Celery queues. Checks are described by the location on which we are going to execute them. We allow users to create checks from more than one location, so the same URL will often land on different queues at the same time.
The director assembles a check from the information provided by a user. In a URL check for example, that information could be: target URL, check method (GET, POST, PUT, DELETE), payload, timeout, etc. Information in this form will be serialised and allocated to the right location queue.
Server Density currently supports 9 locations with more on the horizon.
At the other end of the process (a couple of thousands kilometres away), our actors are working hard to handle the check location queues. They’re designed to take traveling checks from the queue, interpret the message received and perform the requested actions.
We were concerned about how to handle things here. Primarily because we don’t want to evaluate status just by doing one URL check, here’s why:
>>> import timeit >>> timeit.repeat( ... "requests.get('http://example.com')", ... "import requests", ... number=1, ... repeat=5)
[0.9308030605316162, 2.1385269165039062, 3.0700418949127197, 6.3459131717681885, 1.9396629333496094]
Previously Server Density did one check every minute but we found this had its problems. There are hundreds of things that can go wrong when requests are being sent around the world, alerting users that a website was down for example.
As a result v2 now sends 3 checks each time and records the fastest response – not the average. One check out of the three could timeout for example bringing the entire average down. In this use case people want to see the shortest response time figure. This will be put on the responses queues and send back to the dDirector, which then acts upon the information. We still store the results of all checks and these will be exposed in the UI for the user.
Since we are using Celery gevent we can benefit from the fact that one queue can be handled by more than one worker (and vice versa). So what are the implications of this? If we see any scale related problems we have few options:
- Increase the concurrency level. We are using gevent and can therefore benefit from greenlets.
- Increase the number of processes. Python has this thing called GIL, right? By deploying another process you can try to bypass this and force your box to do things in parallel.
- Increase the number of machines. Install more workers with information on which queue(s) to consume.
The most important change is the check methodology – doing 3 checks per location. Further, every location executes the check at the same time so you can see the response time from every location at once, rather than one at a time. Instead of having to do a lot of the queue management functionality ourselves within ZeroMQ, using Celery we can pass that over to the library and dedicate more time to check methodology. It also provides more ways to inspect the queue status and ties into our own monitoring so we know sooner when things are going wrong.