What causes delays in software projects
For the last few weeks I’ve been testing a new storage backend for the server monitoring time series data we collect in Server Density, used to power our graphs and historical data API. The new system solves a number of scaling issues we’ve been having with timeouts over longer date ranges and allows us to store more data.
Instead of summarising the 1 minute data points after 1 hour, we will keep minute-by-minute data for 2 weeks for all customers before it starts being summarised. Optional extensions of that retention (up to 1 year) can also be purchased. We keep all data forever; it just gets summarised up to hourly, and then daily, after the initial period.
The new system involves a queuing layer which allows us to survive data store outages without losing data – it just gets inserted later. The old system is in PHP but, as with all our new development, we’re now using Python. Celery handles the queuing and since the system operates as an internal web service, Tornado handles the web endpoints with Nginx sitting in front of it as a load balancer.
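Concretely, the pieces fit together something like the sketch below. This is a minimal illustration rather than our production code – the broker URL, the store_metrics task, the MetricsHandler class and the port are all hypothetical stand-ins:

```python
from celery import Celery
import tornado.ioloop
import tornado.web

# Hypothetical broker URL; any Celery-supported broker works here.
celery_app = Celery('metrics', broker='redis://localhost:6379/0')

@celery_app.task
def store_metrics(payload):
    # Write the time series data to the backing store. Retry logic for
    # store outages would live here; until a worker succeeds, items
    # simply wait in the broker queue, so nothing is lost.
    ...

class MetricsHandler(tornado.web.RequestHandler):
    def post(self):
        # Queue the write and acknowledge immediately, so a slow or
        # unavailable data store never blocks the web endpoint.
        store_metrics.delay(self.request.body.decode('utf-8'))
        self.set_status(202)

application = tornado.web.Application([
    (r'/metrics', MetricsHandler),
])

if __name__ == '__main__':
    # Nginx sits in front, load balancing across one or more of these.
    application.listen(8888)
    tornado.ioloop.IOLoop.instance().start()
```

The key property is that the endpoint only ever talks to the queue, which is what lets the system survive data store outages.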
We’ve been working on this for several months and hoped to get it out sooner, but there have been delays caused by problems that only showed up under production traffic. As we already use a web service architecture, I have been able to drop in this complete rewrite with no impact on the existing clients – the internal APIs remain the same. That has meant I’ve been able to test against production traffic, because the data can be mirrored.
It’s not until you run real traffic through a system over a period of time that you find problems like these:
- The default logging in Celery is extremely verbose. Our logs filled up very quickly because an INFO line is written for every processed queue item; raising the worker log level above INFO (via the --loglevel option) quiets this.
- The default connection pool size in the Python requests library is too low: just 10 connections. This is insufficient for high volume traffic because the pool is quickly exhausted, producing “HttpConnectionPool is full, discarding connection” errors. I needed to implement a custom HTTP adapter with a larger pool to fix this (see the first sketch after this list).
- Some of our client code is still in PHP. We use a pseudo-asynchronous method of posting where the HTTP connection is closed immediately and the response ignored. This works fine with Apache (our old setup), but Nginx thinks the client has closed the connection and terminates the internal request. The initial workaround was to quickly loop through the response. However, as we increased the load that workaround fell apart: the code isn’t truly asynchronous, so even though each request returned quickly (within 3ms), the load on our servers increased significantly, and when we simulated slowdowns the entire server would grind to a halt waiting for responses. So instead we changed the Nginx config to ignore client aborts (see the Nginx snippet after this list, and our public Nginx Puppet manifest). Eventually we’ll remove all our PHP client code and use true asynchronous posts.
- We found a couple of bugs in Celery. One has been fixed; for the other we have a manual workaround while we continue investigating.
- We tested a small 32GB SSD for the persistent data storage, but it turned out to be the wrong trade-off: tasks are usually processed within a second (and so are handled in memory), which makes the SSD’s speed unnecessary, while 32GB is too small to hold more than a few minutes of backlog if the data store goes down and the queue backs up. So we replaced the SSD with a large SATA2 drive – less speed we didn’t need, more capacity we did.
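For the connection pool problem above, the fix is to mount a requests HTTPAdapter with a larger pool onto a shared Session. A minimal sketch – the pool sizes are illustrative and the internal URL is hypothetical:

```python
import requests
from requests.adapters import HTTPAdapter

session = requests.Session()

# The default urllib3 pool holds 10 connections per host; under high
# volume the pool is exhausted and surplus connections get discarded.
adapter = HTTPAdapter(pool_connections=100, pool_maxsize=100)
session.mount('http://', adapter)
session.mount('https://', adapter)

# Every request made through this session now shares the larger pool.
response = session.post('http://internal-api.example/metrics',
                        data={'payload': '...'})
```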
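And for the client-abort problem, the relevant directive is proxy_ignore_client_abort, which tells Nginx to finish the proxied request even when the client hangs up without reading the response. Roughly, with a hypothetical upstream name:

```nginx
location /metrics {
    proxy_pass http://tornado_backends;

    # Keep the request to Tornado running even if the PHP client
    # closes its connection immediately after posting.
    proxy_ignore_client_abort on;
}
```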
This project is fairly simple in terms of the amount of code – just a few hundred lines – and has very few moving parts (an API endpoint plus task processing). The complexity comes from the libraries we’re using and how they (and our own code) perform under heavy load. This is what causes the real delays in software projects. And for our customers, we hope to roll this new backend out very soon!