Server Density Blog

Interesting devops tech stuff


Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. MongoDB Benchmarks


    There are no official MongoDB benchmarks because the developers don’t believe they accurately represent real world usage. This is true because you can only really get an idea of performance when you’re testing your own queries on your own hardware. Raw figures can seem impressive but they’re not representative of how your own application is likely to perform. Benchmarks are useful for indicating how different hardware specs might perform but are really only worth it if you use real world queries.

    For Server Density v2 I have been benchmarking MongoDB with different tweaks so we can get maximum performance for our high throughput clusters, but make cost savings for our less important systems. A lot has been said about various choices of write concern, deploying to SSDs and replication lag but there aren’t really any numbers to base your decision on.

    This set of MongoDB benchmarks is not about the absolute numbers but is designed to give you an idea of how each of the different options affects performance. Your own queries will differ but the idea is to prove general assumptions and principles about the relative differences between each of the write options.

    Test methodology

    These MongoDB benchmarks test various options for configuring and querying MongoDB. I wrote a simple Python script to issue 200 queries and record the execution time for each. It was run with Python 2.7.3 and Pymongo 2.5 against MongoDB 2.4.1 on an Ubuntu Linux 12.04 Intel Xeon-SandyBridge E3-1270-Quadcore 3.4GHz dedicated server with 32GB RAM, Western Digital WD Caviar RE4 500GB spinning disk and Smart XceedIOPS 200GB SSD.

    The script was run twice, taking the results from the second execution. This avoids the slowdown caused by initially allocating files, collections, etc. MongoDB only creates databases when they’re first written to, which adds a bit of time to the first call but isn’t really relevant in real world usage.

    
    import time
    import pymongo
    m = pymongo.MongoClient()
    
    doc = {'a': 1, 'b': 'hat'}
    
    i = 0
    
    while (i < 200):
    
    	start = time.time()
    	m.tests.insertTest.insert(doc, manipulate=False, w=1)
    	end = time.time()
    
    	executionTime = (end - start) * 1000 # Convert to ms
    
    	print executionTime
    
    	i = i + 1
    
    

    This is a dummy document because I'm not trying to simulate a real application here. Document size, number/size of indexes and the type of operation will all play a part in the actual numbers. This is only testing inserts but there are other optimisations you can make with updates, particularly ensuring documents don't grow. However, this is sufficient for what I'm trying to show in these tests - the relative difference between the write options.
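    The per-query timings printed by the script can be reduced to the average, min and max figures quoted in the results. A minimal sketch of that step, assuming a hypothetical `time_calls` helper and a stand-in operation in place of the real insert (neither is part of the original script), and using `time.perf_counter` rather than `time.time`:

```python
import time


def time_calls(operation, runs=200):
    """Call operation() `runs` times and return each duration in ms."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        operation()
        end = time.perf_counter()
        timings.append((end - start) * 1000)  # Convert to ms
    return timings


def summarise(timings):
    """Return (average, minimum, maximum) for a list of timings."""
    return sum(timings) / len(timings), min(timings), max(timings)


# Stand-in for the insert call (an assumption for illustration);
# swap in the real write to benchmark it.
timings = time_calls(lambda: sum(range(100)))
average, minimum, maximum = summarise(timings)
print("avg %.4fms min %.4fms max %.4fms" % (average, minimum, maximum))
```

    Keeping the timing loop separate from the operation makes it easy to rerun the same measurement with a different write option.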

    Write concern

    The write concern allows you to trade write performance against knowing the status of the write. If you're doing high throughput logging but aren't concerned about possibly losing some writes (e.g. if the mongod crashes or there is a network error) then you can set the write concern low. Your write calls will return quickly but you won't know if they were successful. The write concern can be dialed up to include error handling (the default) so the write will be acknowledged (though not necessarily safe on disk).

    It's important to know that an acknowledgement is not the same as a successful write - it simply gives you a receipt that the server accepted the write to process. If you need to know that writes were actually successful one option is to require confirmation the write has hit the journal. This is essentially a safe write to the single node with the option to go further to request acknowledgement from replica slaves. It's much slower to do this but guarantees your data is replicated.

    MongoDB insert() Performance (w flag)

    • w=0 is the fastest way to issue writes, with an average execution time of 0.07ms, max of 0.11ms and min of 0.06ms. This setting disables basic acknowledgment of write operations, but returns information about socket exceptions and networking errors to the application.
    • w=1 takes roughly double the time to return, with an average execution time of 0.13ms, max of 0.32ms and min of 0.11ms. This guarantees that the write has been acknowledged but doesn't guarantee that it has reached disk (the journal), so there is still potential for the write to be lost - there's a 100ms window where the journal might not be flushed to disk. Setting j=1 protects against this.
    • j=1 (spinning disk) is more than two orders of magnitude slower than w=1 (roughly 260x), with an average execution time of 34.19ms, max of 34.28ms and min of 34.10ms. The mongod will confirm the write operation only after it has written the operation to the journal. This confirms that the write operation can survive a mongod shutdown and ensures that the write operation is durable.
    • j=1 (SSD) is about 3x faster than j=1 on a spinning disk, with an average execution time of 11.18ms, max of 11.24ms and min of 11.11ms.
    • There is an interesting ramp up for the initial few queries every time the script is run. This is likely to do with connection pooling and opening the initial connection to the database, whereas subsequent queries can use the already open connection.
    • Some spikes appear during the script execution. This could be the connection closing and being recreated.

    This means that you can reasonably use the default w=1 as a safe starting point but if you need to be sure data has gone to a single node, j=1 is the option you need. And for high throughput you can halve query times by going down to w=0.

    SSD vs Spinning Disk

    It's a safe assumption that SSDs will always be faster than spinning disks, but the question is how much - and is that worth paying for them? The more data you store, the more expensive the SSD will be - higher capacity SSDs are available but they are fairly cost prohibitive. However, MongoDB supports storing databases in directories which can be mounted to their own devices, giving you the option of putting certain databases on SSDs.

    Putting your journal on an SSD and then using the j=1 flag is a good optimisation. You need the --directoryperdb config flag and you can then mount the databases on their own disks. The journal is always in its own directory so you can mount it separately without any changes if you wish.

    MongoDB insert() Performance (j flag)

    Replication

    If you specify a number greater than 1 for the w flag then that many nodes (the primary plus w-1 replica slaves) must acknowledge the write before the query completes. I tested this in a 4 node replica set with the primary and a slave in the same data centre (San Jose, USA) as the execution script and the remaining 2 nodes in a different data centre (Washington DC, USA).

    The average round trip time between the nodes in the same data centre is 0.864ms and between different data centres is 71.187ms.

    MongoDB insert() Performance (w > 1 flag)

    • w=2 required acknowledgement from the primary and one of the 3 slaves. Average execution time was 14ms, max of 867ms and min of 1.6ms.
    • w=3 required acknowledgement from the primary plus 2 slaves. Average execution time was 311ms, max of 1,329ms and min of 97ms. The killer here is the range in response times, which are affected by network latency + congestion, communication overhead between 3 nodes and having to wait for each one.

    Using an integer for the w flag lets MongoDB decide which nodes must acknowledge. My replica set has 4 nodes and I specified 2 and 3 but I didn't get to choose which ones were part of the acknowledgement. This could be local slaves but could also be remote, which is probably responsible for the range in response times where a remote slave happened to return faster than the local one. More control is possible using tags.

    Conclusion

    It's fairly clear that these MongoDB benchmark results validate the general assumptions that SSDs are faster and there is a fairly variable latency involved with replicating over a network, particularly over long distances. What this experiment shows is the differences between the write concern options so you can make the right tradeoff between durability and performance. It also highlights that you can significantly improve performance if you need the journal based durability by adding SSDs.

    MongoDB benchmarks raw results

              w=0     w=1     j=1 Spinning  j=1 SSD  w=2 Same DC  w=3 Multi-DC
    Average   0.07ms  0.13ms  34.19ms       11.18ms  14.26ms      311ms
    Min       0.06ms  0.11ms  34.10ms       11.11ms  1.65ms       97ms
    Max       0.11ms  0.32ms  34.28ms       11.24ms  867.29ms     1,329ms
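    Since the point of these numbers is the relative difference between the options, it can help to express each average as a multiple of the fastest one. A rough sketch using the averages from the raw results table; the `relative_cost` helper is a hypothetical name introduced for illustration:

```python
# Average insert times in ms, copied from the raw results table.
averages_ms = [
    ("w=0", 0.07),
    ("w=1", 0.13),
    ("j=1 spinning", 34.19),
    ("j=1 SSD", 11.18),
    ("w=2 same DC", 14.26),
    ("w=3 multi-DC", 311.0),
]

baseline = averages_ms[0][1]  # w=0 is the fastest option


def relative_cost(average, baseline):
    """How many times slower than the baseline an option is."""
    return average / baseline


for name, average in averages_ms:
    ratio = relative_cost(average, baseline)
    print("%-14s %8.2fms  %7.1fx" % (name, average, ratio))
```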

    MongoDB Benchmarks


  2. Making a point with SLAs


    SLAs are generally financially irrelevant because they typically cap any compensation at the total spend under the claim period e.g. if you spend $100 a month you can claim up to $100. When your spend is significant then it’s nice to get the money back but it doesn’t necessarily reflect the full cost of any service interruption.

    Instead, I see SLAs as useful for two reasons:

    To signal commitment to a level of service

    The level of SLA a company provides indicates its confidence in its own reliability and infrastructure. Traditionally, Rackspace have a very good reputation in managed hosting for uptime; something reflected by their 100% uptime guarantee. But it’s important to look a little further to see what caveats they have: does it include scheduled maintenance (almost never), how do they define uptime and do they differentiate between public/private network connectivity vs power failures as different parts of infrastructure with different SLAs?

    This is quite standard so I think the second use is more important:

    To make a point about being unhappy with the service

    I generally make SLA claims even for the smallest outages and problems. It takes time and effort for a provider to process them and calculate the claim amount, as well as the actual payment, which all count as business costs. I’d bet that these are measured quite closely and so I use SLA claims as a method of making a point about the level of service – to encourage them to fix it.

    You can complain about an outage and speak to management but if you want someone to actually notice the issue, make it cost something!

  3. Devops London Demographics


    On 15th March I presented about scaling teams at DevOpsDays London (writeup / slides) and we sponsored the conference, including some sponsored tweets.

    There’s always a lot of activity around the official hashtag and Twitter provide some interesting analytics about the kinds of people interacting with your ads, which I thought it’d be interesting to share.

    • The most popular device used was desktop/laptop, with 59% of impressions.
    • Blackberry was the next most popular, followed by iOS and Android.
    • This indicates that laptops are still most popular, which isn’t surprising given the number of attendees you see with their laptops open.
    • What is surprising is Blackberry being the most popular single mobile platform (although iOS at 17% and Android at 5% are larger when combined). Is this because those most interested in learning about devops at a conference are from larger companies where a more traditional structure exists?
    • That said, those most likely to interact with the ad in some way (follow, link click, retweet) on mobile are iOS and Android users, compared to only 2% of Blackberry users. Yet desktop still remains the most popular with 76% of ad engagements.
    • The majority of impressions were from male Twitter users (65%) but female users were much more likely to interact with the ads, although I’m not sure how Twitter works out gender from your profile!

    Devops demographics

    These were taken from an ad we ran on Twitter searches for the #devops, #devopsdays and #devopslondon hashtags during the London DevOpsDays conference.

  4. Growing an ops team from 1 founder


    This post is based on a talk I gave at DevOpsDays London 2013; slides are at SlideShare.

    In the early days of 2009, it was just me running the Server Density monitoring infrastructure. The service came out of beta in the summer and immediately had a few paying customers which helped to fund the rental of a couple of slices from Slicehost (fancy VPSs). The volume of traffic, simplicity of the service components and small number of servers meant that there were few problems.

    Over the last 4 years the service has grown in terms of team members, data volume, customers and infrastructure so here are a few lessons from scaling the ops team and how things are run.

    Bootstrapping often means leaving things to the last minute

    Ideally you’ll anticipate problems and have a solution well in advance, but that’s not always possible. The most likely reason in the early days is cash; or lack of it.

    In August of 2009 I’d just completed our migration from MySQL to MongoDB and it still had problems with eagerly eating up disk space. This prompted setting up a new server with increased disk space because resizing a Slicehost instance would’ve meant some hours of downtime. It went down to the very last few bytes of remaining disk space as the sync completed and I took a snapshot of the df output:

    david@pan ~: df -a
    Filesystem           1K-blocks      Used Available Use% Mounted on
    /dev/sda1            156882796 148489776    423964 100% /
    proc                         0         0         0   -  /proc
    none                         0         0         0   -  /dev/pts
    none                   2097260         0   2097260   0% /dev/shm
    none                         0         0         0   -  /proc/sys/fs/binfmt_misc
    
    david@pan ~: df -ah
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sda1             150G  142G  415M 100% /
    proc                     0     0     0   -  /proc
    none                     0     0     0   -  /dev/pts
    none                  2.1G     0  2.1G   0% /dev/shm
    none                     0     0     0   -  /proc/sys/fs/binfmt_misc

    It also means trying to find the quickest way to do things

    Time is something you don’t have much of and one of the slowest things is transferring large quantities of data over the internet. We had an unexpected failure where we had to do a full resync of a MongoDB slave in a different data centre, which would’ve taken 6 days. Instead, we copied the data onto a USB disk drive and had UPS ship it to the other facility. Network transfer speeds worked out at around 5MB/s whereas UPS delivered at 11MB/s.
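    The comparison is simple arithmetic on the figures quoted above. As a rough sketch: the data size below is only implied by "6 days at around 5MB/s" and is not stated in the post, and the `transfer_days` helper is a hypothetical name:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def transfer_days(size_mb, rate_mb_per_s):
    """Days needed to move size_mb at a sustained rate in MB/s."""
    return size_mb / rate_mb_per_s / SECONDS_PER_DAY


# Figures from the post: ~5MB/s over the network would have taken 6 days.
network_rate = 5.0
network_days = 6.0

# Derived, not stated in the post: the data size those figures imply.
implied_size_mb = network_rate * network_days * SECONDS_PER_DAY

# Figure from the post: UPS worked out at ~11MB/s door to door.
ups_rate = 11.0
ups_days = transfer_days(implied_size_mb, ups_rate)

print("Implied data size: %.2f TB" % (implied_size_mb / 1e6))
print("UPS door to door: %.1f days" % ups_days)
```

    The effective rate for shipping a disk includes the time to copy data on and off the drive, so it is worth measuring the whole journey rather than just the courier leg.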

    Let other people help


    You really need at least one other person to be able to take on-call duties when you’re away but if that’s not possible or as a backup, you could make use of services provided by your hosting company or a 3rd party.

    We quickly moved from Slicehost to managed servers at Rackspace and they were able to do monitoring and respond to issues like servers down or services not running. They took special instructions for different scenarios and you could always phone them and ask them to perform certain actions. I remember several instances where I was away from my computer and was able to phone Rackspace support, asking them to perform some basic recovery actions whilst I got back online.

    Consider support contracts

    In addition to general sysadmin support from your hosting provider, you can buy commercial support contracts for the software products you’re using. This could be Ubuntu Linux, Nginx or MongoDB. Depending on the level of support you can get some pretty involved help when you need it most.

    However, they’re often very expensive and unaffordable as a startup. Even with the greater resources we now have, support contracts are aimed at enterprises with big budgets. One way to work around this is to be very involved with the projects you use. I was an early adopter of MongoDB and have a close relationship with 10gen, the company behind it, so am able to get good deals on support.

    Also consider what support you really need. Our support contract with MongoDB was well used in the early days because it was a new technology. It’s significantly more stable nowadays and other products, like Apache for example, we’ve never had an issue with.

    Figure out what you have to do and what can be outsourced

    I consider keeping core engineering in-house very important for technology/software companies but there are lots of things that need doing to run operations that could be outsourced to (trusted) individuals on an ad-hoc basis. Engineers are terrible at valuing their own time and often use the argument: “why pay for something I could build/install/configure myself”.

    Candidates for this are things like running through PCI compliance checklists, setting up centralised logging, reorganising servers (e.g upgrading base OSs), researching CDN providers, integrating CI tools, etc. You always want someone technical managing the project to keep things on track and validate the end results, but these are things you don’t need to do yourself.

    Hack traveling


    As part of the founding team and even as an engineer you’re likely to have to travel at some point – to conferences, meeting customers, pitching vendors…or maybe on holiday! It’s relaxing to be uncontactable on the plane but it’s also scary because you have no idea if everything is still running.

    On one of my trips to Japan, as soon as I stepped off the 12 hour flight to Tokyo Narita, I had a flood of SMS alerts as one of our MongoDB servers had encountered a problem 4 hours previously. One of our engineers had been assigned on call for my flight and had already worked with the guys at 10gen and resolved the problem.

    You’ll realise you become a slave to connectivity so trips to Japan are fine, but Tajikistan isn’t really an option. So you need to be able to get internet access and power anywhere you are – tricks such as visiting Starbucks, carrying external hotspots and not running things like updates when you’re away!

    Don’t forget the human aspect


    There are a lot of cool tools which help to automate processes, and these should be used as much as possible. However, it’s still real people running things in the end. This is the really difficult bit of having a small team because everyone has to pitch in and it can be difficult to share the workload when just a few people know how things work.

    You have to consider who will take the call when things break:

    • How quickly can they get to a computer they can use to fix things?
    • Are you out drinking on a Friday?
    • What happens if someone falls ill? This could be a minor cold or major emergency, and it could affect the individual engineer or their family members.
    • Does the on-call have enough phone battery? Can they hear their ringtone?
    • Who is backup if the primary doesn’t pick up?

    This is especially the case with outages. They often happen at inconvenient times and big incidents might require you to work for significant periods of time. Communicating with customers, fixing problems and recovering data can be exhausting, especially when there’s nobody else to help. The ultimate goal is to build your team so that shift based on-call cover can be provided but it’s difficult in the beginning with limited resources (both for people and multi-geographic redundancy).

    Nobody is as invested in your service as you and your team

    Although services like Rackspace’s support are helpful in certain situations, they’re never able to know the full story behind your service and how to deal with complex components. For example, MongoDB was a completely new database and didn’t have single server durability for some time. A bad shutdown could require a lengthy database repair, so it was important to take steps to avoid one, such as properly shutting the database down before powering off the server.

    Knowing about the weaknesses and how to deal with them requires a deeper knowledge of your setup than basic vendor support is going to provide. These services should be a stopgap or a supplement to the end goal of growing your own team.

    The whole point of devops is that it’s a mixture of engineering and operations so you don’t need to hire dedicated sysadmins. This works well for small startup teams but you will eventually want someone (or multiple people) who is responsible for the day to day operations. Engineers still engage with the team, can deploy, work on testing and debug problems but things like dealing with a failed disk drive or implementing backups are really outside the remit of devops in a large team.

    You know you’re there when you can start hiring “site reliability engineers”!

  5. Checking if a document exists – MongoDB slow findOne vs find


    One of the biggest optimisations to make MongoDB writes faster is to avoid moving documents by doing updates in place to a preallocated document. You only want to preallocate if the document doesn’t exist and to check this you need to do a query.

    The findOne() method seems like the right choice for this because you can query the relevant index and check to see if a document or None is returned, ideally using a covered index.

    However, it is significantly faster to use find() + limit() because findOne() will always read + return the document if it exists. find() just returns a cursor (or not) and only reads the data if you iterate through the cursor.

    So instead of:

    db.collection.findOne({_id: "myId"}, {_id: 1})

    you should use:

    db.collection.find({_id: "myId"}, {_id: 1}).limit(1)

    By making this change I saw an improvement in performance of 2 orders of magnitude:

    Query performance after switching from findOne() to find(). Can't even see the response time of the find() call.

    Same graph with the findOne series hidden. Less than 1ms response time.

  6. Building our London office – 1 year later


    In February 2012 we moved into our custom built Server Density office in London, UK. Having started construction in November 2011, we designed and fitted out a 3 story office for our London based design and marketing team, with our engineers remaining remote around Europe. Now we’ve been there for a year, I thought I’d provide some more photos and a writeup of some of the changes we’ve made based around what we’ve learnt.

    Boxed Ice office blueprints

    More internet!

    With initially just 4 people working in the office full time, an 8MB ADSL connection from BT was sufficient. However, as we’ve added more office based employees and especially when we have our regular days with the whole team in the office, there was a noticeable slowdown.

    I specifically chose a Draytek Vigor2830 ADSL2/2+ Firewall Router to allow us to add additional connectivity and we did this with a 50MB fibre connection from Eclipse Internet. The router balances traffic across the connections automatically and applies QoS to prioritise important traffic e.g. VoIP and web traffic vs large downloads (such as the iOS SDK!).

    This also gives us some redundancy as the ADSL and fibre networks are separate (at certain points, they still share exchange and curb to building infrastructure). The next step would be to upgrade the BT connection for additional speed and a leased line for redundancy. Unfortunately leased line pricing is expensive – from £300/m for 1MB.

    Office internet breakdown

    The Sun is annoying

    The Sun causes problems with glare, particularly for designers. We spent a lot of time thinking through the lighting in the office but even with that, having windows complicates things! We had to replace blinds a couple of times so we could balance blocking out all light with allowing some natural light in to provide a nice working environment. The blinds we chose have opacity options so we were able to buy matching blinds but adjust the opacity on a per window basis.

    Cover the walls

    For the first 6 months we had nothing on the walls, which was very modern and minimalist but not very interesting. So I asked for suggestions on Google+ and the team all posted ideas for different pictures and things to go on the walls.

    We now have a giant death star, a map of Japan, limited edition cycling prints, XKCD and Portal posters lining the walls. These have come from the whole team so they help express the company culture and interests.

    Some of these switch stickers also appeared around the place.

    Eject Switch

    Eating lunch together

    Being a design and engineering company, the office is very quiet during the day – people are in the zone and listening to music on headphones. We’re all constantly in chat and using Google+ but it’s easy not to talk to anyone in real life during the day. As such, every day everyone in the office has lunch between 1.30pm – 2.30pm where we sit at our conference table away from computers and talk. Sometimes this is about work, discussing ideas and thinking through problems but often it’s unstructured – cool things on Hacker News, science or other random topics!

    Recycling

    Although we spent a lot of time thinking about the most energy efficient way to build the office in terms of heating, lighting and building materials, for some reason the London Borough Council don’t provide recycling to business premises. We have had to pay a commercial company to provide bags and collect our recycling, which is quite a large volume given that modern packaging is quite good at being easily recyclable.

    Things get messy, quickly

    Even though we’re essentially paperless, there are still things that need to be out on desks and/or stored. Glasses, mugs, notepads, mobile test devices (iPhones, iPads, Android phones), keys, medication, food, etc. We purchased a few under desk cabinets and coat hangers to keep things tidy.

    We have a cleaner come once a week to do normal things like vacuum, tidy up, dust things, etc. Now we have quite a few people, once a week is not quite often enough but twice a week is too much. Cleaning is also disruptive and we experimented with early morning and late night schedules (weekend didn’t fit in with the cleaner’s schedule). Most of us arrive in the office around 10.30 – 11am and leave by 8pm so the cleaner comes around 9pm.

    Power usage

    We were able to optimise the office power consumption by switching off hidden equipment on timers – network routers/switches aren’t used for most of the day so they get turned off on timers.

    Batteries are also a surprising requirement. Wireless mice and keyboards run out of power very quickly and at inconvenient times!

    Overall

    Adding an office to an initially remote company has worked very well. All our infrastructure is already in place to work anywhere so there’s no feeling of those working from home being isolated from those in the office. We use Google+ and HipChat extensively and so get the benefits of being able to hire the best engineers anywhere in the world at the same time as having all designers in the same location, where they work together best.

    More photos

    Server Density office door

    Office entrance, conference table

    Upstairs floor with stairs barrier wall

    Upstairs design team area with map of Japan and Death Star

    Upstairs area with sun blinds

    Pair deploying new services for testing

    Portal posters into kitchen area

    Unused downstairs area ready for expansion

  7. How to configure nginx as a load balancer


    Last year we deployed our own load balancers using pound, but we are now transitioning to nginx because it is more actively developed and has much better features including caching and new support for web sockets.

    nginx is the frontend load balancer for v2 of our server monitoring service, Server Density, which is about to go into beta testing. However, it is already in production for internal services like the metrics backend (which powers our graphing).

    Basic nginx load balancer configuration

    You need two modules, both built into the nginx core: Proxy, which forwards requests to another location, and Upstream, which defines the other location(s). They should be available by default.

    Within your nginx.conf file you need to specify 2 blocks. The first of these is upstream which defines the nodes within the load balanced cluster:

    upstream web_rack {
        server 10.0.0.1:80;
        server 10.0.0.2:80;
        server 10.0.0.3:80;
    }

    Here you have 3 nodes with a web server listening on port 80. The group has been called web_rack. This is the destination for the proxy and this upstream module deals with distributing that proxied request across the defined nodes. There are different options for how the distribution works including defining nodes with higher priority and what happens if nodes are down.
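    With no other options set, nginx hands requests to the upstream nodes in round-robin order. The following toy model is only meant to illustrate that idea; it is not how nginx is implemented, it ignores weights and failed nodes, and the `round_robin` and `distribute` names are hypothetical:

```python
import itertools
from collections import Counter

# The three nodes from the web_rack upstream block.
WEB_RACK = ["10.0.0.1:80", "10.0.0.2:80", "10.0.0.3:80"]


def round_robin(nodes):
    """Return an iterator that yields the nodes in turn, forever."""
    return itertools.cycle(nodes)


def distribute(nodes, request_count):
    """Count how many of request_count requests each node receives."""
    picker = round_robin(nodes)
    return Counter(next(picker) for _ in range(request_count))


counts = distribute(WEB_RACK, 9)
for node in WEB_RACK:
    print(node, counts[node])
```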

    Next you tell the “vhost” about this upstream rack:

    server {
        listen 80;
        server_name www.example.com;
        location / {
            proxy_pass http://web_rack;
        }
    }

    This creates an equivalent of an Apache vhost listening on www.example.com port 80 and all requests are proxied to the web_rack, which then distributes them to the 3 nodes we have configured.

    The full nginx.conf file would look like this:

    http {
    	upstream web_rack {
    	    server 10.0.0.1:80;
    	    server 10.0.0.2:80;
    	    server 10.0.0.3:80;
    	}
    
    	server {
    	    listen 80;
    	    server_name www.example.com;
    	    location / {
    	        proxy_pass http://web_rack;
    	    }
    	}
    }

    More advanced options

    The docs for each module contain more examples of the options you can include but some of the ones we make use of are:

    nginx load balancer log formatting

    It can be useful for debugging to dump a load of info into the request logs. We use:

    log_format upstreamlog '[$time_local] $remote_addr - $remote_user - $server_name  to: $upstream_addr: $request upstream_response_time $upstream_response_time msec $msec request_time $request_time';

    so we can see where things are coming from and where they’re going, plus how long the response took.

    This goes in the http config block and you can find other variables in the docs.
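    Lines written in this format are straightforward to pull apart again when you want to analyse them. A minimal sketch of a parser for the `upstreamlog` format above; the sample line is made up for illustration and the `parse_upstream_log` name is hypothetical:

```python
import re

# Mirrors the upstreamlog format field by field.
UPSTREAM_LOG = re.compile(
    r"^\[(?P<time_local>[^\]]+)\] "
    r"(?P<remote_addr>\S+) - (?P<remote_user>\S+) - (?P<server_name>\S+)"
    r"\s+to: (?P<upstream_addr>\S+): (?P<request>.*) "
    r"upstream_response_time (?P<upstream_response_time>\S+) "
    r"msec (?P<msec>\S+) "
    r"request_time (?P<request_time>\S+)$"
)


def parse_upstream_log(line):
    """Return the fields of one log line as a dict, or None if no match."""
    match = UPSTREAM_LOG.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # request_time is a single number of seconds; upstream_response_time
    # is left as a string because it is not always a single number.
    fields["request_time"] = float(fields["request_time"])
    return fields


# Hypothetical sample line, not taken from a real log.
sample = (
    "[10/Apr/2013:12:00:00 +0000] 203.0.113.5 - - - www.example.com  "
    "to: 10.0.0.1:80: GET / HTTP/1.1 upstream_response_time 0.012 "
    "msec 1365595200.123 request_time 0.013"
)
parsed = parse_upstream_log(sample)
print(parsed["upstream_addr"], parsed["request_time"])
```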

    Conditional forwarding based on the HTTP Method

    We route GET and POST requests to different backends in certain situations e.g. our graphing is very write heavy so we have dedicated, separate processes dealing with POSTs and GETs to avoid contention. Nginx splits these up:

    location / {
        if ($request_method = POST)
        {
            proxy_pass http://post_rack;
            break;
        }
    
        proxy_pass http://get_rack;
    }

    This goes in the server config block.

    Proxy headers

    Again useful for debugging, we add headers to the proxied request so we can see where things are going and where they have come from, plus a timestamp for monitoring:

    proxy_set_header        Host $host;
    proxy_set_header        X-Real-IP $remote_addr;
    proxy_set_header        X-Forwarded-For $proxy_add_x_forwarded_for;
    proxy_set_header        X-Queue-Start "t=${msec}000";

    This goes in the http config block.
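
    On the backend, the X-Queue-Start header can be turned into a "time spent waiting before the application saw the request" figure. This is a rough Python sketch, assuming $msec is rendered as seconds with a millisecond fraction so the header looks like t=1365431943.123000; the function name is hypothetical and the clocks on the load balancer and backend are assumed to be in sync.

```python
import time


def queue_seconds(header_value, now=None):
    """Seconds between nginx stamping the request and `now`.

    `header_value` is the raw X-Queue-Start value, e.g. "t=1365431943.123000".
    """
    if not header_value.startswith("t="):
        raise ValueError("unexpected X-Queue-Start value: %r" % header_value)
    started = float(header_value[2:])
    if now is None:
        now = time.time()
    # Clamp at zero so small clock differences never give a negative wait.
    return max(0.0, now - started)


if __name__ == "__main__":
    print(queue_seconds("t=1365431943.123000", now=1365431943.623))
```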

    Deploying nginx with Puppet

    All the examples above are hand written but we actually make use of Puppet to configure everything, using our own fork to add some non-supported features like the conditional HTTP method routing, custom logging and SSL improvements (not mentioned above since they’re not load balancer specific but things like better cipher choices and HSTS headers).

    Read more about how we deploy nginx with Puppet or get our Puppet Nginx module from GitHub.

  8. MongoDB schema design pitfalls

    Leave a Comment

    One of the things that makes MongoDB easy to get started with is you don’t have to think about schema design – just shove data in and it’ll let you query it. That helps initial development and has benefits down the line when you want to change your document structure. That said…

    …so just like any database, to improve performance and make things scale, you still have to think about schema design. This has been well covered elsewhere (here, here and here) so here are some more in depth considerations to avoid the pitfalls of MongoDB schema design:

    1. Avoid growing documents

    If you add new fields or the size of the document (field names + field values) grows past the allocated space, the document will be written elsewhere in the data file. This has a hit on performance because the data has to be rewritten. If this happens a lot then Mongo will adjust its padding factor so documents will be given more space by default. But in-place updates are faster.

    You can find out if your documents are being moved by using the profiler output and looking at the moved field. If this is true then the document has been rewritten and you can get a performance improvement by fixing that (see below).

    2. Use field modifiers

    One way to avoid rewriting a whole document, and modify fields in place instead, is to specify only those fields you wish to change and use modifiers where possible. Instead of sending a whole new document to update an existing one, you can set or remove specific fields. And if you’re doing certain operations like increment, you can use their modifiers. These are more efficient both for the communication between your application and the database and for the operation on the data file itself.
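
    As a small illustration, here is a Python sketch that builds an update document using the $set, $unset and $inc modifiers rather than a full replacement document. The helper name and the field names are invented for the example; the resulting dict is what you would pass as the second argument to an update call in your driver.

```python
def build_update(set_fields=None, unset_fields=None, inc_fields=None):
    """Build a modifier-only update document.

    set_fields:   dict of field -> new value     ($set)
    unset_fields: iterable of fields to remove   ($unset)
    inc_fields:   dict of field -> amount to add ($inc)
    """
    update = {}
    if set_fields:
        update["$set"] = dict(set_fields)
    if unset_fields:
        # The value given to $unset is ignored by MongoDB; "" is conventional.
        update["$unset"] = {name: "" for name in unset_fields}
    if inc_fields:
        update["$inc"] = dict(inc_fields)
    return update


if __name__ == "__main__":
    # Hypothetical fields: only these three are sent, not the whole document.
    print(build_update(
        set_fields={"status": "ok"},
        unset_fields=["legacyFlag"],
        inc_fields={"checks": 1},
    ))
```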

    3. Pay attention to BSON data types

    A document could be moved even by changing a field data type. Consider what format you want to store your data in e.g. if you rewrite (float)0.0 to (int)0 then this is actually a different BSON data type, and can cause a document to be moved.

    4. Preallocate documents

    If you know you are going to add fields later, preallocate the document with placeholder values, then use the $set field modifier to change the actual value later. As noted above, be sure to preallocate the correct data type – beware: null is a different type!

    However, trigger the preallocation randomly because if you’re suddenly creating a huge number of new documents, that too will have an impact e.g. if you create a document for each hour, you want to do them in advance of that hour balanced over a period of time rather than creating them all on the hour.
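
    A minimal Python sketch of both ideas follows. It assumes a hypothetical document layout with one float slot per minute of the hour, and spreads creation out by only preallocating on a small random fraction of calls; the names and the layout are illustrative, not our actual schema.

```python
import random


def preallocated_hour_doc(doc_id):
    """Document with 60 placeholder slots, one per minute.

    Placeholders are floats (0.0) so a later $set with a float value
    does not change the BSON type of the field.
    """
    return {"_id": doc_id, "minutes": {str(m): 0.0 for m in range(60)}}


def should_preallocate_now(probability, rng=random):
    """Return True on roughly `probability` of calls.

    Calling this on each incoming write spreads the creation of next
    hour's documents across the current hour instead of a single burst.
    """
    return rng.random() < probability


if __name__ == "__main__":
    if should_preallocate_now(0.05):
        doc = preallocated_hour_doc("example-next-hour")
        print(len(doc["minutes"]), "slots preallocated")
```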

    5. Field names take up space

    This is less important if you only have a few million documents but when you get up to billions of records, field names have a meaningful impact on your data size because they are stored in every document. Disk space is cheap but RAM isn’t, and you want as much in memory as possible.
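
    To get a feel for the numbers, here is a back-of-the-envelope Python sketch. It counts only the bytes used by the field names themselves (BSON stores each name as a NUL-terminated string in every document) and ignores values and other overhead; the field names are made up.

```python
def field_name_bytes(field_names):
    """Bytes used per document by the names alone (UTF-8 plus a NUL each)."""
    return sum(len(name.encode("utf-8")) + 1 for name in field_names)


def savings_bytes(long_names, short_names, document_count):
    """Bytes saved across a collection by switching to shorter names."""
    per_document = field_name_bytes(long_names) - field_name_bytes(short_names)
    return per_document * document_count


if __name__ == "__main__":
    long_names = ["accountId", "serverId", "timestamp"]
    short_names = ["a", "s", "t"]
    saved = savings_bytes(long_names, short_names, 1000000000)
    print("%.1f GB saved" % (saved / 1e9))
```

    With these example names the saving is 23 bytes per document, which is negligible at a million documents and about 23 GB at a billion.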

    6. Consider using _id for your own purposes

    Every collection gets _id indexed by default so you could make use of this by supplying your own unique _id value instead of the default ObjectId. For example if you have a structure based on date, account ID and server ID like we do with our server monitoring metrics storage for Server Density, you can combine those into the _id value rather than having them each as separate fields. You can then query by _id with the single index instead of using a compound index across multiple fields.
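
    One possible way to build such a value is sketched below in Python. The separator, the field order and the date format are assumptions for the example rather than the format we use in production.

```python
from datetime import date


def metric_id(account_id, server_id, day):
    """Combine account, server and day into a single string for _id."""
    return "{0}:{1}:{2}".format(account_id, server_id, day.strftime("%Y%m%d"))


if __name__ == "__main__":
    # The document for one server on one day is fetched with a single
    # lookup on the default _id index, e.g. {"_id": metric_id(...)}.
    print(metric_id("acct42", "srv7", date(2013, 4, 8)))
```

    Whatever format you pick needs to be unique per document and reproducible from the values you have at query time.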

    7. Can you use covered indexes?

    If you create an index which contains all the fields you would query and all the fields that will be returned by that query, MongoDB will never need to read the data because it’s all contained within the index. This significantly reduces the need to fit all data into memory for maximum performance. These are called covered queries. The explain output will show indexOnly as true if you are using a covered query.
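
    The basic rule can be written down as a quick sanity check. This Python sketch is a simplification: it only checks that every queried and returned field is in the index and that _id is excluded from the projection unless it is indexed too. MongoDB has further conditions, so the explain output remains the authority. The function name is hypothetical.

```python
def looks_covered(index_fields, query_fields, projection):
    """Rough check for whether a query could be answered from one index.

    index_fields: field names in the index
    query_fields: field names used in the query
    projection:   dict of field -> 1/0, as passed to find()
    """
    indexed = set(index_fields)
    returned = set(name for name, include in projection.items() if include)
    # _id is returned by default unless the projection switches it off.
    if projection.get("_id", 1):
        returned.add("_id")
    return set(query_fields) <= indexed and returned <= indexed


if __name__ == "__main__":
    index = ["account", "day"]
    print(looks_covered(index, ["account"], {"day": 1, "_id": 0}))
    print(looks_covered(index, ["account"], {"day": 1}))
```

    The second call reports False because _id would be returned but is not part of the index, which is an easy mistake to make.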

    8. Use collections and databases to your advantage

    You can split data up across multiple collections and databases:

    • Dropping a whole collection is significantly faster than doing a remove() on the documents within it. This can be useful for handling retention e.g. you could split collections by day. A large number of collections usually makes little difference to normal operations, but does have a few considerations such as namespace limits.
    • Database level locking lets you split up workloads across databases to avoid contention e.g. you could separate high throughput logging from an authentication database.
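
    The per-day collection idea can be sketched in Python as follows. The metrics_ prefix and the helper names are assumptions for the example; each name returned by expired_collections would then be dropped as a whole collection rather than removing documents one by one.

```python
from datetime import date, timedelta

PREFIX = "metrics_"  # hypothetical naming scheme: metrics_YYYYMMDD


def collection_name(day):
    """Name of the collection holding data for `day`."""
    return PREFIX + day.strftime("%Y%m%d")


def expired_collections(existing_names, today, keep_days):
    """Daily collections older than the retention window, oldest first."""
    cutoff = collection_name(today - timedelta(days=keep_days))
    # YYYYMMDD is fixed width, so string order matches date order.
    return sorted(
        name for name in existing_names
        if name.startswith(PREFIX) and name < cutoff
    )


if __name__ == "__main__":
    names = ["metrics_20130401", "metrics_20130407", "metrics_20130408", "users"]
    print(expired_collections(names, date(2013, 4, 8), keep_days=3))
```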

    Test everything

    Make good use of the system profiler and explain output to test you are doing what you think you are doing. And run benchmarks of your code in production, over a period of time. There are some great examples of problems uncovered with this in this schema design post.

  9. How an expired credit card can shut down your entire business

    2 Comments

    Back in the old days when you had physical servers hosting your website, a hosting provider would have to take physical steps to suspend a customer’s account – blocking network access or powering down equipment. In the virtualised/cloud era, it’s now a case of issuing a single command to halt or delete your instances.

    Instead of paying manually by invoice through an accounts receivable team where due dates and multiple attempts to collect payment would be made, billing is now completely automated by credit card. You could forget to pay off your card and hit its limit. Your card could be stolen. It might expire. You could be on holiday and not read the payment alerts. Or something could simply go wrong with your bank.

    This means there’s potential for a failed payment to become a single point of failure as your hosting provider suspends your account for overdue invoices. Do you know how many times your vendor will attempt to bill your account? Will they just suspend your instances or will they be deleted? Is this the same across your hosting vendor? E-mail? DNS?

    X (software/infrastructure/platform) as a service has many benefits but have you considered billing as a single point of failure?

  10. What causes delays in software projects

    Leave a Comment

    For the last few weeks I’ve been testing a new storage backend for the server monitoring time series data we collect in Server Density, used to power our graphs and historical data API. The new system solves a number of scaling issues we’ve been having with timeouts over longer date ranges and allows us to store more data.

    Instead of summarising the 1 minute data points after 1 hour we will keep minute by minute data for 2 weeks for all customers, before it starts being summarised. Optional extensions of that (up to 1 year) can also be purchased. We keep all data forever, it just gets summarised up to hourly and then daily after the initial period.

    The new system involves a queuing layer which allows us to survive data store outages without losing data – it just gets inserted later. The old system is in PHP but, as with all our new development, we’re now using Python. Celery handles the queuing and since the system operates as an internal web service, Tornado handles the web endpoints with Nginx sitting in front of it as a load balancer.

    We’ve been working on this for several months and hoped to get it out sooner, but there have been delays caused by problems picked up with production traffic. As we already use a web service architecture, I have been able to drop in this complete rewrite with no impact to the existing clients – the internal APIs remain the same. This has meant I’ve been able to test against production traffic because the data can be mirrored.

    It’s not until you run real traffic through a system over a period of time that you find problems.

    • The default logging in Celery is extremely verbose. This caused our logs to fill up very quickly with an INFO log line being logged for every processed queue item.
    • Connection pooling in the Python requests library is set too low. The default connection pool size in Python requests is 10. This is insufficient for high volume requests because the pool will be quickly exhausted with an HttpConnectionPool is full, discarding connection error. I needed to implement a custom HTTP adapter to fix this.
    • Some of our client code is still in PHP. We use a pseudo asynchronous method of posting where the HTTP connection is closed immediately and the response ignored. This works fine with Apache (our old setup) but Nginx thinks the client has closed the connection and terminates the internal request. The workaround was to quickly loop through the response.
    • However, as we started to increase the load this workaround fell apart. This is because the code isn’t truly asynchronous so even though it was returning quickly (within 3ms), the load on our servers increased significantly. And when we simulated slowdowns, the entire server would grind to a halt waiting for responses. So instead we changed the Nginx config to ignore client aborts (see also our public Nginx Puppet manifest). Eventually we’ll remove all our PHP client code and use true asynchronous posts.
    • A couple of bugs in Celery were found. One has been fixed and the other we’re still investigating and have a manual workaround for.
    • We tested using a small 32GB SSD for the persistent data storage but this was unnecessary because all tasks usually get processed within a second (and so are in memory) and the disk size is too small to handle more than a few minutes of data store downtime where the queue is backing up. So instead we replaced the SSD with a large SATA2 drive.

    This project is fairly simple in terms of the amount of code – just a few hundred lines – and has very few moving parts (API endpoint + task processing). The complexity comes with the libraries we’re using and how they (and our own code) perform under heavy load. This is what causes the real delays in software projects. And for our customers, we hope to roll this new backend out very soon!