Evolution of our hosting – hybrid
CEO & Founder of Server Density.
Published on the 17th April, 2012.
Last week we completed the migration of almost 70 servers from the Terremark Enterprise Cloud to Softlayer. The majority of the work was done with zero downtime but the final stage was moving the live application data across, which resulted in 2h56m of app downtime. This was purely the amount of time it took to export, compress, transfer (SCP) and reimport into MongoDB from Terremark in Miami to Softlayer in Washington, DC. Everything went to plan and there were no unexpected issues during the move itself.
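For readers curious what an export → compress → transfer → reimport pipeline like this looks like in practice, here is a rough sketch. The post doesn't say which tools were used, so `mongodump`/`mongorestore` are an assumption, and all hostnames, paths and the database name are placeholders.

```shell
# Hedged sketch of a dump-and-restore migration between data centres.
# mongodump/mongorestore, the hostnames, paths and database name below
# are illustrative assumptions, not details from the migration itself.

# On the old (Miami) side: export the database and compress the dump
mongodump --db serverdensity --out /backup/dump
tar -czf /backup/dump.tar.gz -C /backup dump

# Transfer the compressed dump to the new (Washington, DC) host over SCP
scp /backup/dump.tar.gz db1.newdc.example:/backup/

# On the new side: decompress and reimport
tar -xzf /backup/dump.tar.gz -C /backup
mongorestore --db serverdensity /backup/dump/serverdensity
```

The application has to stay down for the whole pipeline, which is why the time to export, compress, transfer and reimport translates directly into the downtime window.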
Since Server Density started in 2009 the hosting environment has evolved from a single VPS at Slicehost, to purely dedicated at Rackspace, then purely cloud at Terremark and now we’re in a hybrid cloud/dedicated environment at Softlayer. The initial vendor choice from Slicehost through to Terremark was mostly driven by cost, but our financial situation now means we can mostly ignore cost – even though it still worked out cheaper overall with Softlayer compared to what we were paying Terremark.
The main driver for the decision to switch was performance. We reached the limits of the shared storage platform provided by Terremark and although there was usually no customer impact due to our redundant setup, we were battling with performance issues in the background. We were also seeing CPU resource allocation issues where performance was unpredictable and often degraded based on usage from other customers in the same shared environment. Terremark weren’t able to provide us with dedicated resources fast enough or at a reasonable cost, so we started looking at alternatives. After talking to our friends at Mixpanel (and reading their blog) and to one of our customers, Struq, we selected Softlayer.
The Server Density application core data store is now running on dedicated servers with SSDs and 2Gbps internal networking for MongoDB; all other servers are cloud based. Softlayer provision new hardware within 1–4 hours and, because everything shows up on the same network, we can very quickly switch servers over to dedicated hardware as needed. We’re seeing great performance from their cloud instances too and have some interesting benchmarks comparing Softlayer and Terremark.
Combined with storing all data and indexes in RAM, we get an average MongoDB query speed of 0.35ms at around 1000 writes (mostly updates) per second from the dedicated hardware + SSDs for the core application database. These servers are benchmarked at around 15k findAndModify (atomic) queries per second and went up to almost 30k inserts per second during our data import. This is separate from our internal metrics service which stores historical time series data and is running on the Softlayer cloud, giving us an average query speed of 0.09ms at around 3000 writes (mostly updates) per second. Note that although the cloud appears to be giving better performance here, the comparison is not like for like because of the types of queries, the size of the documents being written and how the working set is made up.
Better per-instance performance also means we were able to get rid of 15 nodes with no real increase in spec for the remaining nodes. For example, we went from 7 down to 3 web nodes serving the main web app (1+1 redundancy). Although it’s cool to say you manage hundreds of servers, it’s actually better to be able to reduce the number: it avoids licensing costs (e.g. we pay 10gen for MongoDB support and Canonical for Ubuntu support on a per-server basis), reduces management overhead and reduces the number of things that can fail.
We are deployed in the Softlayer Washington, DC data centre with disaster recovery in San Jose. However, now that the migration is done, our next project is serving from multiple data centres. This is something we’ve been working on for some time and all the necessary code/application changes are done. Once deployed, you will be accessing Server Density from one of several locations, depending on which is closest to you. And of course we’ll be able to survive the failure of an entire data centre as a result of these changes.
Over the next few weeks I’ll be writing about different aspects of our new setup.
Photo: By Victorgrigas (Own work) [CC BY-SA 3.0 (http://creativecommons.org/licenses/by-sa/3.0)], via Wikimedia Commons