Last night we had an issue in our failover data center in San Jose, USA. All of our servers use our data center vendor's (Softlayer's) internal DNS resolvers in /etc/resolv.conf, and those resolvers went down. This didn't cause any customer impact, but it meant that all the servers in that data center were unable to resolve remote hostnames.
Our workaround was to point the servers at different resolvers, and because we're using Puppet we could roll this out very quickly across a large cluster of machines. However, we only wanted to apply the change to the SJC data center.
Our hostname policy is fairly straightforward. An example web server name is made up of several parts:
hcluster3 – this describes what the server is used for. In this case it's cluster number 3, which hosts our alerting and notification service (all of Server Density is built using a service-oriented architecture). Other examples could be mtx2 (our time series metrics storage cluster, version 2) or sdcom (servers which power our website).
web1 – this is a web server (either Apache or nginx) and is number 1 in the cluster. We have multiple load balanced web servers.
sjc – this is the data center location code, San Jose in this case. We also have locations like wdc (Washington DC) or tyo (Tokyo).
sl – this is the facility vendor name, Softlayer in this case. We also have vendors like rax (Rackspace) and aws (Amazon Web Services).
When we had a much smaller number of servers, the naming convention was based on characters in His Dark Materials by Philip Pullman. So for example, a master database server was Lyra with the slave being Pan. Picking names like this doesn’t scale after 10 or so servers but then you can transition to other things, like names of stars, lakes, rivers, etc.
We moved to the current naming structure a few years ago and this now allows us to quickly identify key information about our servers but also helps when we want to filter by provider or specific locations.
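As an illustration, a convention like this makes hostnames trivially machine-parseable, which is what makes the filtering possible. The exact separators and format below are assumptions for the sketch, not necessarily our production scheme:

```python
# Hypothetical parser for hostnames following the convention described
# above, e.g. "hcluster3-web1.sjc.sl" (separator choice is illustrative).
def parse_hostname(hostname):
    """Split a hostname into cluster, node, location and vendor parts."""
    name, location, vendor = hostname.split(".")[:3]
    cluster, node = name.split("-", 1)
    return {"cluster": cluster, "node": node,
            "location": location, "vendor": vendor}

# Filtering by location or vendor then becomes a one-liner:
servers = ["hcluster3-web1.sjc.sl", "mtx2-db1.wdc.sl", "sdcom-web1.tyo.rax"]
sjc_only = [s for s in servers if parse_hostname(s)["location"] == "sjc"]
```

This is exactly the property that lets a configuration management tool target a change (like the resolver swap) at a single data center.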
In our Puppet /etc/resolv.conf template we can then do things like:
<% if (domain =~ /sl/) -%>
<% if (domain =~ /sjc.sl/) -%>
# google DNS - temp until SL fixed
nameserver 8.8.8.8
nameserver 8.8.4.4
<% else %>
# Internal Softlayer DNS
nameserver 10.0.80.11
nameserver 10.0.80.12
<% end -%>
<% end -%>
And when it comes to using the Puppet console, we can quickly trigger Puppet runs or take actions against specific nodes.
How do other people do this?
If we dissect a traceroute to the Server Density website we can pick out locations (lon, nyc, wdc) and router names.
Our ISP uses names related to the use case with “-less” appended, because when they ordered their first piece of equipment they took the structure from the Dilbert cartoon of the day!
Looking at how Amazon route traffic for the static website hosting feature of S3, you can even look at network maps to visualise where packets are flowing from the UK, to Osaka and then Tokyo in Japan.
This works nicely for a relatively small number of servers which are not transient and have defined workloads but it could be problematic if you have a huge number of machines and/or frequently launch/destroy transient cloud instances or VMs. You can see how Amazon deals with this by setting the hostname to the internal IP address but I could see other solutions where you use an instance ID instead. I’d be interested to learn how other people are doing this.
Each month we’ll round up all the feature changes and improvements we made this month to our server and website monitoring product, Server Density.
December is always a strange month because the final weeks are taken up with holidays, so we decided to focus on many smaller improvements. There were a lot of changes, bug fixes and performance improvements behind the scenes but these are the main things you might notice.
Resizable dashboard widgets
All dashboard widgets can now be resized to any reasonable size you want. This allows you to fit multiple graphs on the same line and reorder the widgets as you like. Simply hover over the right side bar of the widget you want to resize and click, hold and drag the widget left or right.
Linux/Mac/FreeBSD agent: aggregate CPU stats & i/o stats
The latest version of the Linux, Mac and FreeBSD Python-based monitoring agent now returns aggregated CPU stats so you can see a value for “ALL” as well as each individual CPU core. This is useful if you have many cores.
I/O stats are also now collected on OS X, with metrics for kilobytes per transfer, transfers per second and megabytes per second on disk0.
The new Windows agent release also introduces new stats for disk i/o: disk read/writes (bytes/s), disk % utilization, average disk queue length, disk reads/writes per second, average seconds per transfer.
When there is an error in MySQL replication, MySQL reports the Seconds_Behind_Master status as NULL. This is now handled and reported as -1 to allow alerting.
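One way to implement this mapping, assuming a dict-like row parsed from `SHOW SLAVE STATUS` (the field access shown here is illustrative, not the agent's actual code):

```python
# Hypothetical handling of the Seconds_Behind_Master value from a
# "SHOW SLAVE STATUS" row (row format is an assumption for this sketch).
def seconds_behind_master(row):
    """MySQL reports NULL when replication is broken; return -1 instead
    so a numeric threshold alert (e.g. "value < 0") can fire on it."""
    value = row.get("Seconds_Behind_Master")
    return -1 if value is None else int(value)
```

Mapping NULL to -1 keeps the metric numeric, so the existing alerting pipeline can trigger on it without a special case.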
Updated product documentation
With the migration of users from the v1 product, our old support site was out of date and confusing as some articles referenced v1 and some v2. We’ve now updated all the docs so they reference v2 only.
For those still on v1, we recommend migrating your account. We’ll be announcing the ability to have an engineer support you with migrations in 2014.
Displaying agent version
The currently installed agent version is now shown when viewing each device, with links through to the release notes.
Improvements for small screens
We’ve made some initial improvements to help users on smaller screen sizes e.g. laptops, to ensure the UI resizes and adjusts without wrapping.
Triggered alerts highlighted
When viewing alert configs, any config with a triggered alert will be highlighted.
Mb/s network traffic alerting
We’ve supported graphing network traffic data in MB/s (megabytes per second) and Mb/s (megabits per second) for a long time but only supported alerting in MB/s. Alert configs can now be created in Mb/s too.
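The distinction matters because the two units differ by a factor of eight, which is easy to mix up when a graph is in one unit and the alert threshold in the other:

```python
# Megabits (Mb) vs megabytes (MB): 8 bits per byte.
def mbps_to_MBps(megabits_per_sec):
    """Convert megabits per second to megabytes per second."""
    return megabits_per_sec / 8.0

def MBps_to_mbps(megabytes_per_sec):
    """Convert megabytes per second to megabits per second."""
    return megabytes_per_sec * 8.0
```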
The main release for January will be our mobile apps for both iPhone and Android, supporting push notifications. We’re also working on detailed process level monitoring.
Last week I compared cloud instances against dedicated servers, showing that for long running uses such as databases it's significantly cheaper not to use the cloud – but that's not the end of it. Since you're still paying on a monthly basis, projecting the costs out over 1 or 3 years shows you end up paying much more than it would have cost to purchase the hardware outright. This is where buying your own hardware and colocating it becomes a better option.
Continuing the comparison with the same specs for a long running database instance, if we price a basic Dell R415 with two 8-core processors, 32GB RAM, a 500GB SATA system drive and a 400GB SSD, the one-time list price is around $4,000 – less than half the annual price of the SoftLayer server at $9,468/year in the previous article.
Dell PowerEdge R415 front
Of course, the price you pay SoftLayer includes power and bandwidth and these are fees which depend on where you locate your server. Power usage is difficult to calculate because you need to actually stress test the server to figure out the maximum draw and then run real workloads to see what your normal usage is.
My company, Server Density, has just started experimenting with running our own hardware in London. We tested our 1U Dell with very similar specs to those discussed above: it draws 0.6A normally, rising to 1.2A when stress tested with everything maxed out. Hosting this with the ISP who supplies our office works out at $161/month or $1,932/year (it would work out cheaper to get a whole rack at a big data centre, but this was just our first step).
This makes the total annual cost look as follows:
Remember, again, that this is a database server so whilst with Rackspace, Amazon and SoftLayer you pay that price every year, after the first year with colocation the annual cost drops to $1932 because you already own the hardware. Further, the hardware can also be considered an asset which has tax benefits.
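Putting the figures above together (the ~$4,000 hardware purchase, $1,932/year colo hosting, and SoftLayer's $9,468/year rental), a quick projection shows how the gap widens each year:

```python
# Back-of-the-envelope multi-year comparison using the figures quoted above.
hardware = 4000            # one-time Dell R415 list price (USD)
colo_per_year = 1932       # hosting with the office ISP
rental_per_year = 9468     # equivalent SoftLayer dedicated server

for years in (1, 2, 3):
    owned = hardware + colo_per_year * years   # hardware is paid once
    rented = rental_per_year * years           # rental recurs every year
    print(years, owned, rented)
```

After three years the colocated server has cost $9,796 against $28,404 of rental, before even counting the tax treatment of the hardware as an asset.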
Server Density is still experimenting on a small scale, but I spoke to Mark Schliemann, VP of Technical Operations at Moz.com, because they run a hybrid environment. They recently moved the majority of their environment off AWS and into a colo facility with Nimbix, but are still using AWS for processing batch jobs (the perfect use case for elastic cloud resources).
Moz worked on detailed cost comparisons to factor in the cost of the hardware leases (routers, switches, firewalls, load balancers, SAN/NAS storage & VPN), virtualization platforms, misc software, monitoring software/services, connectivity/bandwidth, vendor support, colo and even travel costs. Using this to calculate their per-server costs means on AWS they would spend $3,200/m vs $668/m with their own hardware. Projecting out 1 year results in costs of $8,016 vs AWS at $38,400.
Moz’s goal for the end of Q1 2014 is to be paying $173,000/month for their own environment plus $100,000/month for elastic AWS cloud usage. If they remained entirely on AWS it would work out at $842,000/month.
Optimizing utilization is much more difficult on the cloud because of the fixed instance sizes. Moz found they were much more efficient running their own systems virtualized because they could create the exact instance sizes they needed. Cloud providers often increase CPU allocation alongside memory when in real world uses you tend to need one or the other. Running your own environment allows you to optimize this and was one of the big areas Moz have used to improve their utilization. This has helped them become much more efficient with spend.
Right now we are able to demonstrate that our colo is about 1/5th the cost of Amazon but with RAM upgrades to our servers to increase capacity we are confident we can drive this down to something closer to 1/7th the cost of Amazon.
Colocation has its benefits once you’re established
Colocation looks like a winner but there are some important caveats:
First and foremost, you need in-house expertise because you need to build and rack your own equipment and design the network. Networking hardware can be expensive and if things go wrong, you need to have the knowledge about how to deal with the problem. This can involve support contracts with vendors and/or training your own staff. However, this does not usually require hiring new people because the same team that has to deal with cloud architecture, redundancy, failover, APIs, programming, etc, can work on the ops side of things running your own environment.
The data centers chosen have to be easily accessible 24/7 because you may need to visit at unusual times. This means having people on-call and available to travel, or paying high hourly fees for remote hands at the data center to fix things.
You have to purchase the equipment upfront which means large capital outlay but this can be mitigated by leasing.
So what does this mean for the cloud? On a pure cost basis, buying your own hardware and colocating it is significantly cheaper. Many will say that the real cost is hidden with staffing requirements but that’s not the case because you still need a technical team to build your cloud infrastructure.
At a basic level, compute and storage are commodities. The way the cloud providers differentiate is with supporting services. Amazon has been able to iterate very quickly on innovative features, offering a range of supporting products like DNS, mail, queuing, databases, auto scaling and the like. Rackspace has been slower to do this but is now starting to offer similar features.
Flexibility of cloud needs to be highlighted again too. Once you buy hardware you’re stuck with it for the long term but the point of the example above was that you had a known workload.
Considering the hybrid model
Perhaps a hybrid model makes sense, then? This is where I believe a good middle ground is and we can see Moz making good use of such a model. You can service your known workloads with dedicated servers and then connect to the public cloud when you need extra flexibility. Data centers like Equinix offer Direct Connect services into the big cloud providers for this very reason, and SoftLayer offers its own public cloud to go alongside dedicated instances. Rackspace is placing bets in all camps with public cloud, traditional managed hosting, a hybrid of the two and support services for OpenStack.
And when should you consider switching? Dell cloud exec Nnamdi Orakwue said companies often start looking at alternatives when their monthly AWS bill hits $50,000, but is even this too high?
Using cloud infrastructure is the natural starting point for any new project, because unknown requirements are one of the ideal use cases for the cloud; the other is needing elasticity to run workloads for short periods at large scale, or to handle traffic spikes. The problem comes months later, when you know your baseline resource requirements.
Let’s consider a high throughput database as an example. Most web applications have a database storing customer information behind the scenes but whatever the project, requirements are very similar – you need a lot of memory and high performance disk I/O.
Evaluating pure cloud
Looking at the costs for a single instance illustrates the requirements. In the real world you would need multiple instances for redundancy and replication, but we'll just work with a single instance for now:
Amazon EC2 c3.4xlarge (we can’t consider m2.2xlarge because it is not SSD backed)
= 30GB RAM, 320GB SSD storage
= $1.20/hr or $3726 + $0.298/hr heavy utilization reserved
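For an always-on instance, a back-of-the-envelope calculation from the prices above shows why the reserved option is quoted (assuming a full 8,760-hour year):

```python
# Annual cost of the c3.4xlarge quoted above: on-demand vs heavy
# utilization reserved (upfront fee plus a discounted hourly rate).
hours_per_year = 24 * 365                  # 8,760 hours
on_demand = 1.20 * hours_per_year          # always-on at the on-demand rate
reserved = 3726 + 0.298 * hours_per_year   # upfront + reserved hourly rate
print(round(on_demand), round(reserved))   # 10512 6336
```

Even with the reserved discount, a single always-on instance runs over $6,300/year before redundancy, replication or data transfer.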
Databases also tend to exist for a long time and so don’t generally fit into the elastic model. This means you can’t take advantage of the hourly or minute based pricing that makes cloud infrastructure cheap in short bursts.
Another issue with databases is they tend not to behave nicely if you’re contending for I/O on a busy host so both Rackspace and Amazon let you pay for dedicated instances — on Amazon this has a separate fee structure and on Rackspace you effectively have to get their largest instance type. Calculating those costs out for our annual database instance would look like this:
(The extra $2 per hour on EC2 is charged once per region)
Note that because we have to go for the largest Rackspace instance, the comparison isn’t direct — you’re paying Rackspace for 120GB RAM and x4 300GB SSDs. On one hand this isn’t a fair comparison because the specs are entirely different but on the other hand, Rackspace doesn’t have the flexibility to give you a dedicated 30GB instance.
Consider the dedicated hardware option…
Given the annual cost of these instances, the next logical step is to consider dedicated hardware where you rent the resources and the provider is responsible for upkeep. Here at Server Density, we use Softlayer, now owned by IBM, and have dedicated hardware for our database nodes. IBM is becoming very competitive with Amazon and Rackspace so let’s add a similarly spec’d dedicated server from SoftLayer, at list prices:
To match a similar spec we can choose the Dual Processor Hex Core Xeon 2620 – 2.0GHz Sandy Bridge with 32GB RAM, 32GB system disk and 400GB secondary disk. This costs $789/month or $9,468/year. This is 80 percent cheaper than Rackspace and 61 percent cheaper than Amazon before you add data transfer costs – SoftLayer includes 5,000GB of data transfer per month, which would cost $600/month on both Amazon and Rackspace, a saving of $7,200/year.
… or buy your own
There is another step you can take as you continue to grow — purchasing your own hardware and renting data center space i.e. colocation. We’ll look into the tradeoffs on that scenario in a post to come so make sure you subscribe.
Anyone who has ever run a full screen flash video or Google Hangouts knows that they are good stress tests for your CPU, and after a while the fans start spinning as the CPU usage causes temperatures to increase.
Now that we're graphing this, we can actually see and prove the direct correlation between CPU load and temperature. Look how closely they match each other:
Of course this also means more power is being drawn and we can see a similar correlation between load and power usage:
These graphs are cool to look at, but there's also a purpose behind monitoring metrics of this kind – making sure they stay within acceptable ranges. We want to know if temperatures suddenly go up (which could indicate a failed fan or a data center issue), if power suddenly drops to zero on one PSU (again, a failure of that PSU), and similarly if fan speed drops too low.
For more information on how you could start monitoring metrics like this, it’s worth trying out our hosted server monitoring software. After all, it’s free for 15 days!
Each month we’ll round up all the feature changes and improvements we made this month to our server and website monitoring product, Server Density.
Dashboard and custom graph builder
At the start of the month we released the ability to create custom graphs by combining any metric from any device or service check. These graphs can then be arranged onto custom dashboards alongside status widgets for service checks, showing uptime, response time and status.
The notification center allows you to see all open and closed alerts across your account. Previously, the red badge showing the count of open alerts was a global count across your whole account. You can now toggle a filter so it only counts alerts where you are a recipient, so the count only reflects the alerts you care about.
There’s also a new toggle on the far right which allows you to disable or enable the context sensitive nature of the notification center. As you browse the app the notification center will change to show only device or service specific alerts depending on where you are in the app. This toggle allows you to disable this if you always want to see the global view.
We found we were often getting requests for packages in between the old server and web check quotas of our 1, 10, 50 and 100 packages. As such, we've added new packages in between at 2, 5, 25 and 75. Pricing is mostly the same but has been slightly decreased or increased to make the jumps logical. Existing customers remain on their existing packages with no changes to prices, but you can switch to the new ones from within your account.
Increased free SMS alerts
We are now offering more free SMS credits for each pricing package. The limits have been set high enough to make them effectively unlimited, although there is still a cap which is based on the package you are on. You can see the new included SMS credits here.
For existing customers the increased SMS credits will be automatically applied to your account on your next billing date.
The Flexiant cloud platform now has an official plugin in their latest release which enables you to easily add, remove or change the status of servers being monitored automatically as you make the same changes inside your Flexiant platform. Flexiant is a cloud management platform which gives service providers, telcos and others the ability to create and sell cloud services.
We’ve started work on our mobile apps for iPhone and Android so you can get push notification alerts and manage your notifications from your device. These will be out in January.
Chaos Monkey is a service which runs in Amazon Web Services (AWS), seeking out Auto Scaling Groups (ASGs) and terminating instances (virtual machines) in each group.
The idea is to randomly kill parts of your infrastructure to check the redundancy of components and ensure you can handle failover.
We decided to build our own lightweight version as a simple Python script because we don’t use AWS or Java. When doing this, or using the actual Netflix code, there are some considerations to keep in mind:
Trigger chaos events during business hours
Although real failures can happen at any time, you want deliberate failures to happen when people are around to a) respond to them and b) fix them. It’s not fair to be waking people up with unnecessary on-call events in the middle of the night! Consider what business hours are not just for you but your entire team, especially if you have remote workers or several offices in different timezones.
Decide what level of mystery you want
When our script triggers a chaos event, it posts into our Hipchat room saying it has done so, but not what the details are. This partially simulates a real outage, because in the initial stages you still need to triage the alerts to see where the failures are, but everyone knows to look out for strange things. This prevents issues going unnoticed: you discover whether you need to improve your monitoring from the results of a known chaos event, not from customers telling you stuff is broken.
Have several failure modes
Killing instances is just one way to simulate failure but doesn’t cover all possible options. It’s good to try and simulate as many different complete or partial failures as possible. We use the Softlayer API to trigger server power downs but also use their API to disable public and/or private networking interfaces too. This gives you full failure with a power off but also a network failure mode where the host still remains up (and may even still report up to your monitoring).
Don’t trigger sequential events
After one chaos event you don’t want to have to deal with another one just a short time later, especially if the bugs discovered aren’t fixed yet. Have a wait period so after triggering one event another one won’t be triggered for a few hours. You don’t want people constantly firefighting.
Play around with the event probability
Events should be infrequent and random and there may be none triggered for several days. This helps to test your on-call response to keep the unexpected nature of these kinds of events real.
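Putting those considerations together, a lightweight trigger loop might look something like this. This is a minimal sketch: the business hours, probability, cooldown, failure mode names and chat announcement are all illustrative, not our actual script:

```python
import random
import time
from datetime import datetime

BUSINESS_HOURS = range(10, 16)    # overlap of all offices' working hours
EVENT_PROBABILITY = 0.05          # low enough that most checks do nothing
COOLDOWN_SECONDS = 4 * 60 * 60    # don't trigger sequential events
FAILURE_MODES = ["power_off", "disable_public_net", "disable_private_net"]

def should_trigger(last_event_ts, now=None):
    """Only fire during business hours on a weekday, outside the
    cooldown window, and then only with low probability."""
    now = now or datetime.now()
    if now.weekday() >= 5 or now.hour not in BUSINESS_HOURS:
        return False
    if time.time() - last_event_ts < COOLDOWN_SECONDS:
        return False
    return random.random() < EVENT_PROBABILITY

def trigger_chaos(servers):
    """Pick a random target and failure mode; announce that *something*
    happened, but keep the details a mystery."""
    target = random.choice(servers)
    mode = random.choice(FAILURE_MODES)
    # The real script would call the hosting provider's API here, e.g.
    # to power the server off or disable a network interface.
    print("Chaos event triggered -- go find it!")  # posted to the chat room
    return target, mode
```

The key design point is that the randomness sits behind deterministic guards, so events stay unexpected without ever waking anyone at 3am.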
None of the issues that we discovered from these chaos events were in the server level software – failover in load balancers (nginx) and databases (MongoDB), for example, works very well. Every problem we have found has been in our own code, mostly in how the code interacts with databases in failover mode and mostly in libraries we've not written. This has allowed us to report bugs upstream and improve the resiliency of our own software, but it does require some engineering time and effort to get right.
Using a Chaos Monkey is really the only way to test how your infrastructure will behave under unknown failure conditions. Failure will happen so you have to engineer around it – doing so in a theoretical way only goes so far and the only test is to trigger random events in the real world.
Today we have released the ability to create custom graphs by combining any metric from any device or service check. These graphs can then be arranged onto custom dashboards alongside status widgets for service checks, showing uptime, response time and status.
Custom graphs allow you to compare metrics across multiple servers and design the graphs you need for troubleshooting or monitoring metrics across clusters. Design your own graphs with any metric, including plugins and with the choice of which axis to plot the line on.
These can then be placed onto dashboards designed to be viewed on a big TV or separate display, so you can see the status at a glance. You can create multiple dashboards across multiple time ranges, all shared within your account.
This is available now for all Server Density v2 accounts. Initial dashboard widgets are service check status, uptime and response time plus graphs – we want to hear what other widgets you’d like to see next!
Each month we’ll round up all the feature changes and improvements we made this month to our server and website monitoring product, Server Density.
Global notification center
An expandable right hand panel reveals the new notification center which gives you a global view of alerting on your account. You can see all open alerts across devices and service checks, filtering by specific group, device, service or whether the alert is notifying you or across the whole account. It’s context sensitive so changes depending on what your current view is and allows you to view the alert history for closed alerts.
It will also display any errors for your cloud servers, such as failing to start cloud instances.
This will form the basis for our upcoming mobile apps, which will initially focus on alerts before expanding to all functionality.
Improved service monitoring accuracy
We received a number of reports of service monitoring web checks marking services as down or timed out because the monitoring nodes were very sensitive to network issues. We've pushed out changes to fix this, so false positives – incorrect alerts saying sites are down when a transient timeout was to blame – should now be resolved. The check methodology has been changed to retry failed requests within a few seconds to verify they are actually down.
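The retry-before-alerting idea is conceptually simple; here is a hedged sketch assuming a plain HTTP GET check (the retry counts and timeouts are illustrative, not the monitoring nodes' actual values):

```python
import time
import urllib.request

def check_url(url, retries=2, delay=3, timeout=10):
    """Return True if the URL responds on any attempt; only report a
    failure after every retry has also failed, so a transient network
    blip on the monitoring node doesn't become a false "down" alert."""
    for attempt in range(retries + 1):
        try:
            urllib.request.urlopen(url, timeout=timeout)
            return True
        except Exception:
            if attempt < retries:
                time.sleep(delay)  # give transient issues time to clear
    return False
```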
Left / right graph axis
You can choose which axis each graph series will be plotted, which allows you to view series with different scales or units on the same graph. This makes it easier to compare metrics which have very different values and still be able to see the spikes.
You can now set up “no data received” alerts on a group level rather than needing to do it on individual servers. This was originally postponed as it requires active checking on a regular schedule across all members of a group, which is more complex to implement than on individual devices.
All v1 users can now self migrate their own accounts to v2 by clicking the tab in-app. This will migrate all settings but still let you use v1 alongside v2, so as not to affect production monitoring. Full details are here.
At the start of November we’ll be releasing our custom graph builder and dashboards to allow you to create graphs combining metrics across devices and service checks, plus creating custom dashboards to display your metrics and graphs.
Recently we’ve been reviewing the infrastructure that powers our server and website monitoring service, Server Density, and as a result we have started an experiment looking into buying and colocating our own physical hardware.
Currently, the service is run from 2 data centers in the US with Softlayer and we’re very happy with the service. The ability to deploy new hardware or cloud VMs within hours or minutes on a monthly contract, plus the supporting services like global IPs is very attractive. However, we’re now spending a significant amount of money each month which makes it worth considering running our own hardware.
In particular, the large servers which power our high throughput time series MongoDB databases for our graphing are very expensive when you project the cost out over a long period of time. We’re processing over 25TB of inbound data and working with that volume and making the graphs fast requires lots of RAM and big SSDs, both of which are very expensive when billed monthly.
What about the cloud?
Cloud infrastructure like EC2 or Rackspace Cloud is perfect for a number of use cases. It’s great for startups who want cheap (in the short term) servers and don’t know their workload patterns. It’s also great for elastic workloads and scaling quickly. However, our use case is completely different – we have a consistent level of traffic all day, every day and it only grows. It doesn’t fluctuate because servers are constantly sending us data all the time. This means it’s very easy to predict our workload and the flexibility the cloud offers isn’t necessary.
What about dedicated?
We currently have a mixture of dedicated servers (for our databases, which need high memory and guaranteed disk I/O performance) and VMs (for web servers and other processing tasks such as alerting, which is usually CPU bound). These are managed by us but rented from Softlayer, so we don't need to deal with networking, failed hardware, etc. However, if you project the cost out beyond a month, the rental fees add up to the full purchase price of the server after around 6 months – and given that the lifetime of a server is usually longer than that, there is a significant cost saving.
So why choose colocation?
If you have a consistent workload then you can always buy significantly higher spec hardware at a much lower cost than renting it monthly. It counts as an asset and so there are also tax benefits. You get much more control over your infrastructure.
There are downsides
Renting servers from the likes of Softlayer or Amazon removes a lot of the “old school” sysadmin work. As soon as you manage your own setup there are quite a few things you need to consider:
What happens in an emergency when hardware fails? You need someone to physically repair/replace the hardware.
We will design our infrastructure so multiple servers can fail without needing immediate replacement. We already have redundancy on both server and data center levels so will do the same thing here. We’ll be able to fail over to an entirely separate data center if everything fails but will have redundancy on the server level too.
We’ll also consider the most common failure scenarios and ensure they can be fixed by remote hands rather than needing to be physically at the data center – things like hot swappable disks and power supplies make this easy, ensuring we keep spares on site.
And finally we’ll be fairly close to the facilities so can send people there if absolutely necessary. We’re looking at data centers just 3 miles from our London office, some a bit further away on the other side of London (12 miles away) as well as nearby European facilities in Amsterdam and Frankfurt.
Can you deal with all the extra technical requirements of the supporting infrastructure, in particular networking?
Network problems are the worst to debug because they’re often transient and can very easily cause massive problems. You’re responsible for this. We have in-house expertise, with several of us having prior hardware experience. We also have support contracts with key vendors so we can always escalate issues when necessary.
If you need to scale suddenly, can you provision new capacity in time, and do you have the financial resources to make the hardware purchases in one go?
This is more difficult but given our predictable demand, we don’t think this will be an issue. That said, we are looking at having either our primary or secondary data center also offer us rentable servers which can be introduced to the rack or at least private network on short notice. We can rent hardware from them on a short term basis whilst we ramp up our own hardware capacity.
Experimenting with one London colocation server
Given the importance of our infrastructure, we have decided to start the experiment with a single server to run our internal tools.
The old setup
Right now we have the following servers at Softlayer powering some internal stuff:
Build master (buildbot): VM x2 CPU 2.0Ghz, 2GB RAM – $89/m
Build slave (buildbot): VM x1 CPU 2.0Ghz, 1GB RAM – $40/m
Staging load balancer: VM x1 CPU 2.0Ghz, 1GB RAM – $40/m
Staging server 1: VM x2 CPU 2.0Ghz, 8GB RAM – $165/m
Staging server 2: VM x1 CPU 2.0Ghz, 2GB RAM – $50/m
Puppet master: VM x2 CPU 2.0Ghz, 2GB RAM – $89/m
Total: $473/m USD
It’s also worth mentioning that Softlayer include 1TB public data transfer by default, but these are all mostly doing internal private network traffic which is free anyway.
The colocation replacement server
We have purchased a Dell 1U rack server to replace these 6 servers:
32GB Memory for 2CPU (8x4GB Dual Rank LV RDIMMs) 1600MHz
x4 1TB, SATA, 3.5-in, 7.2K Hard Drive (Hot-plug)
1 Redundant Power Supply (2 PSU) 500W
Total: £2,066 GBP ($3,346 USD).
We also purchased rack rails and an additional x2 disks for spares. The plan is to virtualise the server and run multiple VMs from the one machine.
Dell server boxes
Dell PowerEdge R415 front
Dell server insides – dual AMD 8 core CPUs
Memory and CPU
Our own RAID
To achieve our goal of server level redundancy and the ability for remote hands to fix as much as possible, the disks are hot swappable and we will also run RAID 10. However, the only RAID card Dell offers has an admin interface which requires Silverlight. We didn’t want a core part of the system to rely on a proprietary plugin which will most likely be end of life soon, so we ordered the server without a RAID controller and bought our own instead – an Adaptec RAID 6405E SATA 4 Channel Storage Controller (PCI-Express) – for £123 ($199 USD).
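As a quick sanity check on that layout: RAID 10 mirrors pairs of disks and stripes across the mirrors, so usable capacity is half the raw total and the array survives one disk failure per mirrored pair. A minimal sketch:

```python
# RAID 10 stripes across mirrored pairs, so usable capacity is
# half the raw total. With x4 1TB disks that leaves 2TB usable.
def raid10_usable_tb(disks, disk_tb):
    assert disks % 2 == 0, "RAID 10 needs an even number of disks"
    return disks // 2 * disk_tb

print(raid10_usable_tb(4, 1))  # → 2
```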
Servers are designed to run without a keyboard, mouse or monitor for most of their life, but you still need them for the initial setup. We didn’t have any non-Mac displays in the office so had to buy a cheap Dell TFT, before @devopstom suggested a KVM Console to USB 2.0 Portable Laptop Crash Cart Adapter which presents the server’s console on a Mac, Linux or Windows system.
Initial setup with a cheap TFT
The total cost of the server is therefore £2,189 ($3,544 USD) which, based on replacing those 6 servers at $473/m, means we will break even on the hardware after 8 months. And since the Dell server is much higher spec, we will be able to fit more on there and/or give more capacity to the existing tools.
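The break-even figure works out as follows, using the numbers above:

```python
import math

# Figures from this post (USD)
monthly_vm_cost = 473   # total for the 6 Softlayer VMs
server_cost = 3544      # Dell server plus RAID card and spares

# Months until the hardware pays for itself vs the VM bill
break_even_months = math.ceil(server_cost / monthly_vm_cost)
print(break_even_months)  # → 8
```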
London colocation pricing
I found that most London colocation providers don’t charge for the rack itself – everything is based around networking and power. The former is easy to figure out from the stats we already have, but power is more difficult: you actually have to buy the hardware and run different loads on it to figure out what you need, using something like this Energie Power Meter.
To make things more complex, some providers quote in amps and others in kWh. These measure different things (current draw vs energy consumed), but given a known supply voltage you can convert between them for comparison.
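As an illustration of the conversion, assuming the UK’s nominal 230V supply and a constant load (a simplification – a real server’s draw varies with workload):

```python
# Convert a measured current draw (amps) into monthly energy use
# (kWh). Assumes UK nominal mains of 230V and a constant load.
VOLTAGE = 230
HOURS_PER_MONTH = 24 * 30

def amps_to_kwh_per_month(amps):
    watts = amps * VOLTAGE
    return watts * HOURS_PER_MONTH / 1000

# The Dell server's idle draw from this post:
print(round(amps_to_kwh_per_month(0.57), 1))  # → 94.4
```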
Power pricing gets complex and is charged differently for each facility. For example, Telecity charge as follows:
The Dell server (above) draws around 0.57A at idle, but you have to test a real workload because if you go over your allocated amount you will likely be shut down. We’ve not got that far yet so I don’t have any figures for now – this will be in a follow-up blog post.
Redundant hot swappable power supplies
Locations for London colocation
There are a number of big name providers, plus quite a few smaller companies who resell space in the larger facilities and occasionally have their own. There seem to be 3 key sites in London:
West London / Heathrow, including Acton (conveniently close to our office!)
Central, around Holborn and the City
Docklands / East London
Where you pick depends on things like:
How quickly can you reach the data center in an emergency? Do you need to send someone on-site to fix things? Are they coming from your office or from home? What public transport links are there, and what happens out of hours when trains etc might not be running?
Whether you have any strict latency requirements, e.g. being close to the City / London Stock Exchange, where milliseconds count for real time trading.
London Equinix locations
All colocation packages charge based on a minimum committed bandwidth. They’ll usually give you multiple 10/100/1000 ports which can burst but you pay based on a known monthly minimum. Pricing decreases based on the amount committed but generally looks like this:
One company (Coreix) offered us significantly lower rates, starting at £50/m for 150Mbps and going up to £1,500/m for 1000Mbps. This is suspiciously low compared to other providers, so I’m unsure what to think about it.
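One way to sanity-check quotes like this is to normalise them to a per-Mbps monthly cost. A minimal sketch using Coreix’s figures from above:

```python
# Normalise a committed-bandwidth quote to GBP per Mbps per month.
def per_mbps(monthly_price_gbp, committed_mbps):
    return monthly_price_gbp / committed_mbps

# Coreix's quoted range from this post:
print(round(per_mbps(50, 150), 2))     # → 0.33
print(round(per_mbps(1500, 1000), 2))  # → 1.5
```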
These prices are for the networking products the data centers themselves offer – usually a multi-homed product across multiple links/providers – but you are free to work directly with the transit vendors present in each data center. If you have large traffic requirements, want to choose a specific transit provider or simply want control over all the networking, that is an option. With no custom needs, and just moving into the world of colo, choosing a package from the data center provider is a good choice for us.
Dual network ports
Inter data center metro fiber
All the big data centers are connected via a “metro” fiber ring so you can have multiple facilities, as we plan to do. We already get this from Softlayer for free as part of their private network – one of their great features – but with colo you have to pay for it. Pricing is based on committed usage and we do a large amount of internal traffic with database replication and processing of monitoring payloads.
Examples of pricing are 100Mbps for £150-350/m and 1000Mbps for £750/m, although again Coreix were much cheaper quoting us £350 for 1000Mbps.
Choosing a provider
We got quotes from Telecity and Equinix as the two big players, plus Andrews & Arnold (who also provide our office connectivity but are very expensive), Coreix and 4D as smaller players. The big players own multiple facilities; the smaller ones have one facility of their own and then resell space in Telecity. Coreix are suspiciously cheap compared to everyone else, by quite a large margin.
We’re likely to pick one of the big players for our core x2 data centers but for this experiment will host the internal tools server with a different provider. This gives us some vendor redundancy and the big guys only sell by the quarter or half rack at a minimum. Our tools server only needs 1U for this experiment, but ultimately we’ll be purchasing at least 1 full rack in each data center.
I’ll be following up this post in a month or so once we’ve deployed the server, with some final costings and anything else I learn. If you’d like to be notified of that, you can subscribe to our blog.