Server monitoring that doesn't suck.


Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. How to monitor MongoDB

    Update: We hosted a live Hangout on Air with Paul Done from MongoDB discussing how to monitor MongoDB. We’ve made the slides and video available, which can be found embedded at the bottom of this blog post.

    We use MongoDB to power many different components of our server monitoring product, Server Density. This ranges from basic user profiles all the way to high throughput processing of over 30TB/month of time series data.

    All this means we keep a very close eye on how our MongoDB clusters are performing, with detailed monitoring of all aspects of the systems. This post will go into detail about the key metrics and how to monitor your MongoDB servers.

    MongoDB Server Density Dashboard

    Key MongoDB monitoring metrics

    There is a huge range of different things you should keep track of with your MongoDB clusters, but only a few that are critical. These are the monitoring metrics we have on our critical list:

    Oplog replication lag

    The replication built into MongoDB through replica sets has worked very well in our experience. However, by default, writes only need to be accepted by the primary member and are replicated down to the secondaries asynchronously, i.e. MongoDB is eventually consistent by default. This means there is usually a short window during which data has not yet been replicated should the primary fail.

    This is a known property, so for critical data, you can adjust the write concern to return only when data has reached a certain number of secondaries. For other writes, you need to know when secondaries start to fall behind because this can indicate problems such as network issues or insufficient hardware capacity.

    MongoDB write concern

    Replica secondaries can sometimes fall behind if you are moving a large number of chunks in a sharded cluster. As such, we only alert if the replicas fall behind for more than a certain period of time, e.g. if they recover within 30 minutes then we don't alert.
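    As a sketch of that alerting logic (the threshold and grace period here are illustrative, not our actual configuration; in practice the lag samples would come from comparing member optimes in rs.status()):

```python
from datetime import datetime, timedelta

def oplog_lag_seconds(primary_optime, secondary_optime):
    """Replication lag: how far a secondary's last applied operation
    trails the primary's, in seconds."""
    return max(0.0, (primary_optime - secondary_optime).total_seconds())

def should_alert(lag_samples, threshold_s=60, grace=timedelta(minutes=30)):
    """Alert only if lag has stayed above the threshold for the whole
    grace window, so transient lag (e.g. during chunk moves) is ignored.
    lag_samples is a list of (timestamp, lag_seconds), oldest first."""
    if not lag_samples:
        return False
    window_start = lag_samples[-1][0] - grace
    window = [(t, lag) for t, lag in lag_samples if t >= window_start]
    if window[0][0] > window_start:
        return False  # not enough history yet to cover the grace period
    return all(lag > threshold_s for _, lag in window)
```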

    Replica state

    In normal operation, one member of the replica set will be primary and all the other members will be secondaries. This rarely changes and if there is a member election, we want to know why. Usually this happens within seconds and the condition resolves itself but we want to investigate the cause right away because there could have been a hardware or network failure.

    Flapping between states is not a normal working condition and should only happen deliberately, e.g. for maintenance, or during a genuine incident, e.g. a hardware failure.

    Lock % and disk i/o % utilization

    As of MongoDB 2.6, locking happens at the database level, with work ongoing to implement document-level locking for MongoDB 2.8. Writes take out the database lock, so if this happens too often you will start seeing performance problems as other operations (including reads) get backed up in the queue.

    We’ve seen high effective lock % be a symptom of other issues within the database e.g. poorly configured indexes, no indexes, disk hardware failures and bad schema design. This means it’s important to know when the value is high for a long time, because it can cause the server to slow down (and become unresponsive, triggering a replica state change) or the oplog to start to lag behind.
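    For illustration, the effective lock % over a sampling interval can be derived from two consecutive serverStatus snapshots. This sketch assumes the pre-2.8 globalLock.lockTime and globalLock.totalTime counters (both in microseconds):

```python
def lock_percent(prev, curr):
    """Effective lock % between two serverStatus globalLock snapshots,
    i.e. the share of elapsed time spent holding the lock."""
    lock_delta = curr["lockTime"] - prev["lockTime"]
    total_delta = curr["totalTime"] - prev["totalTime"]
    if total_delta <= 0:
        return 0.0
    return 100.0 * lock_delta / total_delta
```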

    However, this alert can trigger too often, so you need to be careful. Set long delays, e.g. only trigger if the lock remains above 75% for more than 30 minutes, and if you have alerts on replica state and oplog lag, you can actually set this as a non-critical alert.

    Related to this is how much work your disks are doing, i.e. disk i/o % utilization. Approaching 100% indicates your disks are at capacity and you need to upgrade them, e.g. from spinning disk to SSD. If you are already using SSDs, then you need to either provide more RAM or split the data into shards.

    MongoDB SSD performance benchmarks

    Non-critical metrics to monitor MongoDB

    There is a range of other metrics you should keep track of on a regular basis. Even though they are non-critical, investigating and dealing with them will help prevent issues from escalating into critical production problems.

    Memory usage and page faults

    Memory is probably the most important resource you can give MongoDB and so you want to make sure you always have enough! The rule of thumb is to always provide sufficient RAM for all of your indexes to fit in memory, and where possible, enough memory for all your data too.

    Resident memory is the key metric here – MongoDB provides some useful statistics to show what it is doing with your memory.

    Page faults are related to memory because a page fault happens when MongoDB has to go to disk to find the data rather than memory. More page faults indicate that there is insufficient memory, so you should consider increasing the available RAM.
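    Both rules of thumb are easy to turn into checks. In this sketch, the sizes would come from db.stats() and the fault counter from serverStatus (extra_info.page_faults); the names and numbers are illustrative:

```python
def indexes_fit_in_ram(index_size_bytes, data_size_bytes, ram_bytes):
    """The rule of thumb above: all indexes should fit in RAM,
    and ideally the working data set too."""
    return {
        "indexes_fit": index_size_bytes <= ram_bytes,
        "data_fits": index_size_bytes + data_size_bytes <= ram_bytes,
    }

def page_faults_per_second(prev_faults, curr_faults, interval_s):
    """Fault rate between two samples of the page faults counter; a
    rising rate suggests the working set no longer fits in memory."""
    return (curr_faults - prev_faults) / float(interval_s)
```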


    Connections

    Every connection to MongoDB has an overhead which contributes to the memory required by the system. The number of connections is initially limited by the Unix ulimit settings, but will eventually be limited by the server resources, particularly memory.

    High numbers of connections can also indicate problems elsewhere e.g. requests backing up due to high lock % or a problem with your application code opening too many connections.
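    A back-of-the-envelope estimate of that connection overhead, assuming each connection's roughly 1 MB thread stack dominates its cost (the 1 MB figure is an assumption; check your build's settings):

```python
def connection_memory_mb(open_connections, stack_kb_per_conn=1024):
    """Rough memory consumed by client connections, assuming the
    per-connection thread stack (default assumed 1 MB) dominates."""
    return open_connections * stack_kb_per_conn / 1024.0
```

    At an assumed 1 MB each, 5,000 open connections would consume around 5 GB of RAM before the working set is even considered.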

    Shard chunk distribution

    MongoDB will try to balance chunks equally across all your shards, but this can start to lag behind if there are constraints on the system, e.g. a high lock % slowing down moveChunk operations. You should regularly keep an eye on how balanced the cluster is.
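    A rough balance check compares the most and least loaded shards against a tolerance. The threshold of 8 chunks below is an assumption, loosely mirroring the balancer's own migration threshold for larger collections:

```python
def is_balanced(chunks_per_shard, threshold=8):
    """True if the spread between the most and least loaded shards
    (by chunk count) is within the tolerance."""
    counts = list(chunks_per_shard.values())
    return max(counts) - min(counts) <= threshold
```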

    We have released a free tool to help with this. It can be run standalone, programmatically or as part of a plugin for Server Density.

    Tools to monitor MongoDB

    Now you know the things to keep an eye on, you need to know how to actually collect those monitoring statistics!

    Monitoring MongoDB in real time

    MongoDB includes a number of tools out of the box. These are all run against a live MongoDB server and report stats in real time:

    • mongostat – this shows key metrics like opcounts, lock %, memory usage and replica set status updating every second. It is useful for real time troubleshooting because you can see what is going on right now.
    • mongotop – whereas mongostat shows global server metrics, mongotop looks at the metrics on a collection level, specifically in relation to reads and writes. This helps to show where the most activity is.
    • rs.status() – this shows the status of the replica set from the viewpoint of the member you execute the command on. It’s useful to see the state of members and their oplog lag.
    • sh.status() – this shows the status of your sharded cluster, in particular the number of chunks per shard so you can see if things are balanced or not.

    MongoDB monitoring, graphs and alerts

    Although the above tools are useful for real time monitoring, you also need to keep track of statistics over time and get notified when metrics hit certain thresholds – some critical, some non-critical. This is where a monitoring tool such as Server Density comes in. We can collect all these statistics for you, allow you to configure alerts and dashboards and graph the data over time, all with minimal effort.

    MongoDB graphs

    If you already run your own on-premise monitoring using something like Nagios or Munin, there are a range of plugins for those systems too.

    MongoDB themselves provide free monitoring as part of the MongoDB Management Service. This collects all the above statistics with alerting and graphing, similar to Server Density but without all the other system, availability and application monitoring.

    Monitor MongoDB Slides

    Monitor MongoDB Video

  2. What’s in your on call playbook?


    Back in February we started centralising and revamping all our ops documentation. I played around with several different tools and ended up picking Google Docs to store all the various pieces of information about Server Density, our server monitoring application.

    We make use of Puppet to manage all of our infrastructure and this acts as much of our documentation – what is installed, configuration, management of servers, dealing with failover and deploys – but there is still a need for other written docs. The most important is the incident response guide, the step-by-step checklist all our on-call team runs through when an alert gets triggered.

    iPhone Server Monitoring Alert

    Why do you need an incident response guide?

    As your team grows, you can’t just rely on one or two people knowing everything about how to deal with incidents in an ad-hoc manner. Systems will become more complex and you’ll want to distribute responsibilities around team members, so not everyone will have the same knowledge. During an incident, it’s important that the right things get done in the right order. There are several things to remember:

    • Log everything you do. This is important so that other responders can get up to speed and know what has been done, but is also important to review after the incident is resolved so you can make improvements as part of the postmortem.
    • Know how to communicate internally and with end-users. You want to make sure you are as efficient as possible as a team, but also keep your end-users up to date so they know what is happening.
    • Know how to contact other team members. If the first responder needs help, you need a quick way to raise other team members.

    All this is difficult to remember during the stress of an incident so what you need is an incident response guide. This is a short document that has clear steps that are always followed when an alert is triggered.

    Google Docs Incident Handling

    What should you have in your incident response guide?

    Our incident response guide contains 6 steps which I’ve detailed below, expanded upon to give some insight into the reasoning. In the actual document, they are very short because you don’t want to have complex instructions to follow!

    1. Log the incident in JIRA. We use JIRA for project management and so it makes sense to log all incidents there. We open the incident ticket as soon as the responder receives the alert and it contains the basic details from the alert. All further steps taken in diagnosing and fixing the problem are logged as comments. This allows us to refer to the incident by a unique ID, it allows other team members to track what is happening and it means we can link the incident to followup bug tasks or improvements as part of the postmortem.
    2. Acknowledge the alert in PagerDuty. We don’t acknowledge alerts until the incident is logged because we link the acknowledgment with the incident. This helps other team members know that the issue is being investigated rather than someone has accidentally acknowledged the alert and forgotten about it.
    3. Log into the Ops War Room in Hipchat. We use Hipchat for real time team communication and have a separate “war room” which is used only for discussing ongoing incidents. We use sterile cockpit rules to prevent noise and also pipe in alerts into that room. This allows us to see what is happening, sorted by timestamp. Often we will switch to using a phone call (usually via Skype because Google Hangouts still uses far too much CPU!) if we need to discuss something or coordinate certain actions, because speaking is faster than typing. Even so, we will still log the details in the relevant JIRA incident ticket.
    4. Search the incident response Google Docs folder and check known issues. We have a list of known issues e.g. debug branches deployed or known problems waiting fixes which sometimes result in on-call alerts. Most of the time though it is something unusual and we have documentation on all possible alert types so you can easily search by error string and find the right document, and the steps for debugging. Where possible we try to avoid triggering on-call alerts to real people where a problem can be fixed using an automated script, so usually these steps are debug steps to help track down where the problem is.
    5. If the issue is affecting end-users, do a post to our status site. Due to the design of our systems, we very rarely have incidents which affect the use of our product. However, where there is a problem which causes customer impact, we post to our public status page. We try to provide as much detail as possible and post updates as soon as we know more, or at the very least every 30 minutes even if there is nothing new to report. It seems counter-intuitive that publicising your problems would be a good thing, but customers generally respond well to frequent updates so they know when problems are happening. This is no excuse for problems happening too frequently, but when they do happen, customers want to know.
    6. Escalate the issue if you can’t figure it out. If the responder can’t solve the issue then we prefer they bring in help sooner rather than prolong the outage. This is either by escalating the alert to the secondary on-call in PagerDuty or by calling other team members directly.

    Replying to customer emails

    Another note we have is regarding support tickets that come in reporting the issue. Inevitably some customers are not aware of your public status page and they’ll report any problems directly to you. We use Zendesk to set the first ticket as a “Problem” and direct the customer to our status page. Any further tickets can be set as “Incidents” of that “Problem” so when we solve the issue, we can do a mass reply to all linked tickets. Even though they can get the same info from the status page, it’s good practice to email customers too.

    What do you have in your playbook?

    Every company handles incidents differently. We’ve built this process up over years of experience, learning how others do things and understanding our own feelings when services we use have outages. You can do a lot to prevent outages but you can never eliminate them, so you need to spend just as much time planning the process for handling them. What do you have in your incident response processes? Leave a comment!

  3. Cloud location matters – latency, privacy, redundancy


    This article was originally published on GigaOm.

    Now that we’re seeing intense competition in the cloud infrastructure market, each of the vendors is looking for as many ways to differentiate itself as possible. Big wallets are required to build the infrastructure and picking the right locations to deploy that capital is becoming an important choice. Cloud vendors can be innovative on a product or technical level, but location is just as important — which geographies does your cloud vendor have data centers in and why does that matter?

    Why is location important?

    There are a number of reasons why a diverse range of locations is important:

    • Redundancy: Compared to the chances of a server failure, whole data center outages are rare — but they can happen. In the case of power outages, software bugs or extreme weather, it’s important to be able to distribute your workloads across multiple, independent facilities. This is not just to get redundancy across data centers but also across geographies so you can avoid local issues like bad weather or electrical faults. You need data centers close enough to minimize latency but far enough to be separated by geography.
    • Data protection: Different types of data have different locality requirements e.g. requiring personal data to remain within the EU.
    • User latency: response times for the end user are very important in certain applications, so having data centers close to your users is important, and the ability to send traffic to different regions helps simplify this. CDNs can be used for some content but connectivity is often required to the source too.

    Deploying data centers around the world is not cheap, and this is the area where the big cloud providers have an advantage. It is not just a case of equipping and staffing data centers — much of the innovation is coming from how efficient those facilities are. Whether that means using the local geography to make data centers green, or building your own power systems, this all contributes to driving down prices, which can only truly be done at scale.

    How do the top providers perform?

    The different providers all have the concept of regions, i.e. data centers within a specific geography. Usually these regions are split into multiple zones so you can get redundancy within the region, but this is not sufficient for true redundancy because the whole region could fail, or there could be a local event like a storm. Therefore, counting true geographies is important:

    Cloud provider locations

    Azure is in the lead with 12 regions followed by Softlayer (10), Amazon (8) and Rackspace (6). Google loses out, with only 3 regions.

    Where is the investment going?

    It’s somewhat surprising that Amazon has gone for so long with only a single region in Europe — although this may be about to change with evidence of a new region based in Germany. If you want redundancy then you really need at least 2 data centers nearby, otherwise latency will pose a problem. For example, replicating a production database between data centers will experience higher latency if you have to send data across the ocean (from the U.S. to Ireland, say). It’s much better to replicate between Ireland and Germany!
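    The physics backs this up. Light in fibre travels at roughly two-thirds of its speed in a vacuum (about 200,000 km/s), so even a perfect route has a minimum round-trip time, and real routes are slower. The distances used below are rough estimates:

```python
def min_rtt_ms(distance_km, fibre_km_per_s=200000.0):
    """Theoretical minimum round-trip time over fibre, in milliseconds."""
    return 2 * distance_km / fibre_km_per_s * 1000
```

    Roughly 5,000 km from the US east coast to Ireland gives a floor of about 50 ms RTT, versus around 15 ms for the ~1,500 km between Ireland and Germany.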

    AWS Map

    Softlayer is also pushing into other regions with the $1.2 billion investment it announced for new data centers in 2014. Recently it launched Hong Kong and London data centers, with more planned in North America (2), Europe (2), Brazil, UAE, India, China, Japan and Australia (2).

    Softlayer network map

    The major disappointment is Google. It’s spending a lot of money on infrastructure and actually has many more data centers worldwide than are part of Google Cloud – in the USA (6), Europe (3) and Asia (2) – which would place it second behind Microsoft. Of course, Google is a fairly new entrant into the cloud market and most of its demand is going to be from products like search and Gmail, where consumer requirements will dominate. Given the speed at which it’s launching new features, I expect this to change soon if it’s really serious about competing with the others.

    Google data center locations

    What about China?

    I have specifically excluded China from the figures above but it is still an interesting case. The problem is that while connectivity inside China is very good (in some regions), crossing the border can add significant latency and packet loss. Microsoft and Amazon both have regions within China, but they require a separate account and you usually have to be based in China to apply. Softlayer has announced a data center in Shanghai, so it will be interesting to see whether it can connect it to its global private network with good throughput. As for Google, it publicly left China 4 years ago so it may never launch a region there.

    It’s clear that location is going to be a competitive advantage, one where Microsoft currently holds first place but will lose it to Softlayer soon. Given the amount of money being invested, it will be interesting to see where cloud availability expands to next.

  4. How to monitor Nginx

    Update: We hosted a live Hangout on Air with Rick Nelson the Technical Solutions architect from NGINX, in which we dug deeper into some of the issues discussed in this blog post. We’ve made the slides and video available, which can be found embedded at the bottom of this blog post.

    Nginx is a popular web server which is often used as a load balancer because of its performance. It is used extensively at Server Density to power our public facing UI and APIs, and also for its support for WebSockets. As such, monitoring Nginx is important because it is often the critical component between your users and your service.

    Monitor Nginx from the command line

    Monitoring Nginx in real time has advantages when you are trying to debug live activity or monitor what traffic is being handled in real time. These methods make use of the Nginx logging to parse and display activity as it happens.

    Enable Nginx access logging

    For monitoring the real time Nginx traffic, you first need to enable access logging by editing your Nginx config file and adding the access_log directive. As a basic example:

    server {
        access_log /var/log/nginx/access_log combined;
        # ... the rest of the server block
    }

    Then restart Nginx and tail the log as requests hit the server to see them in real time:

    tail -f /var/log/nginx/access_log

    Using ngxtop to parse the Nginx access log

    Whilst tailing the access log directly is useful for checking a small number of requests, it quickly becomes unusable if you have a lot of traffic. Instead, you can use a tool like ngxtop to parse the log file for you, displaying useful monitoring stats on the console.

    $ ngxtop
    running for 411 seconds, 64332 records processed: 156.60 req/sec
    |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    |   64332 |         2775.251 | 61262 |  2994 |    71 |     5 |
    | request_path                             |   count |   avg_bytes_sent |   2xx |   3xx |   4xx |   5xx |
    | /abc/xyz/xxxx                            |   20946 |          434.693 | 20935 |     0 |    11 |     0 |
    | /xxxxx.json                              |    5633 |         1483.723 |  5633 |     0 |     0 |     0 |
    | /xxxxx/xxx/xxxxxxxxxxxxx                 |    3629 |         6835.499 |  3626 |     0 |     3 |     0 |
    | /xxxxx/xxx/xxxxxxxx                      |    3627 |        15971.885 |  3623 |     0 |     4 |     0 |
    | /xxxxx/xxx/xxxxxxx                       |    3624 |         7830.236 |  3621 |     0 |     3 |     0 |
    | /static/js/minified/utils.min.js         |    3031 |         1781.155 |  2104 |   927 |     0 |     0 |
    | /static/js/minified/xxxxxxx.min.v1.js    |    2889 |         2210.235 |  2068 |   821 |     0 |     0 |
    | /static/tracking/js/xxxxxxxx.js          |    2594 |         1325.681 |  1927 |   667 |     0 |     0 |
    | /xxxxx/xxx.html                          |    2521 |          573.597 |  2520 |     0 |     1 |     0 |
    | /xxxxx/xxxx.json                         |    1840 |          800.542 |  1839 |     0 |     1 |     0 |

    For longer-running monitoring of the logs, Luameter is a better tool with improved performance.

    Nginx monitoring and alerting – Nginx stats

    The above tools are useful for monitoring manually but aren’t useful if you want to automatically collect Nginx monitoring statistics and configure alerts on them. Nginx alerting is useful for ensuring your web server availability and performance remains high.

    The basic Nginx monitoring stats are provided by HttpStubStatusModule – metrics include requests per second and number of connections, along with stats for how requests are being handled.

    Server Density supports parsing the output of this module to automatically graph and trigger alerts on the values, so we have a guide to configuring HttpStubStatusModule too. Using this module you can keep an eye on the number of connections to your server, and the requests per second throughput. What values these “should” be will depend on your application and hardware.
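    If you want to consume the module's output in your own scripts, it is straightforward to parse. This sketch follows the documented plain-text format of the stub_status page:

```python
import re

def parse_stub_status(text):
    """Turn the stub_status plain-text page into a dict of counters."""
    lines = text.strip().splitlines()
    # Line 0: "Active connections: N"
    stats = {"active_connections": int(re.search(r"\d+", lines[0]).group())}
    # Line 2: three cumulative counters under "server accepts handled requests"
    accepts, handled, requests = (int(n) for n in lines[2].split())
    stats.update(accepts=accepts, handled=handled, requests=requests)
    # Line 3: "Reading: N Writing: N Waiting: N"
    reading, writing, waiting = (int(n) for n in re.findall(r"\d+", lines[3]))
    stats.update(reading=reading, writing=writing, waiting=waiting)
    return stats
```

    Sampling the cumulative requests counter twice and dividing the delta by the interval gives you requests per second.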

    nginx monitoring alerts

    A good way to approach configuring Nginx alerts is to understand what kind of baseline traffic your application experiences and set alerts around this e.g. alert if the stats are significantly higher (indicating a sudden traffic spike) and if the values are suddenly significantly lower (indicating a problem preventing traffic somewhere). You could also benchmark your server to find out at what traffic level things start to slow down and the server becomes too overloaded – this will then act as a good upper limit which you can trigger alerts at too.
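    That baseline approach can be sketched in a few lines; the factor of 2 here is an arbitrary starting point to tune against your own traffic:

```python
def traffic_anomaly(current_rps, baseline_rps, factor=2.0):
    """Flag traffic significantly above or below an established baseline."""
    if current_rps > baseline_rps * factor:
        return "spike"   # sudden traffic surge
    if current_rps < baseline_rps / factor:
        return "drop"    # something is preventing traffic getting through
    return None
```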

    Nginx monitoring and alerting – server stats

    Monitoring Nginx stats like requests per second and number of connections is useful to keep an eye on Nginx itself, but its performance will also be affected by how overloaded the server is. Ideally you will be running Nginx on its own dedicated instance so you don’t need to worry about contention with other applications.

    Web servers are generally limited by CPU, so your hardware spec should offer the web server as many CPUs and/or cores as possible. As you get more traffic, you will likely see CPU usage increase.

    CPU % usage itself is not necessarily a useful metric to alert on because the values tend to be per CPU or per core. It’s more useful to set up monitoring on average CPU utilisation across all CPUs or cores. Using a tool such as Server Density, you can visualise this and configure alerts so you can be notified when the CPU is overloaded – our guide to understanding these metrics and configuring CPU alerts will help.

    On Linux this average across all CPUs is abstracted out to another system metric called load average. It is a decimal number rather than a percentage and allows you to understand load from the perspective of the operating system i.e. how long processes are waiting for access to the CPU. The recommended threshold for load average therefore depends on how many CPUs and cores you have – our guide to load average will help you understand this further.
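    The normalisation itself is simple. This sketch uses Python's standard library (os.getloadavg and os.cpu_count) to express load per core:

```python
import os

def normalized_load(load_1min=None, cores=None):
    """Load average divided by core count: values approaching or above
    1.0 mean processes are queuing for CPU time."""
    if load_1min is None:
        load_1min = os.getloadavg()[0]  # 1-minute load average (Unix only)
    if cores is None:
        cores = os.cpu_count() or 1
    return load_1min / cores
```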

    Monitoring Nginx and load balancers with Nginx Plus

    If you purchase a commercial version of Nginx then you get access to more advanced monitoring (and other features) without having to recompile Nginx with the HttpStubStatusModule enabled.

    Nginx Plus includes monitoring stats for connections, requests, load balancer counts, upstream metrics, the status of different load balancer upstreams and a range of other metrics. A live example of what this looks like is provided by Nginx themselves. It also includes a JSON Nginx monitoring API which would be useful for pulling the data out into your own tools.

    Monitoring nginx in real time

    Monitoring the remote status of Nginx

    All of the above metrics monitor the internal status of Nginx and the servers it is running on, but it is also important to monitor the experience your users are getting. This is achieved by using external status and response time tools – you want to know whether your Nginx instance is serving traffic from different locations around the world (wherever your customers are) and what kind of response time performance they see.

    This is easy to do with a service like Server Density because of our in-built website monitoring. You can check the status of your public URLs and other endpoints from custom locations and get alerts when performance drops or there is an outage.

    This is particularly useful when you can build graphs to correlate the Nginx and server metrics with remote response time, especially if you are benchmarking your servers and want to know when a certain load average starts to affect end user performance.

    Monitor Nginx Slides

    Monitor Nginx Video

  5. Sysadmin Sunday 189

  6. How we use Puppet – infrastructure, config, failover, deploys


    We’ve been using Puppet to manage the infrastructure behind Server Density for several years now. It helps us in a number of different ways and although we use it as standard config management, that’s actually only about 25% of our use case.

    We have 4 main uses for Puppet – infrastructure, config management, failover and deploys – each of which I’ll go through here.

    How we use Puppet – Infrastructure

    We first started using Puppet when we moved our environment to Softlayer, where we have a mixture of bare metal servers and public cloud instances, totalling around 75-100. When this was set up, we ordered the servers from Softlayer then manually installed Puppet before applying our manifests to get things configured.

    Although we recently evaluated moving to running our own environment in colo data centres, we have made the decision to switch our environment from Softlayer to Google Cloud. My general view remains that colo is significantly cheaper in the long run but there are some initial capital expenses which we don’t want to spend. We also want to make use of some of the Google products like BigQuery.

    I’ll be writing about this in more detail as we complete the move and am also giving two talks – one in the Bay Area on the 19th June and the other in London on the 2nd of July – about this.

    Using Google Cloud (specifically, Google Compute Engine), or indeed any one of the other major cloud providers, means we can make use of Puppet modules to define the resources within our code. Instead of having to manually order them through the control panels, we can define them in the Puppet manifests alongside the configuration. We’re using the gce_compute module but there are also modules for Amazon and others.

    For example, defining an instance plus a 200GB volume:

    gce_instance { 'mms-app1':
      ensure       => present,
      machine_type => 'n1-highmem-2',
      zone         => 'us-central1-a',
      network      => 'private',
      tags         => ['mms-app', 'mongodb'],
      image        => 'projects/debian-cloud/global/images/backports-debian-7-wheezy-v20140605',
    }

    gce_disk { 'mms-app1-var-lib-mongodb':
      ensure      => present,
      description => 'mms-app1:/var/lib/mongodb',
      size_gb     => '200',
      zone        => 'us-central1-a',
    }

    The key here is that we can define instances in code, next to the relevant configuration for what’s running on them, then let Puppet deal with creating them.

    How we use Puppet – Config management

    This is the original use case for Puppet – defining everything we have installed on our servers in a single location. It makes it easy to deploy new servers and keep everything consistent.

    It also means any unusual changes, fixes or tweaks are fully version controlled and documented so we don’t lose things over time e.g. we have a range of fixes for MongoDB to work around issues and make optimisations which have been built up over time and through support requests, all of which are documented in Puppet.

    Puppet in Github

    We use the standard module layout as recommended by Puppet Labs, contained within a Github repo and checked with puppet lint before commit so we have a nicely formatted, well structured library describing our setup. Changes go through our usual code review process and get deployed with the Puppet Master picking up the changes and rolling them out.

    Previously, we wrote our own custom modules to describe everything but more recently where possible we use modules from the Puppet Forge. This is because they often support far more options and are more standardised than our own custom modules. For example, the MongoDB module allows us to install the server and client, set options and even configure replica sets.

      include site::mongodb_org

      class { '::mongodb::server':
        ensure  => present,
        bind_ip => '',
        replset => 'mms-app',
      }

      mount { '/var/lib/mongodb':
        ensure  => mounted,
        atboot  => true,
        device  => '/dev/sdb',
        fstype  => 'ext4',
        options => 'defaults,noatime',
        require => Class['::mongodb::server'],
      }

      mongodb_replset { 'mms-app':
        ensure  => present,
        members => ['mms-app1:27017', 'mms-app2:27017', 'mms-app3:27017'],
      }

    We pin specific versions of packages to ensure the same version always gets installed and we can control upgrades. This is particularly important to avoid sudden upgrades of critical packages, like databases!

    The Server Density monitoring agent is also available as a Puppet Forge module to automatically install the agent, register it and even define your alerts.

    All combined, this means we have our MongoDB backups running on Google Compute Engine, deployed using Puppet and monitored with Server Density.


    How we use Puppet – Failover

    We use Nginx as a load balancer and use Puppet variables to list the members of the proxy pool. This is deployed using a Puppet Forge nginx module we contributed some improvements to.

    When we need to remove nodes from the load balancer rotation, we can do this using the Puppet web UI as a manual process, or by using the console rake API. The UI makes it easy to apply the changes so a human can do it with minimal chance of error. The API allows us to automate failover in particular conditions, such as if one of the nodes fails.

    How we use Puppet – Deploys

    This is a more unusual way of using Puppet but has allowed us to concentrate on building a small portion of the deployment mechanism, taking advantage of the Puppet agent which runs on all our servers already. It saves us having to use custom SSH commands or writing our own agent, and allows us to customise the deploy workflow to suit our requirements.

    It works like this:

    1. Code is committed in Github into master (usually through merging a pull request, which is how we do our code reviews).
    2. A new build is triggered by Buildbot which runs our tests, then creates the build artefacts – the stripped down code that is actually copied to the production servers.
    3. Someone presses the deploy button in our internal control panel, choosing which servers to deploy to (branches can also be deployed), and the internal version number is updated to reflect what should be deployed.
    4. /opt/puppet/bin/mco puppetd runonce -I is triggered on the selected hosts and the Puppet run notices that the deployed version is different from the requested version.

    5. The new build is copied onto the servers.
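    The version check in step 4 can be sketched as a simple comparison. This is an illustrative Python sketch, not our actual Puppet manifests; the function name and version strings are invented for the example:

    ```python
    def needs_deploy(deployed_version, requested_version):
        """Decide whether a Puppet run should trigger a deploy.

        Mirrors the check described above: the run only copies a new
        build onto the server when the deployed version differs from
        the version requested via the control panel.
        """
        return deployed_version != requested_version

    needs_deploy("v142", "v143")  # versions differ, so deploy
    needs_deploy("v143", "v143")  # already up to date, do nothing
    ```

    Because each Puppet run makes the same comparison independently, re-running the agent is safe: a host that already has the requested version is simply left alone.
    
    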


    Status messages are posted into Hipchat throughout the process and any one of our engineers can deploy code at any time, although we have a general rule not to deploy non-critical changes after 5pm weekdays and after 3pm on Fridays.

    There are some disadvantages to using Puppet for this. Firstly, the Puppet agent can be quite slow on low spec hardware. Our remote monitoring nodes around the world are generally low power nodes so the agent runs very slowly. It’s also eventually consistent because deploys won’t necessarily happen at the same time, so you need to account for that in new code you deploy.

    Puppet is most of your documentation

    These four use cases mean that a lot of how our infrastructure is set up and used is contained within text files. This has several advantages:

    • It’s version controlled – everyone can see changes and they are part of our normal review process.
    • Everyone can see it – if you want to know how something works, you can read through the manifests and understand more, quickly.
    • Everything is consistent – it’s a single source of truth, one place where everything is defined.

    It’s not all of our documentation, but it certainly makes up a large proportion, because it’s actually being used, live. And everyone knows how much people hate keeping docs up to date!

    Puppet is the source of truth

    It knows all our servers, where they are, their health status, how they are configured and what is installed.

  7. How to keep on top of security updates


    Every computer user needs to stay on top of updates for their apps, browsers and OSs. As a consumer, it’s easy: the best browsers, like Chrome, auto-update in the background, and over-the-air updates on iOS make sure most people get updates quickly and easily. These kinds of mass-market updates tend to be well tested, so you can update without fear of bricking.

    The life of a sysadmin is a little more complex though. Although the mainstream OSs provide mature package management features, there are often a lot of frequent updates across the OS: kernel updates which require reboots, application updates (e.g. databases) which need testing, and library releases which require recompiling or updating codebases.

    I’ve written before about our canary concept for rolling out updates because we deploy them manually rather than letting the OS auto update whenever it likes, but how do you discover new releases in the first place?

    OS X Update

    Critical security update notifications

    The first step is to separate general OS/library/app updates from critical security releases. Like the recent Heartbleed bug, these need to be deployed as quickly as possible.

    The first port of call is the OS security announcements. We use Ubuntu and so have the choice of subscribing to the Ubuntu Security Announcements list, or via RSS. In these announcements I look out for keywords like “remote” or “denial of service”, because these mean there’s an external risk, in contrast to an exploit which requires a local user. The latter is more difficult to exploit because it first requires access to our servers, which is restricted to our ops team.
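    This keyword triage is easy to automate. A minimal sketch in Python, where the announcement subjects are made-up examples rather than real USN titles:

    ```python
    # Flag security announcements that suggest a remotely exploitable
    # issue, using the keyword heuristic described above.
    KEYWORDS = ("remote", "denial of service")

    def is_critical(subject):
        """Return True if the announcement subject mentions an external risk."""
        subject = subject.lower()
        return any(keyword in subject for keyword in KEYWORDS)

    announcements = [
        "USN-0001-1: openssl vulnerability allows remote code execution",
        "USN-0002-1: local privilege escalation in example-pkg",
        "USN-0003-1: kernel denial of service",
    ]

    # Only the first and third subjects trip the filter; the local
    # privilege escalation still matters, just less urgently.
    critical = [s for s in announcements if is_critical(s)]
    ```

    Hooked up to the list’s RSS feed, a filter like this turns a high-volume mailing list into a short queue of announcements that actually need same-day attention.
    
    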

    Debian has a similar mailing list, as do other Linux distros like Red Hat. Windows has a built-in update mechanism, which makes things easier than tracking multiple Linux distros.

    Ubuntu updates

    Updating software packages

    Most of the software products and libraries we use are installed through system packages using apt. This is our preferred installation method because it makes it easy to manage via Puppet, is standardised across all our servers, offers signing of packages and, importantly, gives us a single place to apply upgrades. This is one of the reasons why our monitoring agent’s recommended installation method is via our OS packages.

    A lot of packages are provided as part of the OS release from Ubuntu package maintainers, so they get rolled into the security notifications.

    The disadvantage of this approach is that you often don’t get the very latest version through the OS vendor packages. To work around this, vendors will usually provide their own official repositories (e.g. our MongoDB installations are through the MongoDB, Inc repos).

    apt-get upgrade

    Everything else is manual

    A few things are installed manually, such as PHP pecl packages for some of our legacy apps. We can update them using pecl but have to keep an eye on the release mailing lists for those specific packages.

    And where we run “on-premise” versions of products we also subscribe to the relevant mailing lists. Our policy is to prefer SaaS where possible but some things aren’t available like this – we run our own Puppet Enterprise server (with an announce mailing list) and have the MongoDB Backup Service running within our infrastructure.

    We also keep an eye on pages like the MongoDB Alerts and mailing list.


    What about general security mailing lists?

    You could subscribe to mailing lists like BUGTRAQ or the CVE lists but these are really high volume and aren’t specific enough to our environment. We can get what we need from the vendors, mostly from Ubuntu.

    There are also commercial products like Ubuntu Landscape, which we used for a while, but it was too expensive to maintain a support contract for so many servers, with limited value. As long as we can stay up to date with the most important security releases, we can be confident our infrastructure is secure with regard to software updates.

  8. What’s new in Server Density – Apr 2014

    Comments Off on What’s new in Server Density – Apr 2014

    Each month we’ll round up all the feature changes and improvements we have made to our server and website monitoring product, Server Density.

    Elastic dashboard graphs

    Static dashboard graphs are good if your cluster never changes, but if you have an elastic environment where new instances appear and disappear, the new elastic graph widget will add and remove items as they come and go in your account.

    You can specify a search term for your inventory of devices and service checks, which can be a regular expression with wildcards. Once you create your graph, it will be automatically updated based on this search term so if new servers appear that match, they will automatically be added to the graph. This is useful if you set a prefix for your cluster and always name new servers based on that prefix. Read more.
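    The matching behaviour can be sketched with Python’s `re` module. This is an illustration of the wildcard idea, not Server Density’s actual matching code; the server names and the `matches` helper are invented for the example:

    ```python
    import re

    def matches(search_term, names):
        """Return the inventory names matched by a wildcard search term.

        A shell-style "*" wildcard is converted to its regex
        equivalent and anchored, so "web-*" matches "web-1" but
        not "db-1".
        """
        pattern = re.compile(search_term.replace("*", ".*") + "$")
        return [name for name in names if pattern.match(name)]

    inventory = ["web-1", "web-2", "db-1"]
    matches("web-*", inventory)   # the two web servers
    inventory.append("web-3")     # a new instance appears...
    matches("web-*", inventory)   # ...and is picked up automatically
    ```

    Re-running the search after the inventory changes is the whole trick: as long as new servers follow the naming prefix, the graph stays current without anyone editing it.
    
    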


    Android app

    Following our iPhone app, last week we released our Android app to allow you to manage alerts and get push notifications directly to your Android devices. Find out more here.

    Android Server Monitoring App Use Case

    HipChat notifications

    We use HipChat pretty extensively at Server Density and we’ve now released native integration so you can have your alerts posted directly into your HipChat rooms. Read how to set this up.


    Slack notifications

    In addition to HipChat, you can also have notifications sent directly to Slack. Read how to set this up.

    Ansible playbooks

    Several customers have released their Ansible playbooks to allow you to install the agent and manage your monitoring using Ansible.

    Server Density 日本



    (We have launched a Japanese language version of this blog!)

  9. Network performance at AWS, Google, Rackspace and Softlayer

    Comments Off on Network performance at AWS, Google, Rackspace and Softlayer

    This post was originally published on GigaOm.

    Compute and storage are essentially commodity services, which means that for cloud providers to compete, they have to show real differentiation. This is often achieved with supporting services like Amazon’s DynamoDB and Route 53, or Google’s BigQuery and Prediction API, which complement the core infrastructure offerings.

    Performance is also often singled out as a differentiator. Often one of the things that bites production usage, especially in inherently shared cloud environments, is the so-called “noisy neighbor” problem. This can be other guests stealing CPU time, increased network traffic and, particularly problematic for databases, i/o wait.

    In this post I’m going to focus on networking performance. This is very important for any serious application because it affects the ability to communicate and replicate data across instances, zones and regions. Responsive applications and disaster recovery, areas where up-to-date database replication is critical, require good, consistent performance.

    It’s been suggested that Google has a massive advantage when it comes to networking, due to all the dark fibre it has purchased. Amazon has some enhanced networking options that take advantage of special instance types with OS customizations, and Rackspace’s new Performance instance types also boast up to 10 Gbps networking. So let’s test this.


    I spun up the listed instances to test the networking performance between them. This was done using the iperf tool on Linux. One server acts as the client and the other as the server:

    Server: iperf -f m -s
    Client: iperf -f m -c hostname

    The OS was Ubuntu 12.04 (with all latest updates and kernel), except on Google Compute Engine, where it’s not available. There, I used the Debian Backports image.

    The client was run for three tests for each type – within zone, between zones and between regions – with the mean average taken as the value reported.
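    The averaging is simple arithmetic. As a sketch, using the three f1-micro same-zone figures quoted in the Google section below:

    ```python
    # Each reported result is the mean of three iperf runs, e.g. the
    # three us-central-1a f1-micro figures mentioned later in the post.
    runs = [991, 855, 232]  # Mbit/s
    mean = sum(runs) / len(runs)  # about 693 Mbit/s
    ```

    The wide spread across those three runs is exactly why the mean alone can hide the variability discussed below.
    
    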

    Amazon networking performance

    Amazon networking performance

    Amazon’s larger instances, such as the c3.8xlarge tested here, support enhanced 10 Gbit/s networking; however, you must use the Amazon Linux AMI (or manually install the drivers) within a VPC. Because of the additional complexity of setting up a VPC, which isn’t necessary on any other provider, I didn’t test this, although it is now the default for new accounts. Even without that enhancement, the performance is very good, nearing the advertised 10 Gbit/s.

    However, the consistency of the performance wasn’t so good. The speeds changed quite dramatically across the three test runs for all instance types, much more than with any other provider.

    You can use internal IPs within the same zone (free of charge) and across zones (incurs inter-zone transfer fees), but across regions, you have to go over the public internet using the public IPs, which incurs further networking charges.

    Google Compute Engine networking performance

    Google networking performance

    Google doesn’t currently offer an Ubuntu image, so instead I used its backports-debian-7-wheezy-v20140318 image. For the f1-micro instance, I got very inconsistent iperf results for all zone tests. For example, within the same us-central-1a zone, the first run showed 991 Mbits/sec, but the next two showed 855 Mbits/sec and 232 Mbits/sec. Across regions between the US and Europe, the results were much more consistent, as were all the tests for the higher spec n1-highmem-8 server. This suggests the variability was because of the very low spec, shared CPU f1-micro instance type.

    I tested more zones here than on other providers because on April 2, Google announced a new networking infrastructure in us-central-1b and europe-west-1a which would later roll out to other zones. There was about a 1.3x improvement in throughput using this new networking and users should also see lower latency and CPU overhead, which are not tested here.

    Although 16 CPU instances are available, they’re only offered in limited preview with no SLA, so I tested on the fastest generally available instance type. Since networking is often CPU bound, there may be better performance available when Google releases its other instance types.

    Google allows you to use internal IPs globally — within zone, across zone and across regions (i.e., using internal, private transit instead of across the internet). This makes it much easier to deploy across zones and regions, and indeed Google’s Cloud platform was the easiest and quickest to work with in terms of the control panel, speed of spinning up new instances and being able to log in and run the tests in the fastest time.

    Rackspace networking performance

    Rackspace networking performance

    Rackspace does not offer the same kind of zone/region deployments as Amazon or Google so I wasn’t able to run any between-zone tests. Instead I picked the next closest data center. Rackspace offers an optional enhanced virtualization platform called PVHVM. This offers better i/o and networking performance and is available on all instance types, which is what I used for these tests.

    Similar to Amazon, you can use internal IPs within the same location at no extra cost but across regions you need to use the public IPs, which incur data charges.

    When trying to launch two 120 GB Performance 2 servers at Rackspace, I hit our account quota (with no other servers on the account) and had to open a support ticket to request a quota increase, which took about an hour and a half to approve. For some reason, launching servers in the London region also requires a separate account, and logging in and out of multiple control panels soon became annoying.

    Softlayer networking performance

    Softlayer networking performance

    Softlayer only allows you to deploy into multiple data centers at one location: Dallas. All other regions have a single facility. Softlayer also caps out at 1 Gbit/s on its public cloud instances, although its bare metal servers have the option of dual 1 Gbit/s bonded network cards, allowing up to 2 Gbit/s. You choose the port speed when ordering or when upgrading an existing server. They also list 10 Gbit/s networking as available for some bare metal servers.

    Similarly to Google, Softlayer’s maximum instance size is 16 cores, but it also offers private CPU options which give you a dedicated core versus sharing the cores with other users. This allows up to eight private cores, for a higher price.

    The biggest advantage Softlayer has over every other provider is completely free private networking between all regions, whereas all other providers charge for transfer out of zone. With VLAN spanning enabled, you can use the private network across regions, which gives you an entirely private network for your whole account. This makes it very easy to deploy redundant servers across regions and is something we use extensively for replicating MongoDB at Server Density, moving approximately 500 Mbit/s of internal traffic across the US between Softlayer’s Washington and San Jose data centers. Not having to worry about charges is a luxury only available with Softlayer.

    Who is fastest?

    Who is fastest?

    Amazon’s high spec c3.8xlarge really gives the best performance across all tests, particularly within the same zone and region. It was able to push close to the advertised 10 Gbit/s throughput, but the high variability of results may indicate some inconsistency in the real-world performance.

    Yet for very low cost, Google’s low spec f1-micro instance type offers excellent networking performance: ten times faster than the terrible performance from the low spec Rackspace server.

    Softlayer and Rackspace were generally poor performers overall, although Rackspace achieved good inter-zone and inter-region performance and did well on its higher spec instance. Softlayer is the loser overall here, with low performance plus no network-optimized instance types; only its bare metal servers can be upgraded to 10 Gbit/s network interfaces.

    Mbits/s per CPU?

    CPU allocation is also important. Rackspace and Amazon both offer 32 core instances, and we see good performance on those higher spec VMs as a result. Amazon was fastest for its highest spec machine type with Rackspace coming second. The different providers have different instance types, and so it’s difficult to do a direct comparison on the raw throughput figures.

    An alternative ranking method is to calculate how much throughput you get per CPU. We’ll use the high spec inter-zone figures and do a simple division of the throughput by the number of CPUs:

    Mbits per CPU
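    The per-CPU ranking is a straight division of throughput by core count. A sketch of the calculation, where the throughput figures are placeholders for illustration rather than the measured results from the charts:

    ```python
    # Throughput per CPU, as described above: inter-zone Mbit/s divided
    # by core count. Figures are illustrative, not the measured results.
    instances = {
        "aws-c3.8xlarge":   (7000.0, 32),  # (Mbit/s, CPU cores)
        "gce-n1-highmem-8": (3000.0, 8),
        "rackspace-perf2":  (4000.0, 32),
    }

    per_cpu = {name: mbits / cpus for name, (mbits, cpus) in instances.items()}
    ```

    With numbers like these, the 8-core Google instance comes out ahead per CPU even though the 32-core instances win on raw throughput, which is the point the ranking is making.
    
    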

    The best might not be the best value

    If you have no price concerns then Amazon is clearly the fastest, but it’s not necessarily the best value for money. Google gets better Mbit/s per CPU, and since you pay per CPU, that makes it better value. Google also offers the best performance on its lowest spec instance type, although it is quite variable due to the shared CPU. Rackspace was particularly poor when it came to inter-region transfer, and Softlayer isn’t helped by its lack of any kind of network-optimized instance type.

    Throughput isn’t the end of the story though. I didn’t look at latency or CPU overhead and these will have an impact on the real world performance. It’s no good getting great throughput if it requires 100 percent of your CPU time!

    Google and Softlayer both have an advantage when it comes to operational simplicity because their VLAN spanning-like features mean you have a single private network across zones and regions. You can utilize their private networking anywhere.

    Finally, pricing is important, and an oft-forgotten cost is network transfer fees. Transfer is free within zones for all providers, but only Softlayer has no fees for data transfer across zones and even across regions. This is a big saver.

  10. Android Server Monitoring App released

    Comments Off on Android Server Monitoring App released

    After the release of our monitoring app for iPhone two months ago, the same app is now available on Android!

    It allows you to see your notification center on the move so you can quickly see open and closed alerts assigned to you and across your whole account. It includes push notification alerts and custom sounds so you can always be woken up!

    Android Server Monitoring App Use Case

    The app is free for all v2 customers (and users on the trial). Find out more on the product page of our website.