Premium Hosted Website & Server Monitoring Tool.

(Sysadmin / Devops blog)


Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. How we use Hipchat to work remotely


    Server Density started as a remote working company, and it wasn’t until our 3rd year that we opened an office in London. Even now, most of the team still work remotely from around the UK, Portugal and Spain so we use a variety of tools to help us stay connected and work efficiently together.

    One of the most important is Hipchat. We use this as a chat room but it’s also our company news feed. Everything that goes on during the day (and night) gets posted and we use a variety of rooms for different purposes.

    The main room

    Everyone is always logged in here during their working day and it also acts as a real time news feed of what is going on. General chat happens here but it’s mostly a way for everyone to see what’s happening. This is particularly useful if you go away and come back later because you can see what has happened.

    Main Hipchat room

    We use various integrations and the Hipchat API to post in events from things like:

    • Github activity: commits, pull requests, comments.
    • Buildbot: Build activity.
    • Deploys: From our own custom build system, we can see when deploys are triggered (and by whom) and then when they get deployed to each server.
    • Signups: New trial signups get posted in as we often like to see who is using the product.
    • Account changes: New purchases and package changes
    • JIRA: All our issue tracking and development work is tracked with JIRA, which posts some activity like new issues and status changes.
    • Zendesk: New support tickets and Twitter mentions so everyone can keep an eye on emerging issues.
    • Alerts: We use some of our competitors to help us maintain an independent monitor of our uptime, and pipe in alerts using the new Server Density HipChat integration and PagerDuty.
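    Most of these integrations boil down to a POST against the HipChat v2 room notification API. As a sketch (the room id and message below are made up, and a real auth token would be needed in the request headers), a deploy event could be posted like this:

```python
import json

HIPCHAT_API = "https://api.hipchat.com/v2"

def build_room_notification(room_id, message, color="green", notify=False):
    """Build the URL and JSON body for a HipChat v2 room notification.

    The room_id here is a placeholder -- substitute your own room and
    send the request with an Authorization: Bearer <token> header.
    """
    url = "%s/room/%s/notification" % (HIPCHAT_API, room_id)
    payload = {
        "message": message,        # plain text or HTML, per message_format
        "message_format": "text",
        "color": color,            # colour-code event types in the room
        "notify": notify,          # True makes clients ping their users
    }
    return url, json.dumps(payload)

# Example: announce a deploy in the main room (room id is hypothetical)
url, body = build_room_notification("main", "Deploy of api-server finished", color="purple")
```

A small wrapper like this is all the custom build system needs to drop its deploy events into the news feed alongside the off-the-shelf integrations.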

    Ops war room

    All incidents from our alerting systems get piped into a war room so we can see what’s going on in real time, in chronological order, without being distracted by other events. The key is maintaining a sterile cockpit rule, so we use this room only to discuss ongoing incidents. This is also useful for other people (e.g. support) to track what’s happening without interrupting the responders.

    Server Density hipchat


    We have a bot that runs in all of our chat rooms. It’s fairly simple and based on Hubot, but allows us to do things like query Google Images or check the status of Github.


    At a glance

    We display our own server monitoring ops dashboard on the office TV using a Chromecast, and I also have an iPad at my desk which constantly shows the Hipchat room, so I can see things as they happen throughout the day.

    Server Density office

  2. Dealing with OpenSSL bug CVE-2014-0160 (Heartbleed)


    Yesterday a serious vulnerability in the OpenSSL library was disclosed in CVE-2014-0160, also dubbed the Heartbleed Vulnerability. Essentially this means you probably need to regenerate the private keys used to create your SSL certificates, and have them reissued by your certificate authority.


    This isn’t a difficult task but does take some time to get OpenSSL updated across all your servers, then go through the process to generate, reissue and install certificates across all locations they are deployed.

    We have completed this process for all of our websites and applications, and for Server Density v2 we use perfect forward secrecy, which should protect against retrospective decryption of previous communications.

    However, in the latest release of our server monitoring iPhone app we enabled certificate pinning, which means that until our latest update is approved by Apple, the app will not log in. You will still receive push notifications for alerts, but attempts to log in to the app will fail. Certificate pinning embeds our SSL certificate within the app to prevent man-in-the-middle attacks – the certificate returned through API calls to our service is verified against the known certificate embedded in the app.
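    As an illustration (this is not our app’s actual code), the pinning check amounts to hashing the certificate presented by the server and comparing it against a digest baked into the app at build time; the certificate bytes below are placeholders:

```python
import hashlib

# Placeholder for the DER-encoded certificate shipped inside the app bundle
BUNDLED_CERT = b"placeholder certificate bytes"
PINNED_SHA256 = hashlib.sha256(BUNDLED_CERT).hexdigest()

def certificate_matches_pin(presented_cert, pinned_digest=PINNED_SHA256):
    """Reject the connection unless the server's certificate hashes to the
    pinned digest -- this is exactly what breaks when the cert is reissued."""
    return hashlib.sha256(presented_cert).hexdigest() == pinned_digest
```

Because the digest is compiled into the binary, reissuing the certificate server-side inevitably fails this check until a new app build ships – hence the trade-off discussed below.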

    We discussed the best way to approach the reissue of certificates this morning and considered holding off to allow us to submit a new build to Apple with pinning disabled, then do a future update with the new certificate. However, we felt that the security vulnerability was severe enough that we should patch it for all our users at the expense of causing a small number of users to be unable to use the iPhone app for a few days.

    We have requested Apple expedite the review process but it still takes at least 24 hours to get a new release out. In the meantime, you should check to see if your OpenSSL version is vulnerable and if so, update!
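    A quick way to triage is to parse the output of `openssl version`: OpenSSL 1.0.1 through 1.0.1f are affected, while 1.0.1g and the 0.9.8/1.0.0 branches are not. A rough sketch (note that distributions often backport the fix without changing the version letter, so also check your package changelog):

```python
import re

def heartbleed_vulnerable(version_string):
    """Rough check against a string like "OpenSSL 1.0.1e 11 Feb 2013".

    Only the 1.0.1 branch up to and including 1.0.1f is affected;
    1.0.1g is the fixed release.
    """
    m = re.search(r"OpenSSL (\d+)\.(\d+)\.(\d+)([a-z]?)", version_string)
    if not m:
        return False  # unknown format -- check manually
    major, minor, patch, letter = m.groups()
    if (major, minor, patch) != ("1", "0", "1"):
        return False
    # Plain 1.0.1, or letters a-f, are affected
    return letter == "" or letter <= "f"
```

Treat a True result as "update now"; treat a False result on a distro package as "verify the changelog mentions CVE-2014-0160".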

  3. MongoDB on Google Compute Engine – tips and benchmarks


    Over the last 4 years running MongoDB in production at Server Density, I’ve been able to work on deployments on dedicated hardware, VMs and across multiple cloud providers.

    The best environment has always been dedicated servers because of the problems with host contention, particularly with CPU and disk i/o, but Google has been quite vocal about the consistency and performance of their Compute Engine product, particularly about how they’ve eliminated the noisy neighbour problem with intelligent throttling.

    So I thought I’d try it out to see how MongoDB performs on Google Compute Engine.

    Google Compute Engine Logo

    Testing the Write Concern – performance vs durability

    The MongoDB Write Concern is historically controversial because Mongo was originally designed to get very high write throughput at the expense of durability but this wasn’t well documented. The default was changed a while back and it now gives you an acknowledgement that a write was accepted, but is still quite flexible to allow you to tune whether you want speed or durability.

    I am going to test a range of write concern options to allow us to see what kind of response times we can expect:

    Unacknowledged: w = 0 AND j = 0

    This is a fire and forget write where we don’t know if the write was successful and detection of things like network errors is uncertain.

    Acknowledged: w = 1 AND j = 0 (the default)

    This will give us an acknowledgment that the write was successfully received but no indication that it was actually written. This picks up most errors e.g. parse errors, network errors etc.

    Journaled: w = 1 AND j = 1

    This will cause the write to wait until it has been both acknowledged and written to the journal of the primary replica set member. This gives you single server durability but doesn’t guarantee the data has been replicated to other members of your cluster. In theory you could have a data center failure and lose the write.

    Replica acknowledged: w = 2 AND j = 0

    The test will give us an idea how long it takes for the write to be acknowledged on the replica set primary and acknowledged by at least 1 other member of the replica set. This gives us some durability across 2 servers but in theory the write could still fail on both because we are not doing a check for the write hitting the journal.

    Replica acknowledged and Journaled: w = 2 AND j = 1

    This ensures that the write has been written to the journal on the primary and has been acknowledged by at least one other replica set member.

    Replica acknowledged with majority: w = majority AND j = 0

    In a multi datacenter environment you want to know that your writes are safely replicated. Using the majority keyword will allow you to be sure that the write has been acknowledged on the majority of your replica set members. If you have the set deployed evenly across data centers then you know that your data is safely in multiple locations.

    Replica acknowledged with majority and journaled: w = majority AND j = 1

    Perhaps the most paranoid mode, we will know that the write was successfully acknowledged by the primary and was replicated to a majority of the nodes.
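    For reference, these combinations map onto driver options like so (a sketch – with pymongo you would pass the equivalent `w` and `j` values on the connection or per operation):

```python
# The write concern combinations tested above, as the keyword arguments a
# driver would accept (e.g. pymongo's WriteConcern(w=..., j=...)).
WRITE_CONCERNS = {
    "unacknowledged":        {"w": 0, "j": False},
    "acknowledged":          {"w": 1, "j": False},  # the default
    "journaled":             {"w": 1, "j": True},
    "replica_acknowledged":  {"w": 2, "j": False},
    "replica_ack_journaled": {"w": 2, "j": True},
    "majority":              {"w": "majority", "j": False},
    "majority_journaled":    {"w": "majority", "j": True},
}

def durability_rank(name):
    """Order the modes roughly from fastest/least durable to most durable."""
    wc = WRITE_CONCERNS[name]
    w = 3 if wc["w"] == "majority" else wc["w"]
    return (w, wc["j"])
```

The ranking makes the trade-off explicit: every step up the list buys durability at the cost of waiting on more acknowledgements.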

    Environment configuration

    Replica sets

    Real world applications use replica sets to give them redundancy and failover capabilities across multiple data centers. To accurately test this, the test environment will involve 4 data nodes across 2 zones: two in the us-central1-a zone and two in the us-central1-b zone.

    In a real deployment you must have a majority to maintain the set in the event of a failure, so we should deploy a 5th node as an arbiter in another data center. I’ve not done this here for simplicity.

    Google Compute Engine

    I tested with the n1-standard-2 (2 vCPUs and 7.5GB RAM) and n1-highmem-8 (8 vCPUs and 52GB RAM) instance types – with the backports-debian-7-wheezy-v20140318 OS image.

    Be aware that the number of CPU cores your instance has also affects the i/o performance. For maximum performance, you need to use the 4 or 8 core instance types even if you don’t need all the memory they provide.

    There is also a bug in the GCE Debian images where the default locale isn’t set. This prevents MongoDB from starting properly from the Debian packages. The workaround is to set a default:

    sudo locale-gen en_US.UTF-8
    sudo dpkg-reconfigure locales

    Google Persistent Disks

    It’s really important to understand the performance characteristics of Google Persistent Disks and how IOPs scale linearly with volume size. Here are the key things to note:

    • At the very least you need to mount your MongoDB dbpath on a separate persistent disk. This is because the default root volume attached to every Compute Engine instance is very small and will therefore have poor performance. It does allow bursting for the OS but this isn’t sufficient for MongoDB which will typically have sustained usage requirements.
    • Use directoryperdb to give each of your databases their own persistent disk volume. This allows you to optimise both performance and cost because you can resize the volumes as your data requirements grow and/or to gain the performance benefits of more IOPs.
    • Putting the journal on a separate volume is possible even without directoryperdb because it is always in its own directory. Even if you don’t put your databases on separate volumes, it is worth separating the journal onto its own persistent disk because the performance improvements are significant – up to 3x (see results below).
    • You may be used to only needing a small volume for the journal because it uses just a few GB of space. However, allocating a small persistent disk volume will mean you get poor performance because the available IOPs increase with volume size. Choose a volume of at least 200GB for the journal.
    • If you split all your databases (or even just the journal) onto different volumes then you will lose the ability to use snapshots for backups. This is because the snapshot across multiple volumes won’t necessarily happen at the same time and will therefore be inconsistent. Instead you will need to shut down the mongod (or fsync lock it) and then take the snapshot across all disks.
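    The linear scaling is easy to reason about with a little arithmetic. The per-GB rate below is a made-up placeholder (check Google’s persistent disk documentation for the real figures for your disk type); the point is that the IOPS cap is proportional to volume size:

```python
def pd_iops_cap(volume_gb, iops_per_gb=0.3):
    """IOPS available to a persistent disk scale linearly with its size.

    iops_per_gb is an illustrative placeholder, not Google's actual rate --
    look up the current per-GB figure for your disk type.
    """
    return volume_gb * iops_per_gb
```

So a 200GB journal volume gets 20x the IOPS of a 10GB one, which is why a tiny journal disk performs so poorly despite only needing a few GB of space.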

    I’ve run the testing several times with different disk configurations so I can see the different performance characteristics:

    1. With no extra disks i.e. dbpath on the default 10GB system volume
    2. With a dedicated 200GB persistent disk for the dbpath
    3. With a dedicated 200GB persistent disk for the dbpath and another dedicated 200GB persistent disk for the journal

    Test methodology

    I wrote a short Python script to insert a static document into a collection. This was executed 1,000 times and repeated 3 times. The Python timeit library was used to run the tests and the fastest time was taken, as per the docs, which note that the mean/standard deviation of the 3 test cycles is not that useful.
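    The harness looked roughly like this (a sketch: the real script inserted via pymongo against the replica set under each write concern; here the insert is stubbed out so the structure is clear):

```python
import timeit

def insert_document():
    # In the real test this was a pymongo insert of a static document,
    # run once per write concern configuration.
    doc = {"metric": "load_average", "value": 0.42}
    return doc

# 1000 executions per cycle, 3 cycles; keep the fastest cycle, as the
# timeit docs advise that the minimum is the most useful number.
timings = timeit.repeat(insert_document, number=1000, repeat=3)
fastest = min(timings)
```

Taking the minimum filters out interference from other processes, which only ever makes a cycle slower, never faster.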

    Results – MongoDB performance on Google Compute Engine


    Test results - n1-standard-2


    Test results - results-n1-highmem-8


    There’s not much difference in performance when you’re using a single persistent disk volume until you start increasing the durability options because acknowledged and unacknowledged writes are just going to memory. When you increase the write concern options then you become limited by the disk performance and it’s clear that splitting the journal onto a separate volume makes a significant difference.

    The increased performance of the larger n1-highmem-8 instance type (8 cores vs the 2 of the n1-standard-2) is also apparent in the figures – although the actual difference is quite small, it would likely help in a real world environment.

    The key takeaway is that performance decreases as durability increases – it is a tradeoff. For maximum performance with durability, you can take advantage of Compute Engine’s higher spec instance types and larger, separate persistent disk volumes per database / journal.

  4. What’s new in Server Density – Mar 2014


    Each month we’ll round up all the feature changes and improvements we have made to our server and website monitoring product, Server Density.

    Server snapshot

    The new snapshot view allows you to browse back in time minute by minute to see the exact state of a server at any given time. It reveals the key metric absolute values, visualising them in an easy to read format. This is great for picking out spikes and unusual events on the graphs.

    You can access this from the Snapshot tab when viewing your servers, or by clicking on the Snapshot link from the popover in the graphs.

    Server snapshot

    Improved graph popovers

    If you have many series on the graph, the new popovers shown when hovering contain the items in a scrollable view. They’re more intelligent about where they show up on the page to avoid your mouse and by clicking the graph, you can pin them in the same position whilst moving the mouse cursor away. This makes it easier to compare across multiple graphs, especially when scrolling.

    Graph popovers

    iPhone background refreshing

    Since the release of our iPhone monitoring app last month, we’ve pushed out several minor updates to fix bugs and have added background refreshing of the alert list so it’s up to date as soon as you load the app.

    v2 Chef recipe

    Thanks to some excellent pull requests from Mal Graty at idio (one of our customers), we now have a v2 of our Chef recipe. On a basic level it will install the agent and register devices through our API but also includes full LWRP support and the ability to define your alerts in Chef so you can manage alert configuration in your code.

    See what’s changed in the pull request and visit Github for the recipe. The version number has been bumped in Berkshelf.

    Opsgenie integration

    If you’re using Opsgenie for your alerting, they now have an official guide for using our webhooks to trigger alerts.

    List search filters in URL

    If you search and/or filter your devices/services list, these will be reflected in the URL to allow you to link to the list and use your browser back button to go back to the list.

    New blog theme

    We’ve redesigned the theme for the blog to make it more minimalist and to work better on mobile devices.

    What’s next?

    For the next month, we’re working on finishing the last few polishes for our Android app, Firefox support and dynamic graphing for the dashboard, plus a few other things! We’ll also be adding process lists to the snapshot, although this won’t be out for a few months.

  5. 5 things you probably don’t know about Google Cloud


    At the beginning of the month I wrote a post on GigaOM following some experimenting with Google Cloud: 5 things you probably don’t know about Google Cloud.

    1. Google Compute Engine Zones are probably in Ireland and Oklahoma
    2. Google’s Compute Zones may be isolated, but they’re surprisingly close together
    3. Scheduled maintenance takes zones offline for up to two weeks
    4. You cannot guarantee where your data will be located
    5. Connectivity across regions isn’t fast

    I think Google is currently in the best position to challenge Amazon because they have the engineering culture and technical abilities to release some really innovative features. IBM has bought into some excellent infrastructure at Softlayer but still has to prove its cloud engineering capabilities.

    Amazon has set the standard for how we expect cloud infrastructure to behave, but Google doesn’t conform to these standards in some surprising ways. So, if you’re looking at Google Cloud, have a read of the article.

  6. The tech behind our time series graphs – 2bn docs per day, 30TB per month


    This post was also published on the MongoDB Blog.

    Server Density processes over 30TB/month of incoming data points from the servers and web checks we monitor for our customers, ranging from simple Linux system load average to website response times from 18 different countries. All of this data goes into MongoDB in real time and is pulled out when customers need to view graphs, update dashboards and generate reports.

    We’ve been using MongoDB in production since mid-2009 and have learned a lot over the years about scaling the database. We run multiple MongoDB clusters but the one storing the historical data does the most throughput and is the one I shall focus on in this article, going through some of the things we’ve done to scale it.

    1. Use dedicated hardware, and SSDs

    All our MongoDB instances run on dedicated servers across two data centers at Softlayer. We’ve had bad experiences with virtualisation because you have no control over the host, and databases need guaranteed performance from disk i/o. When running on shared storage (e.g. a SAN) this is difficult to achieve unless you can get guaranteed throughput from things like AWS’s Provisioned IOPS on EBS (which are backed by SSDs).

    MongoDB doesn’t really have many bottlenecks when it comes to CPU because CPU bound operations are rare (usually things like building indexes), but what really causes problems is CPU steal – when other guests on the host are competing for the CPU resources.

    The way we have combated these problems is to eliminate the possibility of CPU steal and noisy neighbours by moving onto dedicated hardware. And we avoid problems with shared storage by deploying the dbpath onto locally mounted SSDs.

    I’ll be speaking in-depth about managing MongoDB deployments in virtualized or dedicated hardware at MongoDB World this June.

    2. Use multiple databases to benefit from improved concurrency

    Running the dbpath on an SSD is a good first step but you can get better performance by splitting your data across multiple databases, and putting each database on a separate SSD with the journal on another.

    Locking in MongoDB is managed at the database level so moving collections into their own databases helps spread things out – most important for scaling writes when you are also trying to read data. If you keep databases on the same disk you’ll start hitting the throughput limitations of the disk itself. This is improved by putting each database on its own SSD by using the directoryperdb option. SSDs help by significantly alleviating i/o latency, which is related to the number of IOPS and the latency for each operation, particularly when doing random reads/writes. This is even more visible for Windows environments where the memory mapped data files are flushed serially and synchronously. Again, SSDs help with this.

    The journal is always within a directory so you can mount this onto its own SSD as a first step. All writes go via the journal and are later flushed to disk so if your write concern is configured to return when the write is successfully written to the journal, making those writes faster by using an SSD will improve query times. Even so, enabling the directoryperdb option gives you the flexibility to optimise for different goals e.g. put some databases on SSDs and some on other types of disk (or EBS PIOPS volumes) if you want to save cost.

    It’s worth noting that filesystem based snapshots where MongoDB is still running are no longer possible if you move the journal to a different disk (and so different filesystem). You would instead need to shut down MongoDB (to prevent further writes) then take the snapshot from all volumes.

    3. Use hash-based sharding for uniform distribution

    Every item we monitor (e.g. a server) has a unique MongoID and we use this as the shard key for storing the metrics data.

    The query index is on the item ID (e.g. the server ID), the metric type (e.g. load average) and the time range, but because every query always includes the item ID, it makes a good shard key. That said, it is important to ensure that there aren’t large numbers of documents under a single item ID because this can lead to jumbo chunks which cannot be migrated. Jumbo chunks arise from failed splits, where a chunk is already over the maximum chunk size but cannot be split any further.

    To ensure that the shard chunks are always evenly distributed, we’re using the hashed shard key functionality in MongoDB 2.4. Hashed shard keys are often a good choice for ensuring uniform distribution, but if you end up not using the hashed field in your queries they can actually hurt performance, because a non-targeted scatter/gather query has to be used instead.
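    The effect is easy to demonstrate. MongoDB uses its own internal hash function for hashed indexes, but any reasonable hash shows the same behaviour: monotonically increasing ids (like MongoIDs), which would pile up in one chunk under range sharding, spread evenly once hashed. A sketch using md5:

```python
import hashlib
from collections import Counter

def chunk_for(item_id, n_chunks=4):
    """Hash the shard key and bucket it, mimicking how a hashed shard key
    distributes documents across chunks (illustration only -- MongoDB uses
    its own 64-bit hash, not md5)."""
    digest = hashlib.md5(item_id.encode()).hexdigest()
    return int(digest, 16) % n_chunks

# Sequential ids -- the worst case for a ranged shard key -- still spread evenly:
counts = Counter(chunk_for("item-%06d" % i) for i in range(10000))
```

Each of the four buckets ends up with roughly 2,500 of the 10,000 sequential ids, which is exactly the uniformity you want from the balancer’s point of view.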

    4. Let MongoDB delete data with TTL indexes

    The majority of our users are only interested in the highest resolution data for a short period and more general trends over longer periods, so over time we average the time series data we collect then delete the original values. We actually insert the data twice – once as the actual value and once as part of a sum/count to allow us to calculate the average when we pull the data out later. Depending on the query time range we either read the average or the true values – if the query range is too long then we risk returning too many data points to be plotted. This method also avoids any batch processing so we can provide all the data in real time rather than waiting for a calculation to catch up at some point in the future.
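    In miniature, the double-write scheme looks like this (a toy in-memory sketch; in MongoDB the sum/count is maintained with $inc and the raw values are later removed):

```python
class MetricSeries:
    """Toy version of the double-write scheme: keep raw values for
    short-range queries and a running sum/count for long-range averages."""
    def __init__(self):
        self.raw = []      # high resolution values (TTL-expired in MongoDB)
        self.total = 0.0   # maintained with $inc in the real schema
        self.count = 0

    def insert(self, value):
        self.raw.append(value)  # write 1: the actual value
        self.total += value     # write 2: part of the sum/count
        self.count += 1

    def average(self):
        return self.total / self.count

s = MetricSeries()
for load in [0.5, 1.5, 1.0]:
    s.insert(load)
```

Because the sum and count are updated on every insert, the average is always current – no batch job ever has to catch up.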

    Removal of the data after a period of time is done by using a TTL index. This is set based on surveying our customers to understand how long they want the high resolution data for. Using the TTL index to delete the data is much more efficient than doing our own batch removes and means we can rely on MongoDB to purge the data at the right time.

    Inserting and deleting a lot of data can have implications for data fragmentation, but using a TTL index helps because it automatically activates PowerOf2Sizes for the collection, making disk usage more efficient. As of MongoDB 2.6, this storage option becomes the default.

    5. Take care over query and schema design

    The biggest hit on performance I have seen is when documents grow, particularly when you are doing huge numbers of updates. If the document size increases after it has been written then the entire document has to be read and rewritten to another part of the data file with the indexes updated to point to the new location, which takes significantly more time than simply updating the existing document.

    As such, it’s important to design your schema and queries to avoid this, and to use the right update operators to minimise what has to be transmitted over the network and then applied as an update to the document. A good example of what you shouldn’t do when updating documents is to read the document into your application, update it, then write it back to the database. Instead, use the appropriate operators – such as $set, $unset and $inc – to modify documents directly.
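    To make that concrete, here is a toy in-memory version of two of the operators (the real work happens server-side in MongoDB; the point is that only the small operator document crosses the network, never the whole document):

```python
def apply_update(doc, update):
    """Tiny in-memory imitation of MongoDB's $set and $inc operators,
    showing why sending operators beats read-modify-write round trips."""
    for field, value in update.get("$set", {}).items():
        doc[field] = value
    for field, delta in update.get("$inc", {}).items():
        doc[field] = doc.get(field, 0) + delta
    return doc

# Only the operator document is transmitted, not the full document:
doc = {"_id": 1, "status": "open", "hits": 10}
apply_update(doc, {"$set": {"status": "closed"}, "$inc": {"hits": 1}})
```

The update is also atomic on the server, so there is no window between the read and the write for another client to sneak in a conflicting change.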

    This also means paying attention to the BSON data types and pre-allocating documents, things I wrote about in MongoDB schema design pitfalls.

    6. Consider network throughput & number of packets

    Assuming 100Mbps networking is sufficient is likely to cause you problems, perhaps not during normal operations but probably when you have some unusual event like needing to resync a secondary replica set member.

    When cloning the database, MongoDB is going to use as much network capacity as it can to transfer the data over as quickly as possible before the oplog rolls over. If you’re doing 50-60Mbps of normal network traffic, there isn’t much spare capacity on a 100Mbps connection so that resync is going to be held up by hitting the throughput limits.
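    The arithmetic is sobering. Using the numbers above (a 100Mbps link already carrying ~60Mbps of normal traffic), a back-of-the-envelope estimate of the initial sync time:

```python
def resync_hours(data_gb, link_mbps=100, baseline_mbps=60):
    """How long an initial sync takes using only the headroom on the link.

    Defaults mirror the example above: a 100Mbps link already carrying
    ~60Mbps leaves ~40Mbps (~5MB/s) for the clone.
    """
    spare_mbps = link_mbps - baseline_mbps
    seconds = (data_gb * 8 * 1024) / spare_mbps  # GB -> megabits
    return seconds / 3600.0

# e.g. resyncing a 500GB data set over the spare 40Mbps takes ~28 hours
```

On gigabit, the same 500GB resync has 940Mbps of headroom and finishes in a little over an hour – one reason to spec network capacity well above steady-state usage.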

    Also keep an eye on the number of packets being transmitted over the network – it’s not just the raw throughput that is important. A huge number of packets can overwhelm low quality network equipment – a problem we saw several years ago at our previous hosting provider. This will show up as packet loss and be very difficult to diagnose.


    Scaling is an incremental process – there’s rarely one thing that will give you a big win. All of these tweaks and optimisations together help us to perform thousands of write operations per second and get response times within 10ms whilst using a write concern of 1.

    Ultimately, all this ensures that our customers can load the graphs they want incredibly quickly, whilst behind the scenes we know that data is being written quickly, safely and that we can scale it as we continue to grow.

  7. Creating custom sounds for iPhone and Android mobile apps


    With the new server monitoring mobile apps for iPhone and Android, we decided to fulfil one of the most common requests we had with our old apps – custom alert sounds. This is important because you want to ensure that you notice alerts coming in from your server and website monitoring, and can distinguish them from other push notifications on your phone.

    Both iPhone and Android allow you to specify custom sounds for push notifications, so we hired a sound designer and composer to create some custom sounds just for our apps. The brief was to create a range of custom sounds for alerts – able to wake people up but still sounding nice, i.e. a broad scope to leave the creativity up to the expert.
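    On iOS, the custom sound is just a field in the push payload naming an audio file bundled with the app (the filename below is hypothetical; Android delivers an equivalent value that the app maps to a bundled resource):

```python
import json

def apns_alert_payload(message, sound="space_ring.caf"):
    """Sketch of an APNs push payload with a custom sound.

    The sound value names an audio file shipped inside the app bundle;
    the filename here is a made-up example.
    """
    return json.dumps({"aps": {"alert": message, "sound": sound}})

payload = apns_alert_payload("CPU load alert on web-1")
```

If the named file isn’t found in the bundle, iOS silently falls back to the default notification sound, so the app and the push payload have to stay in sync.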

    We also asked them to document the process…enter Chris Rowan, the composer we worked with on this project:

    I’m Chris Rowan, freelance sound designer and composer. My specialties lie in creating original sounds for different types of media. For Server Density I developed alert sounds for the purpose of alerting customers when a server is malfunctioning. When creating the sounds for the app, my goal was to create a range of audible alerts that would contrast with the usual tones and jingles associated with most apps. These contrasting alerts give an interesting range, helping users distinguish between Server Density and everyday mobile alerts. Three techniques were used in making the Server Density application alerts:

    1. MIDI
    2. Synthesis
    3. The manipulation of recorded sounds from natural sources

    Below is an analysis of three different sounds, one for each of the techniques described above.

    MIDI instruments

    When using audio editing software, alert sounds can be developed with MIDI instruments. This is achieved by manipulating the instrument’s tone and body through modulation and alteration of the instrument’s parameters, such as flange or chorus.

    Midi instruments

    One of the sounds that utilised this technique is the manori stab (named after the MIDI instrument used).

    The original manori stab has a cascading element that I liked and thought would work well as an alert, but the length was too short and the attack too quick. To achieve the desired sound, pitch shifting and time stretching were required. These tools lowered the tone of the sound and also elongated it.

    Time stretching can be extremely destructive, so it is always best to use it sparingly. Pitch shifting, however, is a non-destructive technique that can effectively raise or lower the pitch of the sound; it is the better option and uses less memory.


    Synthesis

    To create a truly original sound, digital synthesis gives you the means. Most sound editing software will have some sort of synth tool, but for the sound ‘Delays on Delays’ I used an analogue synth.

    I chose to do this because having a physical medium to create with can give you more accurate results, making experimenting a lot faster and easier. The ‘Nord Rack X2’ synth was used to create ‘Delays’, starting with a pure tone (sine wave), then adjusting parameters and altering the attack, sustain, release and decay.

    Nord Rack synth

    Synthesis can be used to create extremely complex sounds from the ground up and is used to make a lot of mobile device ringtones and alerts.

    Manipulation of recordings

    To contrast the synth-heavy sounds described previously, I recorded a range of natural sounds, from wooden knocks to glass squeaks. After collecting the materials needed, I began recording a collection of sounds that would then be mixed, cut, and EQ’ed into desirable tones. I also utilised reverb to give the sounds more tail and body.


    For the sound ‘Glass tones – Toned Down’ I recorded the sound of a finger running over the rim of a half full wine glass. This creates a looping tone that oscillates.

    Wine glass recording

    Once the sound was moved into Logic Pro, cutting was achieved using the sample editor. EQ, modulation and pitch shifting were used to alter the tone and make the sound brighter. Reverb was added to give a smooth tail and bigger body.

    When manipulating recordings it is important to apply effects gradually as plugins such as distortion and modulation will amplify unwanted artefacts previously unheard.

    The final choice

    The entire process created 14 sounds, most of which we have included in the app so the user can choose their favourite. Based on feedback we’ll see if we need any more created. After an internal vote we decided to make “space ring” the default – a nice mixture of a futuristic sci-fi sound and an alert tone.

    To see more of my work please check out my YouTube channel and LinkedIn.

    … and we’re back!

    So thanks to Chris Rowan for his fantastic work on our alert sounds and for documenting it for us to enjoy. If you’re interested in getting your hands on our new iPhone app, it is available over here (if you haven’t got an account with us, you’ll need to sign up for your free trial)!

  8. What’s new in Server Density – Feb 2014


    Each month we round up all the feature changes and improvements we have made to our server and website monitoring product, Server Density.

    New server overview

    Previously you had to configure a range of graphs for each of your devices, but we found that most people just set up a few standard graphs. The most common graphs are now provided by default: we have designed a new overview for each server which reveals all the key metrics without needing any configuration.

    Server overview

    This includes disk usage, CPU, load average, disk i/o and networking. You can view the graphs over the configured time range as well as the current, latest reported value, and the data auto-refreshes when the agent posts in. The old graphs still remain under the “Metrics” tab so you can continue to add custom graphs for other metrics not in the default view.

    iPhone mobile app

    You can now get push notification alerts directly to your iPhone with our official, native server monitoring iPhone app available for free for all v2 accounts. It includes custom sounds too!

    iPhone server monitoring app - Server Density

    Advanced process monitoring alerting

    You can now configure a number of different process alerts:

    • Is a process running? Get alerted when processes crash or stop running, e.g. is Apache running?
    • How many processes are running? Get alerted when the process count goes above or below a certain number, e.g. are there at least 10 Apache worker processes?
    • Process CPU usage: get alerted when a specific process uses a certain amount of CPU resources, e.g. is Apache using too much CPU?
    • Process memory usage: get alerted when a specific process uses a certain amount of memory, e.g. is Apache using too much memory?
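    To illustrate the kind of check involved, here is a rough Python sketch of the first two alert types, reading process names from Linux’s /proc (the function names are hypothetical, not the agent’s):

    ```python
    import os

    def count_processes(name):
        """Count running processes whose command name matches `name` (Linux /proc)."""
        count = 0
        for pid in os.listdir("/proc"):
            if not pid.isdigit():
                continue
            try:
                with open("/proc/%s/comm" % pid) as f:
                    if f.read().strip() == name:
                        count += 1
            except OSError:
                pass  # the process exited while we were scanning
        return count

    def process_alert(name, min_count=1):
        """True if the alert should fire: fewer than `min_count` matching processes."""
        return count_processes(name) < min_count
    ```

    Per-process CPU and memory come from /proc/&lt;pid&gt;/stat and /proc/&lt;pid&gt;/status in the same way, or more portably via a library such as psutil.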

    Process monitoring

    New China monitoring node

    For service web checks you can now add our monitoring node in China, near Shanghai, to see if your site is loading from within the region.
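    Conceptually a service web check is just a timed HTTP request made from the remote node. A simplified Python sketch (our own helper for illustration, not the real check):

    ```python
    import time
    import urllib.request

    def timed_check(url, timeout=10):
        """Fetch a URL and return (HTTP status, response time in seconds)."""
        start = time.monotonic()
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # include the full response body in the timing
            return resp.status, time.monotonic() - start
    ```

    Running the same request from nodes in different regions is what reveals region-specific slowness or blocking.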

    Advanced CPU monitoring alerting

    If you’re running the latest version of the monitoring agent for Linux, you will have the new “ALL” metric for CPUs, which gives you an average across all your CPU cores for things like idle, io wait, user, etc. You can now choose to have alerts triggered on specific CPU cores, if any core matches the alert config, or if the average across all cores matches the config. This is much more fine-grained alerting than was previously possible.
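    The per-core values and the “ALL” average can be seen directly in Linux’s /proc/stat. A rough Python sketch of the idea (note these counters are cumulative since boot, whereas the agent samples deltas over an interval):

    ```python
    def read_cpu_stats():
        """Parse /proc/stat into {cpu id: jiffy counters (user, nice, system, idle, iowait, ...)}."""
        stats = {}
        with open("/proc/stat") as f:
            for line in f:
                if line.startswith("cpu"):
                    parts = line.split()
                    stats[parts[0]] = [int(x) for x in parts[1:]]
        return stats

    def idle_fraction(fields):
        """Fraction of time spent idle (idle + iowait are the 4th and 5th counters)."""
        idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
        return idle / float(sum(fields))

    stats = read_cpu_stats()
    overall = idle_fraction(stats["cpu"])  # the aggregate "ALL"-style average
    per_core = {c: idle_fraction(v) for c, v in stats.items() if c != "cpu"}
    ```

    An alert on “any core” then just checks each entry of the per-core dict against the threshold, while an “ALL” alert checks the aggregate line.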

    CPU alert

    SPDY 3.1

    We’ve updated our load balancers to offer the latest SPDY 3.1 protocol which includes performance enhancements for multiplexing many HTTP requests. As per Cloudflare:

    A key advantage of SPDY is its ability to multiplex many HTTP request streams onto a single TCP connection. In the past, various hacks (such as domain sharding) have been used to get around the fact that only sequential, synchronous requests were possible with HTTP over TCP. SPDY changed all that.

    SPDY/3 introduced flow control so that SPDY clients (and servers) could control the amount of data they receive on a SPDY connection. SPDY/3.1 extended flow control to individual SPDY streams (each SPDY connection handles multiple simultaneous streams of data). Flow control is important because different clients (think of the differences in available memory in laptops, desktops and mobile phones) will have varying limitations on how much data they can receive at any one time.

    This means you should see better performance if you’re using the latest browser versions. Currently we only officially support Chrome, although Firefox does generally work and will be officially supported soon.

  9. How do you document your ops infrastructure?


    As your team and infrastructure grow, one of the most important things is how everything is documented. Anyone new joining the team, existing members working on new areas, and even the on-call team need to know how things work.

    The first line of documentation is essentially config management, and for this we use Puppet. This defines things like packages, config files and server roles; however, it only defines the “state”. In addition, documentation needs to cover things like emergency response, how to deal with alerts, failover procedures, processes, checklists and vendor information.

    What do we want from our ops documentation?

    I’ve just started a project at Server Density to revamp all our docs. We’ve had some problems which could have been avoided or resolved faster if our docs were better. As our infrastructure continues to grow, this is important to address properly, and then keep well maintained.


    Historically, we used Confluence as a wiki but more recently have been using Github with markdown formatted files alongside code. However, there are some problems:

    • Search. Search in Github is designed for code, and requires filters for the organisation and repository. We’d need to split the docs into a separate repo to avoid also searching the code alongside them. In Confluence, search was never very accurate and was also quite slow.
    • Editing. The biggest challenge for any documentation is keeping it up to date. Being able to quickly edit the docs is important and there’s some overhead with a wiki format or having to commit code – it’s minor, but is an extra step. Formatting is also inflexible.
    • Collaboration. Being able to work on a doc simultaneously or discuss changes/comment on areas of a doc is much better in Github than on Confluence but is still focused around individual commits, or pull requests combining specific changes. This works well for a specific body of work but not for ongoing discussions.
    • Speed. Github has good performance but Confluence is really quite slow at everything. We used their hosted version rather than the on-premise install.

    In summary, we want a system that has minimal barriers to creating/editing docs, can be searched quickly and accurately, is easy to collaborate on and ideally it should also be available offline and/or downloadable.


    How do other people document their infrastructure?

    Having looked online and found little about what other companies are doing (other than a brief mention of Confluence by Etsy), I asked on Twitter to see what other people were using.

    You can click through to see the range of replies – they included things like Mediawiki, Github Wiki, OneNote, Hackpad, Confluence and some more complex tools with offline sync. Also noted was how Github do this, using markdown files which are sync’d offline too.

    What did we pick for ops documentation?

    Having already tried Confluence and markdown files in Github, I decided to try something different – Google Docs. The whole team already has access to it through the web, offline and via mobile; documents can be created and edited very quickly, in-line and collaborated on by multiple team members; it has a built in drawing tool so we can create system diagrams; it’s very fast to load; and crucially, search is incredibly fast and accurate. Indeed, it is Google search, after all! You can also download documents in multiple formats to store offline if you prefer.

    I’m still working on building things out, transferring info and making sure we have everything up to date. The key will be how it performs during incident response and when someone needs to find something. And if it doesn’t work then the docs are available in various formats so will be easy to migrate out.

    In the meantime, if you’re doing something different or have a good way to address the documentation problem – please do comment!

    Google Docs

  10. What’s new in Server Density – Jan 2014


    Each month we round up all the feature changes and improvements made to our server and website monitoring product, Server Density.

    New dashboard widgets – RSS feeds, cloud status, open alerts

    Yesterday we officially announced our ops dashboard, which includes a range of new widgets. These are all useful for displaying an overview of your entire infrastructure, e.g. on a big TV!

    Dashboard in office

    There are several new widgets available:

    Cloud status

    Choose your cloud vendor and product/region and we’ll pull in the latest status from their public status feeds. This is useful to see if any alerts are being caused by known issues in your region. We support status feeds from Amazon Web Services, Rackspace Cloud, Digital Ocean, Joyent, Google Compute, Linode, IBM SoftLayer and Microsoft Azure.

    RSS feed

    If you want to see the latest items from a generic RSS feed or a cloud provider we don’t support, you can enter the URL and we’ll pull in the latest items.
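    Under the hood this is standard RSS 2.0 parsing; a minimal sketch with Python’s stdlib (real-world feeds need more care with encodings and Atom variants, and the helper name here is ours):

    ```python
    import xml.etree.ElementTree as ET

    def latest_items(rss_xml, limit=5):
        """Return (title, pubDate) pairs for the newest items in an RSS 2.0 feed."""
        root = ET.fromstring(rss_xml)
        return [
            (item.findtext("title", ""), item.findtext("pubDate", ""))
            for item in root.iter("item")
        ][:limit]
    ```

    The widget simply fetches the feed URL on a schedule and renders the first few items.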

    Open alerts

    Display how many open alerts there are on your entire account, at a group level or on specific devices or service checks.

    Group alerts for service checks

    You have been able to create alerts on a group level for devices/servers for some time, but we have now extended this functionality to service web checks too. This means you only need to create the alerts once on a group level and all members of that group will inherit the alert config.

    You can configure group alerts when viewing the Alerting tab for a particular web check or by clicking the name of the group in the services list.

    Service check group alerts

    What’s next?

    We’re planning to submit our iPhone app to Apple at the end of this month and the Android app will follow shortly afterwards at the start of Feb. We’ll then be moving on to more detailed process monitoring as well as a range of improvements to the device view with better default graphs for specific metrics.