Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. Moving to a metric centric model – Zoom, SNMP, statsd, JMX & 70 new plugins

    Earlier this year, we completed a migration to a new time series database (OpenTSDB) backed by a new datastore (Google Cloud Bigtable). There were many reasons for this migration but one of the main ones was to support a range of new features we wanted to build on top of it.

    Having now completed the full migration of our entire infrastructure (not just the metrics backend) to Google Cloud as of the beginning of Nov, I’m pleased to announce we’re now rolling out a range of new features to Server Density SaaS monitoring.

    A move to a metric centric model

    The industry has changed significantly since Server Density started in 2009. In those days, servers (instances, nodes or devices) were the core component of infrastructure and so it made sense for metrics to belong to an instance.

    Today, metrics are the main component of monitoring. A metric might still be associated with a server but it could also be associated with an application, a container or a function.

    In this release, we’re introducing a transition to this metric centric model which starts with how configuration of graphs works on dashboards. Graphs are now built series-by-series by choosing a number of filters, one of which can be the name of the device or service. In particular, you can specify string based match patterns which makes it much easier to plot metrics from a cluster of servers or containers.

    Server Density metric centric model

    We no longer have the concept of “elastic graphs” because all graphs are now capable of dynamically updating as new matches to your filters come online. This means you could have just a single series configuration that can match many metrics.
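    To illustrate how a single series configuration can match many metrics, here is a minimal sketch using shell-style wildcards from Python's standard library. The pattern syntax, the device names and the `match_series` helper are assumptions for illustration only; the actual filter syntax in Server Density may differ.

```python
from fnmatch import fnmatchcase

# Hypothetical device names that all report the same metric.
devices = ["web-1", "web-2", "db-1", "web-canary"]


def match_series(pattern, names):
    """Return the names that match a shell-style wildcard pattern."""
    return [name for name in names if fnmatchcase(name, pattern)]


# One series definition covers every current "web-" device...
print(match_series("web-*", devices))

# ...and picks up new matches as they come online, with no config change.
devices.append("web-3")
print(match_series("web-*", devices))
```

    The point of the sketch is that the filter is evaluated against whatever names exist at the time, which is why a separate "elastic graph" concept is no longer needed.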

    Multi-dimensional metrics

    Many of our customers make use of our API or the agent custom plugin framework to send us custom metrics. Previously, custom metrics had a top level name and then multiple key/value pairs. Only a single level of metrics was supported.

    The move to the new time series database has allowed us to implement multi-dimensional metrics so you can now have custom metrics with any number of levels of data. This is useful to embed contextual information into the metric hierarchy e.g. container IDs which can then be filtered using the new graphing configuration options.

    You will see this change in system metrics and official plugins because we have moved to a dot notation format for all metric names. This makes it easy to determine the hierarchy and filter based on name matches.
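    As a rough sketch of why dot notation makes the hierarchy easy to filter, the example below selects every metric under a given set of levels. The metric names and the `filter_by_prefix` helper are hypothetical; the real names emitted by the agent and plugins may differ.

```python
# Hypothetical metric names in dot notation, with a container ID embedded
# as one level of the hierarchy.
metrics = {
    "docker.container.abc123.cpu.user": 12.5,
    "docker.container.abc123.memory.rss": 2048,
    "docker.container.def456.cpu.user": 7.0,
    "system.load.1": 0.42,
}


def filter_by_prefix(values, prefix):
    """Keep metrics whose dot-separated name starts with the given levels."""
    wanted = prefix.split(".")
    return {
        name: value
        for name, value in values.items()
        if name.split(".")[: len(wanted)] == wanted
    }


# Everything reported for one container, whatever the sub-metric.
print(filter_by_prefix(metrics, "docker.container.abc123"))
```

    Matching level by level, rather than on the raw string, avoids a prefix such as `docker.cont` accidentally matching `docker.container`.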

    Plugins now also support multiple metric types including gauges, counters, histograms, rates and counts, as well as the raw metrics we have historically supported. You can find details with example code in our documentation.

    Drag to zoom on graphs

    One of the major changes in this release is refactoring of our data model. This was required with the shift to a more metric centric model but those data model changes have also allowed us to build out drag to zoom functionality which has been frequently requested. The foundations for this were implemented earlier in the year with our new graphing architecture based on React, Redux and D3.

    Server Density Graph Zoom


    The JMX plugin will automatically pick up 11 core metrics and you can then configure it further to pick up any custom metrics. The details are in our documentation.


    The new version of our monitoring agent can act as a collector for statsd metrics. This means you can instrument your code to measure any metric you like – from execution time through to throughput counters. The agent listens locally and doesn’t require any special configuration. It aggregates the values before sending them into Server Density for reporting and alerting, which means you can report at any volume you wish without incurring network overhead.

    It’s simple to post metrics to Sdstatsd. The Python example below sends the counter metric ‘application.metric.example’ with a random value between 1 and 100.

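    import socket
    from random import randint

    # statsd listener exposed locally by the agent
    HOST = 'localhost'
    PORT = 8125

    # Counter metric with a random value between 1 and 100
    count = randint(1, 100)
    MESSAGE = 'application.metric.example:{}|c|#example: tag'.format(count)

    # statsd uses UDP; encode to bytes so sendto() also works on Python 3
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(MESSAGE.encode('utf-8'), (HOST, PORT))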

    You can find other examples in our documentation.
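    Measuring execution time follows the same pattern. The sketch below assumes the agent accepts the standard statsd timer type (`|ms`) on the same local port as the example above; the `statsd_line`, `send` and `timed` helpers and the metric name are hypothetical and not part of any Server Density library.

```python
import socket
import time
from contextlib import contextmanager

HOST = "localhost"  # assumed: agent listening locally, as in the example above
PORT = 8125


def statsd_line(name, value, metric_type):
    """Format a statsd datagram such as 'app.request.time:12|ms'."""
    return "{}:{}|{}".format(name, value, metric_type)


def send(line):
    """Fire-and-forget UDP send; statsd does not acknowledge packets."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("utf-8"), (HOST, PORT))
    except OSError:
        pass  # never let instrumentation break the application
    finally:
        sock.close()


@contextmanager
def timed(name):
    """Report how long the wrapped block took, in whole milliseconds."""
    start = time.monotonic()
    try:
        yield
    finally:
        elapsed_ms = int((time.monotonic() - start) * 1000)
        send(statsd_line(name, elapsed_ms, "ms"))


with timed("application.request.time"):
    time.sleep(0.05)  # stand-in for the code being measured
```

    Because the send is a local UDP datagram, wrapping a hot code path this way adds very little overhead, which fits the "report at any volume" point above.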


    If you have custom hardware running on your network, the agent can be configured to issue requests to SNMP devices and collect stats data from them. Since the agent sits on your servers, it allows us to collect SNMP metrics from behind your firewall without any custom firewall rules needing to be set up to allow external access. You configure the agent to collect metrics from each device and can also configure your own MIBs. Full details are in our documentation.

    Lots of new plugins

    Many of the plugins we’ve had in development for some time depend on this release because they are multi-dimensional. This means we now have full support for 70 new and improved plugins including:

    Activemq, Cassandra, Docker, Elastic, HAProxy, HDFS, Kafka, Kubernetes, Memcache, MongoDB, MySQL, Nginx, Openstack, PGBouncer, PHP FPM, Postfix, PostgreSQL, RabbitMQ, Redis, Riak, Solr, Spark, Supervisor, Tomcat, Varnish, vSphere and Zookeeper.

    These are available to all customers at no extra cost. We are publishing the documentation for these over the coming days but feel free to get in touch to ask for help configuring them in the meantime – everything is done via agent config files which include detailed comments.


    These new features are available to new accounts right now. We have released a new version of our *nix monitoring agent (2.2.0) which supports all of the above changes and is the minimum version required. Windows support is on our roadmap for 2018.

    Due to the changes in our data model, we’ll be rolling this out to existing accounts gradually with the full release completed for everyone by the end of Jan. You won’t need to make any changes – you’ll see an update to the “What’s new” panel in-app once the migration is completed.

    statsd support will be available for everyone. SNMP and JMX will be available to customers on our Pro or Enterprise pricing only. Customers on older pricing can get in touch to find out about switching to get access to SNMP and JMX.

  2. How to do code reviews

    Peer review is a well established concept in the scientific world, leading to higher quality work, improvements to techniques and an avenue for individuals to learn by examining the work of others.

    Following those ideas, code reviews are an important part of our approach to engineering at Server Density. With hundreds of teams relying on our product for mission critical infrastructure monitoring, it is important that we have a robust development process that can deliver on a high level of code quality. Not only that but code reviews are a great opportunity for our team to learn new approaches and understand the decisions behind engineering choices.

    That said, there are challenges. In how much detail should you approach a review? How do you provide constructive feedback? How do you deal with an implementation you might not agree with? How do you review your manager’s code?

    This is a look behind the scenes at how we do code reviews at Server Density.

    The process

    Opening the review

    All our code reviews are conducted by using Github’s Pull Request functionality. We do all our development on branches, with master always being considered stable. Opening a pull request initiates the code review process and we use a template to provide instructions for the reviewer. A number of key questions are asked by default:

    • What does this PR do? A brief description of what is actually happening, and usually a link to the JIRA ticket that is associated with the piece of work. This helps to provide context about the purpose of the code.
    • Where should the reviewer start? It’s often not obvious where to begin so this should be explicitly stated. It is typically a particular file and may include pointers to direct the reviewer to the starting point to begin their review.
    • How should this be manually tested? The reviewer might not be familiar with this precise area of the codebase, or how to set up the right environment to verify the test case. This is where commands will be listed for the reviewer to execute in order to set up the test, and run the code. Listing precise steps to test the code may be appropriate but often it’s general functionality that is described, leaving it up to the reviewer to step through the changes. This avoids any bias by the author only describing how they have tested it which might miss key elements.
    • Who should review or be notified? Sometimes you might want to ask a specific person to review the code but often this is a group of people e.g. the entire backend engineering team. Anyone can pick up any review so this lets everyone know that a PR is waiting.
    • Some additional questions about the impact on internal services, infrastructure and documentation are also templated to ensure that the full impact of the PR is discussed. There may be a need for infrastructure changes or for the support team to update documentation. Depending on the scope of the review, we also require confirmation that the PR has been tested on the range of browsers and platforms we support and the keyboard controls have been included (if appropriate).

    The template is set up to provide a self contained environment for testing this change and deal with anything that might be touched by it. We link Github to our build system in Travis so that the reviewer can easily see that tests have passed and verify the output. The idea is that once the review is completed, it can be merged into master and deployed with minimal risk.

    Conducting the review

    First, test the code! The PR template will have detailed instructions for testing all parts of the implementation. The reviewer should follow those steps and ensure that everything is tested with good data, bad data and any edge cases. Trying to break it is good!

    Sometimes, pre-existing bugs are found. These are tested against master as well and if found to exist in both, a note added to the review with a link to the existing JIRA case (or a new one opened). It might be easy to fix as part of this PR but is not always necessary. We usually don’t want PRs to experience scope creep.

    PRs for frontend code will also include ensuring that everything is working from a visual perspective. This starts with checking the UI looks right but also includes verifying the implementation against designs and may involve the designers themselves.

    The backend team may have additional considerations around backwards compatibility and what impact merging this PR might have on existing data and APIs.

    Side effects of PRs can lurk in corners so it is important to test key parts of the application to ensure no unexpected regressions have been introduced. Parts of the UI which use the same CSS classes or React components are obvious places to start but there may be other cascading effects. Our test suite helps with this because we have extensive unit and integration tests, but there are always areas without coverage or which can’t be fully tested.

    Unit tests can really help to understand why and how the code is working. A useful approach can be to intentionally introduce bugs into the code to see how the tests behave. For newer parts of the codebase, tests can often serve as documentation and so the reviewer should include a note about this if the tests are key to understanding the code.
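    As a small illustration of deliberately breaking code to see how the tests react, the sketch below changes one comparison operator and checks whether an existing test notices. The `should_alert` function, its threshold and the test are hypothetical and not taken from our codebase.

```python
def should_alert(cpu_percent, threshold=90):
    """Original implementation: alert at or above the threshold."""
    return cpu_percent >= threshold


def should_alert_mutated(cpu_percent, threshold=90):
    """Deliberately introduced bug: '>=' changed to '>'."""
    return cpu_percent > threshold


def boundary_test(func):
    """The test under examination; returns True when it passes."""
    return func(89) is False and func(90) is True and func(91) is True


print(boundary_test(should_alert))          # passes on the real code
print(boundary_test(should_alert_mutated))  # fails, so the boundary is covered
```

    If the test had still passed against the broken version, that would be worth a review comment: the behaviour at the threshold is not actually being tested.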

    For large changes, the test results should be noted in the PR comments. This provides a useful audit log of what was successfully tested and helps provide a methodology for commenting on changes with overall impressions, screenshots/gifs and reproduction steps for bugs.

    Once functional testing is completed, the reviewer will add a comment to allow the author to start working through the feedback. However, the job of the reviewer isn’t over yet. The next step is to read the code, trying to understand every change at least file by file if not line by line. A good commit history will be helpful here because it allows the reviewer to understand how the changes progressed.

    Having two separate steps – one functional, another a close code review – allows for a fully comprehensive review. That is why a comment is added to state the first part of the review is completed but that there are still more comments to come.

    When something isn’t understood then the right approach is to ask questions. Equally, opinions are always welcome. The reviewer should think through whether the changes are logical and how the approach matches the rest of the codebase. If the reviewer finds changes they disagree with in style or strategy rather than function, this should raise a question or suggestion for an alternative approach.

    Be specific in how you suggest alternatives. Saying something like “Why not just use the helper here?” is not as useful as stepping through the exact changes you’d like to see. It not only helps the author know exactly what the reviewer is suggesting (no crossed wires) but it helps avoid code fatigue. After all, the author has probably been working with this code for a long time already!

    There is a balance between disagreeing because it’s not how you would do something vs disagreeing because the approach is questionable. It’s important to remain open minded about methods you haven’t considered.

    Long term maintainability of the codebase is important. If there’s something puzzling you now, it will be even more puzzling in a few months! Pointing out small things which could make reading the code easier will help with this.

    Finally, the reviewer should summarise what they like, highlight the most important in-line questions and state that the review is done. It then heads back to the author to respond.

    The style

    The challenge with written communication is transmitting intention and avoiding misinterpretation. Whether it is email, chat or discussion threads, it is all too easy to apply meaning which wasn’t intended. This can result in negative feelings and unnecessary conflict.

    To help reduce the chances of this, there are a few style guidelines we try to apply.

    For the reviewer

    • Always assume the PR is the best possible approach within the constraints of time and existing code. Also assume the author has thought about it longer than you. The PR doesn’t necessarily show the full journey, only the end result.
    • Try to be as positive as possible in giving feedback. There are reasons the code is written the way it is but it is important to have an opinion and consider how the approach might be improved. The focus should be on that improvement rather than any negativity.
    • Take care not to seem passive aggressive when trying to avoid negativity. The author is placing themselves and the code they have worked hard on in a vulnerable position. At Server Density we have many nationalities represented, all with their own styles of communication, often with English not being the first language. Be gentle and mindful of cultural differences when working with diverse teams.
    • The sandwich method of feedback is useful to remind you to specifically highlight what you like at the same time as giving constructive feedback. Don’t be afraid to mention all the problems you see with the changes – this is the purpose of the review.
    • Always write in full sentences. Using a few emojis can help but avoid jokes, irony and sarcasm – these are too easy to misunderstand in written communication.

    For the author

    • Assume the reviewer means well and there is nothing deliberately passive aggressive.
    • Have an opinion too! You had reasons for implementing the code the way you did so don’t worry about explaining them. We often refer to PRs in the future because they contain important context about architectural and implementation decisions. However, the foundational design and planning work will have been done well before the PR stage and so re-hashing old architectural topics and decisions is probably not a good idea unless they are fundamental to the review.
    • If you think the reviewer is missing a point, help them understand but try to avoid being too defensive.
    • The PR is not a chat – that’s what Slack is for, and we don’t use Slack to conduct reviews. Think through the question and make sure you understand all aspects, then take time to answer.
    • Accept feedback. If you find yourself arguing (even if just inside your head) about every comment then take a break. If you’re really unhappy then arrange a video chat (we use Hangouts) because the limitations of written communication can make it worse.

    For junior team members

    You will do reviews where the author is your manager – they may be the Tech Lead, Engineering Manager or even, occasionally, the CEO. They might know more about the code base and/or have more development experience. When you first join the company, we will try to only give you PRs that you can handle but the ultimate goal is that everyone participates in reviews, merges to master, and presses the deploy button. Don’t worry about who the author is – the above approach applies to everyone equally.

    It’s a tricky balance to be thorough in your reviews at the same time as being speedy but you will get better at this over time. Conducting reviews is a great way to learn the codebase and is a key part of our onboarding process for new hires.

    If you don’t understand something then this is a problem with the PR, not you. Don’t be afraid to ask. There may be some implied knowledge that was missed due to the assumptions of the author, so this is a good opportunity for them to teach you and improve our documentation! Highlight this and a video discussion will usually be the next step.

    The merge

    The most exciting bit, and probably the most nerve-wracking moment – merging and deploying!

    Only commit and merge what you’re happy to maintain for a long time to come. This could be months or years, depending on what part of the code base it is touching.

    Once a piece of code is merged, it is part of the collective work of the team. Forget who wrote it and treat it as your responsibility during the deploy process. All parts of the codebase are owned by everyone.

    Unit tests, integration tests, functional testing, code reviews, PRs, automated deployment, staging, monitoring and rapid rollback are all there to ensure that you can deploy regularly, with confidence.

    Thanks to Daniele De Matteis, Kerry Gallagher, Enrique J. Hernández, Sonja Krause-Harder and Richard Powell for helping with this post.

  3. How we do HumanOps at Server Density


    HumanOps came from experience of Server Density’s team being on call. In the early years, I was on call 24/7 for long periods of time. As the team grew, we implemented policies and processes to help share the load and deal with the challenges of being interrupted or woken up.

    Through building and selling a monitoring product that is designed to wake people up, we noticed that our customers were experiencing the same kinds of challenges with on call that we had experienced ourselves. Talking with customers revealed this was common across the industry, so we examined our own approach and researched best practices. This led to creating a community to share and discuss a set of principles we called HumanOps.

    Just like transitioning technical practices to adopt the ideas of DevOps, bringing faster deployment, modern tooling and shared development / operations responsibilities, we are hoping HumanOps will help organisations adopt a human approach to building and operating systems.

    Here’s how the HumanOps principles work at Server Density.

    1. Humans build and fix systems

    At the heart of HumanOps are the humans building, running and maintaining the systems. It might be obvious but it’s important to state as the first principle because without acknowledging that running operations necessarily involves humans, it’s too easy to just think about the servers, cloud services and APIs.

    In practice, this means ensuring that in all aspects of systems design and management, you think about how humans are involved from the beginning. Areas to consider include:

    • What aspects of the system operation can be automated? Removing humans from the day to day operation is ideal because they should only be involved when something goes wrong that the system cannot fix.
    • What alerts should you configure to involve humans? At what point does the system need someone to investigate a problem and is it critical, or something that can be resolved during working hours?
    • When humans do have to get involved, how is that involvement highlighted to the rest of the team and management? Are you keeping track of out of hours alerts, the frequency and long term trends?
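    Keeping track of out of hours alerts can start very simply. The sketch below counts alerts that fired outside working hours, grouped by month, so the long term trend is visible. The working hours, the timestamps and the helper names are assumptions for illustration, not a description of how Server Density implements this.

```python
from collections import Counter
from datetime import datetime

# Assumed working hours: Monday to Friday, 09:00 to 17:59.
WORK_START, WORK_END = 9, 18


def is_out_of_hours(when):
    """True when an alert fired at the weekend or outside working hours."""
    return when.weekday() >= 5 or not (WORK_START <= when.hour < WORK_END)


def out_of_hours_per_month(alert_times):
    """Count out-of-hours alerts per 'YYYY-MM' to expose long term trends."""
    return Counter(
        when.strftime("%Y-%m") for when in alert_times if is_out_of_hours(when)
    )


# Hypothetical alert timestamps exported from a monitoring tool.
alerts = [
    datetime(2017, 3, 1, 3, 15),   # Wednesday, 03:15: out of hours
    datetime(2017, 3, 1, 11, 0),   # Wednesday, 11:00: working hours
    datetime(2017, 3, 4, 14, 30),  # Saturday: out of hours
    datetime(2017, 4, 3, 22, 5),   # Monday, 22:05: out of hours
]
print(out_of_hours_per_month(alerts))
```

    A monthly count like this is the kind of figure that can be shared with the rest of the team and management.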

    2. Humans get tired and stressed, they feel happy and sad

    This is the starting point for improvements to your processes. When working with computers, you can reasonably expect them to perform in the same way regardless of the time of day. This is, of course, the key benefit of computer systems – they can reliably execute tasks without getting tired.

    A common mistake is applying the same logic to humans, or simply not thinking about how humans react differently in different situations. Emotions, stress and fatigue introduce variability so catering for this is an important part of designing the system.

    An example of this is dealing with human error. Computers don’t make mistakes. They won’t suddenly press the wrong button because they were too tired. Humans can, and without the right training and safeguards they probably will. Human error is a natural part of any system, and understanding how that can affect things is important. It should be considered a symptom rather than a problem, and encourage you to look deeper at the context that allowed a human to make the wrong decision.

    Training is one way to help here. The goal is for training to be as realistic as possible, so when the real thing happens it feels no different from training. This helps to reduce stress in difficult situations, because you know what you’re supposed to be doing. Stress arises from uncertainty coupled with the pressure of knowing the system is broken, so anything that can be done to alleviate that is beneficial. At Server Density, we run war games to simulate common alert scenarios, so that everyone knows what they should do in each situation.

    3. Systems don’t have feelings yet. They only have SLAs

    SLAs are a well understood method of defining what you should expect from a particular service or API. You should be able to easily determine whether a service is hitting its SLA or not, and what happens if it doesn’t. This makes it easy to gauge your expectations.

    4. Humans need to switch off and on again

    Similar to #2, unlike computers which can run constantly for many months and years, humans need time to rest. Responding to alerts and dealing with complex systems quickly takes its toll, so time to rest and recover must be built into the processes. A human can only maintain focused concentration for 1.5 – 2 hours before needing a break or suffering from deteriorating performance.

    The way we deal with this at Server Density is through how we schedule our on call rotations. The primary/secondary roles cycle through the team and we have specific response time guidelines depending on whether you are primary or secondary. This helps to reduce the feeling of being tied to your laptop e.g. the secondary isn’t required to respond as quickly so doesn’t necessarily have to be close to their laptop at all times.
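    A rotation where the primary and secondary roles cycle through the team can be sketched as follows. The weekly shift length and the rule that this week's secondary becomes next week's primary are assumptions made for the example; the names are placeholders.

```python
def on_call_schedule(team, weeks):
    """Cycle primary/secondary through the team, one shift per week.

    Assumption: the secondary for one week becomes the primary the next.
    Returns a list of (week_number, primary, secondary) tuples.
    """
    schedule = []
    for week in range(weeks):
        primary = team[week % len(team)]
        secondary = team[(week + 1) % len(team)]
        schedule.append((week + 1, primary, secondary))
    return schedule


# Placeholder team members.
for week, primary, secondary in on_call_schedule(["Ana", "Ben", "Chi"], 4):
    print("Week {}: primary={}, secondary={}".format(week, primary, secondary))
```

    With three people, each person gets one week in three with no on call duties at all.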

    Further, we have on call recovery time off booked automatically for the next working day whenever you respond to an alert out of hours. The responder has the choice to forgo that time off if they wish but the company will never ask them to do that. This ensures responders have sufficient time to recover and there is no pressure on them not to take it, which could arise if they had to actively request it rather than it being given automatically.
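    The "booked by default, opt out only by choice" rule can be expressed in a few lines. This is a minimal sketch that assumes a Monday to Friday working week and ignores public holidays; the function names are hypothetical.

```python
from datetime import date, timedelta


def next_working_day(day):
    """Return the first Monday-to-Friday date after the given day."""
    candidate = day + timedelta(days=1)
    while candidate.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        candidate += timedelta(days=1)
    return candidate


def recovery_day(alert_day, declined=False):
    """Book time off by default; only the responder can opt out."""
    return None if declined else next_working_day(alert_day)


# An out-of-hours alert handled on Friday 3 March 2017 books the Monday off.
print(recovery_day(date(2017, 3, 3)))
```

    Making the time off the default return value, with declining as the explicit exception, mirrors the intent that nobody has to ask for it.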

    5. The wellbeing of human operators impacts the reliability of systems

    Giving people time off after dealing with alerts overnight might sound like us just being nice to our team. However, as nice as it may seem there is also a business reason behind it – people who are tired make mistakes and there are many examples of major outages caused or made worse by operator fatigue.

    Just like insurance, it can be hard to show a direct benefit because you’re hoping you never have to use it. The benefit of reducing the chances of human error is that something bad does not happen. That can be hard to measure, but there is logical reasoning that if your human operators are happy, they will make better decisions.

    6. Alert fatigue == Human fatigue

    Receiving too many alerts is known as alert fatigue. It’s when you receive so many that you tune out and ignore them, potentially missing something important. It defeats the point of alerting, which should be a rare event to notify humans that something serious is wrong.

    Solving this involves auditing your monitoring to ensure that the alerts you get are actually actionable, and should be actioned.

    7. Automate as much as possible, escalate to a human as a last resort

    This is linked to #6 because alerts should only ever reach a human if the system can’t fix itself. Waking someone up to reboot a server or perform a simple manual action is not acceptable. Where something can be scripted, it should be. Humans should only ever be involved to diagnose complex issues and perform unusual actions which must have human decision or supervision.
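    The "script it first, escalate last" idea can be sketched as a small dispatcher. The alert structure, the `restart_service` remediation and the paging callback are all hypothetical stand-ins; a real setup would call out to your own tooling.

```python
def handle_alert(alert, remediations, page_human):
    """Try a scripted fix first; page a human only if none works."""
    fix = remediations.get(alert["type"])
    if fix is not None:
        try:
            if fix(alert):
                return "auto-resolved"
        except Exception:
            pass  # a failed script falls through to escalation
    page_human(alert)
    return "escalated"


def restart_service(alert):
    """Hypothetical remediation script; returns True when the fix worked."""
    return alert.get("restartable", False)


paged = []  # stands in for a real paging integration
remediations = {"service_down": restart_service}

print(handle_alert({"type": "service_down", "restartable": True},
                   remediations, paged.append))
print(handle_alert({"type": "disk_corruption"}, remediations, paged.append))
```

    Only the second alert, which has no scripted fix, reaches a human.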

    Unfortunately, this is difficult to solve after a system has gone into production. This is because with modern technologies such as Kubernetes and cloud APIs, it is possible to automate recovery of almost every type of failure but it is a lot of work to retrofit new technologies to legacy systems. Of course, there is a cost in both running redundant systems and the time required to implement, but it will repay itself with the time saved from your human team and the reliability offered to customers.

    The right principle to apply to building new infrastructure is that nothing in production should ever be done manually. Everything should be templated and scripted, so it can be handled automatically.

    When retrofitting legacy infrastructure, a balance has to be struck because it may not be realistic to rewrite major components into containers, for example. But there may be ways to achieve similar goals e.g. moving a self-hosted database to a managed service such as AWS RDS.

    8. Document everything. Train everyone.

    Nobody really likes writing documentation but it quickly becomes necessary as your team grows and as the system becomes more complex. You need sufficient documentation such that someone with limited knowledge of the detailed internals is able to resolve problems with checklists and run books.

    Training is just as important, and will help to reveal deficiencies in the documentation. Running realistic simulations in addition to walking people through how things work is essential for anyone on call.

    At Server Density we make use of Google Drive to help make our documentation easily accessible and searchable to the whole company, but there are plenty of other options for hosting your docs.

    9. Kill the shame game

    Getting to the root cause of a problem will almost inevitably mean that you find that someone made a mistake, didn’t plan every scenario, made mistaken assumptions or introduced a bug. This is normal and people should not be shamed as a result because they will be less likely to want to help discover the problems next time.

    Nobody is perfect and everyone has broken production at least once! The important part is not blaming an individual, but learning how to make the system better and more resilient to those kinds of problems. It is almost never the case that someone deliberately caused a breakage and so people should be comfortable owning up to their mistakes as soon as they realize them, so a fix can be implemented quickly. Failures should be viewed as an opportunity to learn and get better as a team.

    The way to implement this is with the principle of blameless post-mortems. This involves completing an analysis of the incident to understand what went wrong right down to the root cause but without singling out an individual at fault.

    10. Human issues are system issues

    There is a tendency to consider human and system issues separately. It is normal to be able to justify spending on additional system capacity and failover but managers are less used to thinking about human issues with the same priority. All the principles above highlight why human issues are just as important, and so they should be given the same time, consideration and budget.

    When planning our development cycles at Server Density, we often prioritise tasks based on whether the fix will reduce the number of out of hours alerts. Implementing fixes for issues discovered in our incident post mortems becomes high priority if the issue is waking people up, or has the potential to in the future.

    11. Human health impacts business health

    The justification for #10 is that if our human health and wellbeing is impacting on our work and contributing to system problems, and system problems are causing loss of revenue or reputation, then human health is directly related to business health. Hiring is expensive and time consuming so looking after your team is just good business.

    12. Humans > systems

    Although it is important to consider humans and systems to be the same in terms of level of impact they have on each other, and how interconnected they are, humans are ultimately the most important. After all, why does your business exist in the first place? To provide a service to other humans! And why do people do a particular job if not to help provide them with a living?

    Not only that but improving life for your own team is easily justifiable. To be able to hire and retain the best people, you must have good working practices. Constantly being woken up, blaming people for errors and not fixing problems will eventually take its toll on people. Increased stress levels over a prolonged period of time can have significant health impacts and has been linked to high blood pressure, heart disease, obesity and diabetes. Many organisations are unintentionally impacting the health of their employees in significant ways.

    At Server Density we believe this is an unacceptable cost of business success.

  4. AWS Outage Teaches Us Monitor Cloud Like It’s Your Data Center

    At the beginning of the month, AWS suffered a major outage of its S3 service, the original storage product which launched Amazon Web Services back in 2006.

    Reliance on this service was highlighted by the vast number of services which suffered downtime or degraded service as a result. The root cause turned out to be human error followed by cascading system failures.

    With a growing dependence on the cloud for computing and with no signs of demand for cloud resources abating, we really need to treat those resources like the on-premises data center that we relied on for so many years.

    As the article “3 Steps to Ensure Cloud Stability in 2017” points out, “it’s critical to ensure the stability of your cloud ecosystem”, and that starts with monitoring. The article offers the following advice: “Ensure that you have access to reports which can give you actionable, predictive analytics around your cloud so that you can stay ahead of any issues. This goes a long way in helping your cloud be stable.”

    Of course, I couldn’t agree more! Server Density even built an app to send notifications when cloud providers have outages.

    The cloud might provide “unlimited” scalability and instant provisioning, but SLAs and reliability guarantees are often mistaken for promises of 100% uptime and complete reliability. Note that S3 itself guarantees 99.99% uptime every year, which equates to just under an hour (roughly 53 minutes) of expected downtime.
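    To make the arithmetic concrete, here is a minimal sketch that converts an availability percentage into the downtime it allows. It takes the 99.99% figure quoted above at face value; the function name is illustrative only.

```javascript
// Convert an availability percentage into the downtime it permits over a period.
function allowedDowntimeMinutes(availabilityPercent, periodDays) {
  const totalMinutes = periodDays * 24 * 60;
  return totalMinutes * (1 - availabilityPercent / 100);
}

// 99.99% over a 365 day year leaves 0.01% of 525,600 minutes.
const perYear = allowedDowntimeMinutes(99.99, 365);
console.log(perYear.toFixed(1) + ' minutes per year'); // 52.6 minutes per year
```

    The same function shows how quickly the budget grows as the guarantee weakens: 99.9% allows roughly ten times as much downtime.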

    But note that the outage only affected the US East region. Other regions were unaffected, yet the fact that many services suffered outages indicates they are relying on a single region for deployments. AWS runs many zones within regions, which are equivalent to individual data centers but are still within a logical group and a small geographical area. Cross region deployment is typically reserved for mitigating against geographic events e.g. storms, but should also be used to mitigate software and system failures. Good systems practice means code changes get rolled out gradually and indeed, AWS states that regions are entirely isolated and operated independently.

    S3 itself has a feature which automates cross region replication. Of course, this doubles your bill because you have data in two regions, but it does allow you to switch over in the event an entire region is lost. Whether that cost is worth it depends on the type of service you’re running. Expecting an hour a year of downtime is the starting point for the cost benefit calculation, but this particular outage took the service offline for more than that.

    Human error can never be eliminated, but the chances can be reduced. Using automation, checklists and ensuring teams practice incident response all contribute to good system design. Having a plan when things go wrong is crucial, just as crucial as testing the plan actually works on a regular basis! And when the incident is resolved, following up with a detailed (and blameless) post mortem will provide reassurance to customers that you are working to prevent the same situation from happening again.

    Outages will always throw up something interesting, such as the AWS Status Dashboard itself being hosted on S3. The key is knowing when something is going wrong, having a plan and closing it up with a post mortem.

  5. Time series data with OpenTSDB + Google Cloud Bigtable

    For the last 6 years, we’ve used MongoDB as our time series datastore for graphing and metrics storage in Server Density. It has scaled well over the years and you can read about our setup here and here, but last year we came to the decision to replace it with Google Cloud Bigtable.

    As of today, the migration is complete and all customers are now reading from our time series setup running on Google Cloud. We successfully completed a migration with over 100,000 writes per second into a new architecture with a new database on a new cloud vendor with no downtime. Indeed, all customers should notice is even faster graphing performance!

    I presented this journey at Google’s Cloud Next conference last week, so this post is a writeup of that talk, which you can watch below:

    The old architecture

    Our server monitoring agent posts data back from customer servers over HTTPS to our intake endpoint. From here, it is queued for processing. This was originally based on MongoDB as a very lightweight, redundant queuing system. The payload was processed against the alerting rules engine and any notifications sent. Then it was passed over to the MongoDB time series storage engine, custom written in PHP. Everything ran on high spec bare metal servers at Softlayer.

    The old Server Density time series architecture

    Scaling problems

    Over the years, we rewrote every part of the system except the core metrics service. We implemented a proper queue and alerts processing engine on top of Kafka and Storm, rewriting it in Python. But MongoDB scaled with us until about a year ago, when the issues that had been gradually growing began to cause real pain.

    • Time sink. Whilst MongoDB is an off-the-shelf product, it is designed as a general purpose database, so we had to implement a custom API and schema to have it handle time series data efficiently. This was taking a lot of time to maintain.
    • We want to build more. The metrics service was custom built and as a small team, we didn’t have the time to build basic time series features like aggregation and statistics functions. We were focused on other areas of the product without time to enhance basic graphing.
    • Unpredictable scaling. Contrary to popular belief, MongoDB does scale! However, getting sharding working properly is complex and replica sets can be a pain to maintain. You have to be very careful to maintain sufficient headroom so that when you add a new shard, migrations can take place without impacting the rest of the cluster. It’s also difficult to estimate resource usage and predict what is needed to maintain performance.
    • Expensive hardware. To ensure queries are fast, we had to maintain huge amounts of RAM so that commonly accessed data is in memory. SSDs are needed for the rest of the data – tests we did showed that HDDs were much too slow.

    Finding a replacement

    In early 2016 we decided to evaluate alternatives. After extensive testing and evaluation of a range of options including Cassandra, DynamoDB, Redshift, MySQL and even flat files, we picked OpenTSDB running on Google Cloud Bigtable as the storage engine.

    • Managed service. Google Cloud Bigtable is fully managed. You simply choose the storage type (HDD or SSD) and how many nodes you want, and Google deals with everything else. We would no longer need to worry about hardware sizing, component failures, software upgrades or any other infrastructure management tasks.
    • OpenTSDB is actively maintained. All the features we want now, and those we plan to build into the product, are available as standard with OpenTSDB. It is actively developed so new things are regularly released, which means we can add features with minimal effort. Because it is open source, we have also contributed fixes back to the project.
    • Linear scalability. When you buy a Bigtable node, you get 10,000 reads/writes per second at 6ms 99th percentile latency. We can easily measure our throughput and calculate it on a per customer basis, so we know exactly when to scale the system. Deploying a new node takes 1 click and will be online within minutes. Contrast this with ordering new hardware, configuring it, deploying MongoDB replica sets, adding the shard and then waiting for data to rebalance. Bigtable gives us linear scalability of both cost and performance.
    • Specialist datastore. MongoDB is a good general purpose database, but Bigtable is optimised specifically for our data format. It learns usage patterns, distributing data around the cluster to optimise performance. It’s much more efficient for this type of data so we can see significant performance and cost improvements.
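
    The linear scaling point lends itself to a quick capacity calculation. The sketch below assumes the 10,000 operations per second per node quoted above; the headroom parameter and the function name are hypothetical additions for illustration.

```javascript
// Assumption: each node sustains 10,000 reads/writes per second (figure quoted above).
const OPS_PER_NODE = 10000;

// headroomPercent is a hypothetical safety margin, e.g. 20 keeps 20% spare capacity.
function nodesNeeded(opsPerSecond, headroomPercent = 0) {
  const required = (opsPerSecond * (100 + headroomPercent)) / 100;
  return Math.max(1, Math.ceil(required / OPS_PER_NODE));
}

console.log(nodesNeeded(100000));     // 10
console.log(nodesNeeded(100000, 20)); // 12
console.log(nodesNeeded(100001));     // 11
```

    Because throughput per node is fixed, the cost per customer can be estimated the same way: divide a customer's measured operations per second by the per-node figure.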

    The migration

    The first challenge for the migration was that it needed to communicate across providers – moving from Softlayer to Google Cloud. We tested a few options but since Server Density is built using HTTP microservices and every service is independent, we decided to implement it entirely on Google Cloud, exposing the APIs over HTTPS restricted to our IPs. Payloads still come into Softlayer and are queued in Kafka, but they are then posted over the internet from Softlayer to the new metrics service running on Google Cloud. Client reads are the same.

    We thought this might cause performance problems but in testing, we only saw a slight latency increase because we picked a Google region close to our primary Softlayer environment. We are in the process of migrating all our infrastructure to Google Cloud so this will only be a temporary situation anyway.

    Our goal was to deploy the new system with zero downtime. We achieved this by implementing dual writes so Kafka queues up a write to the new system as well as the old system. All writes from a certain date went to both systems and we ran a migration process to backfill the data from the old system into the new one. As the migration completed, we flipped a feature flag for each customer so it gradually moved everyone over to the new system.
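
    The pattern described above (dual writes, backfill, then a per-customer feature flag) can be sketched as follows. This is a simplified illustration rather than our production code: the two stores are in-memory Maps, writes are synchronous instead of being queued through Kafka, and every name is hypothetical.

```javascript
// In-memory stand-ins for the old and new datastores (hypothetical).
const oldStore = new Map();
const newStore = new Map();

// Per-customer feature flag: customers in this set read from the new system.
const readsFromNew = new Set();

function pointKey(customerId, metric, timestamp) {
  return customerId + ':' + metric + ':' + timestamp;
}

// Dual write: every incoming point goes to both systems.
function writePoint(customerId, metric, timestamp, value) {
  const key = pointKey(customerId, metric, timestamp);
  oldStore.set(key, value);
  newStore.set(key, value);
}

// Backfill: copy points that predate dual writes, never overwriting newer data.
function backfill(customerId) {
  let copied = 0;
  for (const [key, value] of oldStore) {
    if (key.startsWith(customerId + ':') && !newStore.has(key)) {
      newStore.set(key, value);
      copied++;
    }
  }
  return copied;
}

// Reads follow the feature flag, so customers can be moved one at a time.
function readPoint(customerId, metric, timestamp) {
  const store = readsFromNew.has(customerId) ? newStore : oldStore;
  return store.get(pointKey(customerId, metric, timestamp));
}

// Usage: one historical point, one dual-written point, then cut over.
oldStore.set(pointKey('acme', 'cpu', 100), 0.5);
writePoint('acme', 'cpu', 200, 0.7);
console.log(backfill('acme'));              // 1
readsFromNew.add('acme');
console.log(readPoint('acme', 'cpu', 100)); // 0.5
```

    The key property is that the flag is only flipped after the backfill has finished, so a customer never sees gaps in their graphs.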

    The new system looks like this:

    The new Server Density time series architecture

    Using the Google load balancers, we expose our metrics API which abstracts the OpenTSDB functionality so that it can be queried by our existing UI and API. OpenTSDB itself runs on Google Container Engine, connecting via the official drivers to Google Cloud Bigtable deployed across multiple zones for redundancy.
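
    For illustration, OpenTSDB's HTTP query endpoint (/api/query in the 2.x API) accepts a JSON body containing a time range and a list of sub-queries. The sketch below only builds such a body and makes no network call. The metric name, tag and helper function are hypothetical, and the shape of our own metrics API is not shown here.

```javascript
// Build a JSON body for OpenTSDB's /api/query endpoint (2.x HTTP API).
// Times are Unix timestamps in seconds.
function buildOpenTsdbQuery(metric, tags, startSeconds, endSeconds) {
  return {
    start: startSeconds,
    end: endSeconds,
    queries: [
      {
        aggregator: 'avg',      // how to combine matching series
        metric: metric,         // hypothetical metric name supplied by the caller
        tags: tags,             // e.g. { host: 'web1' }
        downsample: '1m-avg',   // one averaged point per minute
      },
    ],
  };
}

const body = buildOpenTsdbQuery('system.cpu.user', { host: 'web1' }, 1264068000, 1264071600);
console.log(JSON.stringify(body, null, 2));
```

    Wrapping this in a separate service means the UI and public API never need to know that OpenTSDB sits behind it.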

    What did we end up with?

    A linearly scalable system, high availability across multiple zones, new product features, lower operational overhead and lower costs.

    As a customer, you should notice faster loading graphs (especially if you’re using our API) right away. Over the next few months we’ll be releasing new features that are enabled by this move, the first you may have already noticed as unofficially available – our v2 agent can backfill data when it loses network connectivity or cannot post back for some reason!

  6. Comparing Server Density vs Datadog


    In 2009 I wrote the original version of the Server Density monitoring agent – sd-agent – designed to be lightweight and quick and easy to deploy. It was released under the FreeBSD open source license because I thought that if I was installing software onto my systems, I would at least want to have the ability to examine the source code! There weren’t any good SaaS server monitoring products, so I decided to build one.

    In 2010, Datadog forked sd-agent into dd-agent and started building their company around the agent. Since then, they have grown very quickly, raised a huge amount of investment and added a lot of functionality to their product.

    At the end of 2015 we released sd-agent v2, which was a merged version of dd-agent. We brought many of the improvements from the Datadog fork into the Server Density agent (although we decided to package plugins independently, rather than bundled into one distribution so you can maintain a lightweight installation and update components separately).

    We think Server Density is a great alternative to Datadog, and there are a few features in particular which make Server Density vs Datadog an interesting comparison.

    Tag based user permissions

    Server Density uses tags to allow you to choose which users can have access to specific resources. You can add as many users as you wish and by using tags, you can control whether they can see particular servers and availability monitors. Uses for this range from reselling monitoring to your customers through to giving particular access to development teams whilst the operations team maintain a complete overview.

    Server Density vs Datadog: Tag based user permissions

    Slackbot for chatops

    Our Slackbot allows you to ask questions about the state of your systems. Request graphs and check alerts from within Slack.

    Server Density vs Datadog: Server monitoring Slackbot

    Website and API availability monitoring

    By running monitoring nodes in locations all over the world, you can use Server Density to quickly configure HTTP and TCP availability checks to monitor website, application and API response time and uptime. We run the monitoring locations for you, so you can get an external perspective of your customer experience.

    We have monitoring locations in: Australia, Brazil, Chile, China, France, Germany, Hong Kong, Iceland, Ireland, Italy, Japan, The Netherlands, New Zealand, Russia, Singapore, South Africa, Spain, Sweden, UK and USA.


    We started the HumanOps community in 2016 to encourage the operations community to discuss the human aspects of running infrastructure. This has resulted in events around the world, including the UK, US, France, Germany, Poland and more. Companies such as Spotify, PagerDuty, Yelp and Facebook have contributed to sharing ideas and best practices for life on call, dealing with technical debt, fatigue and stress.

    Not only that, but we’re building features inspired by HumanOps, such as our Alert Costs functionality that reports on how much time alerts are wasting for your team.

    We’re building more functionality to help teams implement HumanOps principles in their own company – a journey unique to Server Density.

    Try Server Density

    These are the key features we think make us stand out when comparing Server Density vs Datadog, but the best way to decide is to try the product yourself!

    Whether you’re after a less complex alternative like Firebox were, or whether you don’t want to have to deal with managing your own open source monitoring like furryLogic, Server Density is a great choice.

    Sign up for a free trial.

  7. Saving $500k per month buying your own hardware: cloud vs co-location


    Editor’s note: This is an updated version of an article originally published on GigaOm on 07/12/2013.

    A few weeks ago we compared cloud instances against dedicated servers. We also explored various scenarios where it can be significantly cheaper to use dedicated servers instead of cloud services.

    But that’s not the end of it. Since you are still paying monthly, projecting the costs out over 1 to 3 years shows that you end up paying much more than it would have cost to purchase the hardware outright. This is where buying and co-locating your own hardware becomes a more attractive option.

    Putting the numbers down: cloud vs co-location

    Let’s consider the case of a high throughput database hosted on suitable machines on cloud and dedicated servers and on a purchased/co-located server. For dedicated instances, Amazon has a separate fee structure and on Rackspace you effectively have to get their largest instance type.

    So, calculating those costs out for our database instance on an annual basis would look like this:

    Amazon EC2 c3.4xlarge dedicated heavy utilization reserved
    Pricing for 1-year term
    $4,785 upfront cost
    $0.546 effective hourly cost
    $2 per hour, per region additional cost
    $4,785 + ($0.546 + $2.00) * 24 * 365 = $27,087.96

    Rackspace OnMetal I/O
    Pricing for 1-year term
    $2.46575 hourly cost
    $0.06849 additional hourly cost for managed infrastructure
    Total Hourly Cost: $2.53424
    $2.53424 * 24 * 365 = $22,199.94


    Given the annual cost of these instances, it makes sense to consider dedicated hardware where you rent the resources and the provider is responsible for upkeep. Here at Server Density, we use SoftLayer, now owned by IBM, and have dedicated hardware for our database nodes. IBM is becoming very competitive with Amazon and Rackspace, so let’s add a similarly spec’d dedicated server from SoftLayer, at list prices. To match a similar spec we can choose the Monthly Bare Metal Dual Processor (Xeon E5-2620 – 2.0GHz, 32GB RAM, 500GB storage). This has a monthly cost of $491, or $5,892/year.

    Dedicated servers summary

    Rackspace Cloud    Amazon EC2    SoftLayer Dedicated
    $22,199.94         $27,087.96    $5,892

    Let’s also assume purchase and colocation of a Dell PowerEdge R430 (two 8-core processors, 32GB RAM, 1TB SATA disk drive).

    The R430 one-time list price is $3,774.45, roughly 36% less than a single year of the SoftLayer server at $5,892/year. Of course, there will be additional running costs such as power and bandwidth, depending on where you choose to colocate your server. Power usage in particular is difficult to calculate because you’d need to stress test the server, figure out the maximum draw and run real workloads to see what your normal usage is.

    Running our own hardware

    We have experimented with running our own hardware in London. To draw some conclusions, we used our 1U Dell server, which has specs very similar to the Dell R430 above. With everyday usage, our server draws close to 0.6A. To find the worst case, we stress tested it with everything maxed out, which drew a total of 1.2A.

    Hosting this with the ISP who supplies our office works out at $140/month or $1,680/year. This makes the total annual cost figures look as follows:

    Rackspace Cloud    Amazon EC2    SoftLayer Dedicated    Co-location
    $22,199.94         $27,087.96    $5,892                 $5,454.45/year 1, then $1,680/year

    With Rackspace, Amazon and SoftLayer you’d have to pay the above price every year. With co-location, on the other hand, after the first year the annual cost drops to $1,680 because you already own the hardware. What’s more, the hardware can also be considered an asset yielding tax benefits.
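
    To see how the gap widens over time, here is a small sketch that projects cumulative cost over 3 years using the figures above. It ignores power, bandwidth, hardware failure and depreciation, so treat it as a rough comparison only; the variable and function names are illustrative.

```javascript
// Annual rental costs from the calculations above (USD).
const annualCost = { rackspace: 22199.94, amazon: 27087.96, softlayer: 5892 };

// Co-location: one-off hardware purchase plus yearly hosting ($140/month).
const colo = { hardware: 3774.45, hostingPerYear: 140 * 12 };

function cumulativeRented(costPerYear, years) {
  return costPerYear * years;
}

function cumulativeColo(years) {
  return colo.hardware + colo.hostingPerYear * years;
}

for (let years = 1; years <= 3; years++) {
  console.log(
    'Year ' + years +
    ': Rackspace $' + cumulativeRented(annualCost.rackspace, years).toFixed(2) +
    ', Amazon $' + cumulativeRented(annualCost.amazon, years).toFixed(2) +
    ', SoftLayer $' + cumulativeRented(annualCost.softlayer, years).toFixed(2) +
    ', Co-location $' + cumulativeColo(years).toFixed(2)
  );
}
```

    After 3 years the co-located server has cost $8,814.45 in total, against $17,676 for the SoftLayer server.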

    Large scale implementation

    While we were still experimenting on a small scale, I spoke to Mark Schliemann, who back then was VP of Technical Operations at Moz. They’d been running a hybrid environment and had recently moved the majority of it off AWS and into a colo facility with Nimbix. Still, they kept using AWS for processing batch jobs (the perfect use case for elastic cloud resources).

    Moz worked on detailed cost comparisons to factor in the cost of the hardware leases (routers, switches, firewalls, load balancers, SAN/NAS storage & VPN), virtualization platforms, misc software, monitoring software/services, connectivity/bandwidth, vendor support, colo, and even travel costs. Using this to calculate their per server costs meant that on AWS they would spend $3,200/m vs. $668/m with their own hardware. Their calculations resulted in costs of $8,096 vs. $38,400 at AWS, projecting out 1 year.

    Optimizing utilization is much more difficult on the cloud because of the fixed instance sizes. Moz found they were much more efficient running their own systems virtualized because they could create the exact instance sizes they needed. Cloud providers often increase CPU allocation alongside memory whereas most use cases tend to rely on either one or the other. Running your own environment allows you to optimize this balance, and this was one of the key ways Moz improved their utilization metrics. This has helped them become more efficient with their spending.

    Here is what Mark told me: “Right now we are able to demonstrate that our colo is about 1/5th the cost of Amazon, but with RAM upgrades to our servers to increase capacity we are confident we can drive this down to something closer to 1/7th the cost of Amazon.”

    Co-location has its benefits, once you’re established

    Co-location looks like a winner but there are some important caveats:

    • First and foremost, you need in-house expertise because you need to build and rack your own equipment and design the network. Networking hardware can be expensive, and if things go wrong your team needs to have the capacity and skills to resolve any problems. This could involve support contracts with vendors and/or training your own staff. However, it does not usually require hiring new people because the same team that deals with cloud architecture, redundancy, failover, APIs, programming, etc, can also work on the ops side of things running your own environment.
    • The data centers chosen have to be easily accessible 24/7 because you may need to visit at unusual times. This means having people on-call and available to travel, or paying remote hands at the data center high hourly fees to fix things.
    • You have to purchase the equipment upfront, which means a large capital outlay (although this can be mitigated by leasing).

    So what does this mean for the cloud? On a pure cost basis, buying your own hardware and colocating is significantly cheaper. Many will say that the real cost is hidden in staffing requirements but that’s not the case because you still need a technical team to build your cloud infrastructure.

    At a basic level, compute and storage are commodities. The way the cloud providers differentiate is with supporting services. Amazon has been able to iterate very quickly on innovative features, offering a range of supporting products like DNS, mail, queuing, databases, auto scaling and the like. Rackspace was slower to do this but has already started to offer similar features.

    Flexibility of cloud needs to be highlighted again too. Once you buy hardware, you’re stuck with it for the long term, but the point of the example above was that you had a known workload.

    Considering the hybrid model

    Perhaps a hybrid model makes sense, then? This is where I believe a good middle ground is. I know I saw Moz making good use of such a model. You can service your known workloads with dedicated servers and then connect to the public cloud when you need extra flexibility. Data centers like Equinix offer Direct Connect services into the big cloud providers for this very reason, and SoftLayer offers its own public cloud to go alongside dedicated instances. Rackspace is placing bets in all camps with public cloud, traditional managed hosting, a hybrid of the two, and support services for OpenStack.

    And when should you consider switching? Nnamdi Orakwue, Dell VP of Cloud until late 2015, said companies often start looking at alternatives when their monthly AWS bill hits $50,000, but is even this too high?

  8. Datacenter efficiency and its effect on Humans

    Did you know?

    About 2 percent of world energy expenditure goes into datacenters. That’s according to Anne Curie, co-founder of Microscaling Systems who spoke at the most recent HumanOps event here in London.

    That 2 percent is on par with the aviation industry which, as Curie points out, gets plenty of flak very publicly for being a serious polluter, even though aviation is far more efficient than the datacenter industry average.

    Curie starts her talk with some good news. To a large extent, all the tech progress achieved over the last 20 years went into improving the lives of developers and ops people alike. The cloud takes away the pain of deploying new machines, while higher level languages like Ruby and Python make development much quicker and less painful.

    We optimize for speed of deployment and we optimize for developer productivity. We use an awful lot of Moore’s Law gains in order to do that.

    Anne Curie

    Enter datacenter efficiency

    But there is a caveat to all that progress. Suddenly all of that motivation you had for using your servers more efficiently is gone because somebody else is maintaining those servers for you. You don’t have to worry about where they are, you don’t have to lug them, you don’t even have to order them or find space for them.

    Anne Curie offers some fascinating insights on what all this progress means for humans, their systems, and the environment overall.

    Want to find out more? Watch Anne Curie’s talk. And if you want the full transcript (it’s a keeper), go ahead and use the download link right below this post.

    What is HumanOps again?

    HumanOps is a collection of principles that shift our focus away from systems and towards humans. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.

    Alert Costs is one feature inspired by these principles. Built right into Server Density, Alert Costs measures the impact of alerts in actual human hours. Armed with this knowledge, a sysadmin can then look for ways to reduce interruptions, mitigate alert fatigue, and improve everyone’s on-call shift.

    Find out more about Alert Costs, and see you at our next HumanOps event.

  9. Automatic timezone conversion in JavaScript


    Editor’s note: This is an updated version of an article originally published here on 21/01/2010.

    It’s been a while since JavaScript charts and graphs became the industry norm for data visualization. In fact, we decided to build our own graphing engine for Server Density several years ago because we needed functionality that was not possible with the Flash charts we used earlier. It also allowed us to customize the experience to better fit our own design.

    Since then we’ve been revamping the entire engine. Our latest charts take advantage of various modern JS features such as toggling line series, pinning extended info and more.

    Switching to a new graphing engine was no painless journey of course. JS comes with its own challenges, one of which is automatic timezone conversion.

    Timezones are a pain

    Timezone conversion is one of the issues you should always expect to deal with when building JS applications targeted at clients in varying timezones. Here is what we had to deal with.

    Our new engine supports user preferences with timezones. We do all the timezone calculations server-side and pass JSON data to the JavaScript graphs, with the timestamps for each point already converted.

    However, it turns out that the JavaScript Date object does its own client-side timezone conversion based on the user’s system timezone settings. This means that if the default date on the graph is 10:00 GMT and your local system timezone is Paris, then JavaScript will automatically display that as 11:00 local time (GMT+1).

    This only works when the timestamp passed is in GMT. So it presents a problem when we have already done the timezone conversion server-side, i.e. the conversion will be calculated twice – first on the server, then again on the client.
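
    The double conversion can be reproduced with a short sketch. To keep it independent of the machine's own timezone, it formats times with Intl.DateTimeFormat and an explicit timeZone instead of the local Date getters. The fixed one hour offset stands in for the server-side conversion and is an assumption for this example (Paris is GMT+1 in January).

```javascript
// 2010-01-21 10:00:00 UTC as a Unix timestamp in seconds.
const utcTimestamp = 1264068000;

// Hypothetical server-side step: shift the timestamp by the user's offset.
const parisOffsetSeconds = 3600;
const serverConverted = utcTimestamp + parisOffsetSeconds;

// Format a Unix timestamp as HH:MM in the given IANA timezone.
function clock(timestampSeconds, timeZone) {
  return new Intl.DateTimeFormat('en-GB', {
    hour: '2-digit',
    minute: '2-digit',
    hourCycle: 'h23',
    timeZone: timeZone,
  }).format(new Date(timestampSeconds * 1000));
}

console.log(clock(utcTimestamp, 'Europe/Paris'));    // 11:00 (converted once, correct)
console.log(clock(serverConverted, 'Europe/Paris')); // 12:00 (converted twice, wrong)
console.log(clock(serverConverted, 'UTC'));          // 11:00 (reading as UTC avoids the second shift)
```

    The last line is the behaviour we want: the server has already applied the offset, so the client must read the value without applying another one.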

    We could allow JavaScript to handle timezones and perform all the conversions. However, this would result in messed up links, because we used data points to redirect the user to the actual snapshots.

    Snapshots are provided in Unix timestamp format, so even if the JS did the conversion, the snapshot timestamp would still be incorrect. To completely remove the server side conversion and rely solely on JS would require more changes and a lot more JS within the interface.

    UTC-based workaround

    As such, we modified our getDate function to return the values in UTC—at least it is UTC as far as JS is concerned but in reality we’d have already done the conversion on the server. This effectively disables the JavaScript timezone conversion.

    The following JavaScript snippet converts the Unix timestamp provided by the server into a date string that we can display in the charts:

    function getDate(timestamp) {
        // Multiply by 1000 because JS works in milliseconds instead of the UNIX seconds
        var date = new Date(timestamp * 1000);
        // Use the getUTC* accessors so no local timezone offset is applied
        var year = date.getUTCFullYear();
        var month = date.getUTCMonth() + 1; // getUTCMonth() is zero-indexed, so we increment to get the correct month number
        var day = date.getUTCDate();
        var hours = date.getUTCHours();
        var minutes = date.getUTCMinutes();
        var seconds = date.getUTCSeconds();
        // Zero-pad single digit values
        month = (month < 10) ? '0' + month : month;
        day = (day < 10) ? '0' + day : day;
        hours = (hours < 10) ? '0' + hours : hours;
        minutes = (minutes < 10) ? '0' + minutes : minutes;
        seconds = (seconds < 10) ? '0' + seconds : seconds; // padded, but not part of the label returned below
        return year + '-' + month + '-' + day + ' ' + hours + ':' + minutes;
    }
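
    As a usage sketch, the same formatting can be written more compactly with String.prototype.padStart (ES2017). The names pad and formatUtc are illustrative and not part of the Server Density code; the timestamp is assumed to be in Unix seconds and already adjusted on the server.

```javascript
// Zero-pad a number to two digits.
function pad(value) {
  return String(value).padStart(2, '0');
}

// Format a Unix timestamp (seconds) as "YYYY-MM-DD HH:MM" using only the UTC accessors.
function formatUtc(timestamp) {
  const date = new Date(timestamp * 1000);
  return date.getUTCFullYear() + '-' + pad(date.getUTCMonth() + 1) + '-' + pad(date.getUTCDate()) +
    ' ' + pad(date.getUTCHours()) + ':' + pad(date.getUTCMinutes());
}

console.log(formatUtc(1264068000)); // 2010-01-21 10:00
console.log(formatUtc(0));          // 1970-01-01 00:00
```

    Because only the UTC accessors are used, the output is the same regardless of the timezone configured on the machine running the code.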

    So this is how we handle timezones with JavaScript for the Server Density graphing engine. What is your experience with timezones in JavaScript?

  10. How GOV.UK Reduced their Incidents and Alerts

    Did you watch last week’s HumanOps video—the one with Spotify? How about the one with Barclays?

    Keep reading gentle reader, this is not some Friends episode potboiler joke. We just can’t help getting pumped up with all the amazing HumanOps work that’s happening out there. Independent 3rd party events are now taking place around the world (San Francisco and Poznan most recently).

    So we decided to host another one closer to home in London.

    The event will take place at the Facebook HQ (get your invite). And for those of you who are not around London in November, fear not. We’ll fill you in right here at the Server Density blog.

    In the meantime, let’s take a look at the recent GOV.UK HumanOps talk. GOV.UK is the UK government’s digital portal. Millions of people access GOV.UK every single day whenever they need to interact with the UK government.

    Bob Walker, Head of Web Operations, spoke about their recent efforts to reduce their incidents and alerts (a core tenet of HumanOps). What follows are the key takeaways from his talk. You can also watch the entire video or download the transcript in PDF format and read it in your own time (see right below the article).

    GOV.UK does HumanOps

    After extensive rationalisation, GOV.UK have reached a stage where only 6 types of incidents can alert (wake them up) out of hours. The rest can wait until next morning.

    GOV.UK mirrors their website across disparate geographical locations and operates a managed CDN at the front. As a result, even if parts of their infrastructure fail, most of their website should remain available.

    Once issues are resolved, GOV.UK carries out incident reviews (their own flavour of postmortems). In reiterating the importance of blameless postmortems, Bob said:

    Every Wednesday at 11:00AM they test their paging system. The purpose of this exercise is to not only test their monitoring system but also to ensure people have configured their phones to receive alerts!

    Want to find out more? Watch Bob Walker’s talk. And if you want the full transcript, go ahead and use the download link right below this post.

    See you at a HumanOps event!
