Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. DevOps On-Call: How we Handle our Schedules


    I was on call 24/7/365.

    That’s right. When we first launched our server monitoring service in 2009, it was just me that knew how the systems worked. I was on-call all the time.

    You know the drill. Code with one hand. Respond to production alerts with the other.

    In time, and as more engineers joined the team, we became a bit more deliberate about how we handle production incidents. This post outlines the thinking behind our current escalation process and some details on how we distribute our on-call workload between team members.

    DevOps On-Call

    Developing our product and responding to production alerts when the product falters are two distinct yet very much intertwined activities. And so they should be.

    If we were to insulate our development efforts from how things perform in production, our priorities would get disconnected. Code resilience is at least as important as building new features. In that respect, it makes great sense to expose our developers and designers to production issues. As proponents of DevOps we think this is a good idea.

    Wait, what?

    But isn’t this counterproductive? The state of mind involved in writing code couldn’t be more different from that of responding to production alerts.

    When an engineer (and anyone for that matter) is working on a complex task the worst thing you can do is expose them to random alerts. It takes more than 15 minutes, on average, to regain intense focus after being interrupted.

    How we do it

    It takes consideration and significant planning to get this productivity balance right. In fact, it’s an ongoing journey. Especially in small and growing teams like ours.

    With that in mind, here is our current process for dealing with production alerts:

    First Tier

    During work hours, all alerts go to an operations engineer. This provides a much needed quiet time for our product team to do what they do best.

    Outside work hours, alerts could go to anyone (ops or product alike). We rotate between team members every seven days. At our current size, each engineer gets one week on call and eight weeks off. Everyone gets a fair crack of the whip.
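
    A minimal sketch of such a weekly rotation, assuming a pool of nine engineers (which is what one week on and eight weeks off implies). The names, the start date and the `on_call_for` helper are hypothetical and only illustrate the arithmetic; they are not the actual Server Density schedule.

```python
from datetime import date

# Hypothetical pool: nine engineers gives one week on, eight weeks off.
ENGINEERS = ["eng-%d" % i for i in range(1, 10)]
ROTATION_START = date(2015, 1, 5)  # assumed: a Monday on which eng-1 starts


def on_call_for(day, engineers=ENGINEERS, start=ROTATION_START):
    """Return whoever holds the first-tier, out-of-hours slot on `day`."""
    weeks_elapsed = (day - start).days // 7  # rotate every seven days
    return engineers[weeks_elapsed % len(engineers)]
```

    Because the slot is derived from the date alone, anyone can work out who is on call for a given week without consulting shared state.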

    Second Tier

    An escalated issue will most probably involve production circumstances unrelated to code. For that reason, second level on-call duty rotates between operations engineers only, as they have a deeper knowledge of our hardware and network infrastructure.

    Our PagerDuty Setup

    For escalations and scheduling we use PagerDuty. If an engineer doesn’t respond within the time limit (15 minutes of increasingly frequent notifications via SMS and phone) there will always be someone else available to take the call.

    Our ops engineers are responsible for dealing with any manual schedule overrides. If someone is ill, on holiday or is traveling then we ask for volunteers to rearrange the on-call slots.


    After an out-of-hours alert the responder gets the following 24 hours off from on-call. This helps with the social/health implications of being woken up multiple nights in a row.
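
    The two rules in this section (escalate after 15 minutes without a response, and rest for 24 hours after an out-of-hours alert) can be modelled in a few lines. This is a sketch in plain Python, not the PagerDuty API; the function names and the shape of the `last_night_alert` mapping are assumptions made for illustration.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional

ESCALATION_TIMEOUT = timedelta(minutes=15)  # time limit before escalating
REST_PERIOD = timedelta(hours=24)           # time off after an out-of-hours alert


def pick_responder(candidates: List[str],
                   last_night_alert: Dict[str, datetime],
                   now: datetime) -> Optional[str]:
    """Return the first candidate who is not resting after an out-of-hours alert.

    candidates: names in rotation order, primary first.
    last_night_alert: name -> time of that person's last out-of-hours alert.
    """
    for name in candidates:
        last = last_night_alert.get(name)
        if last is None or now - last >= REST_PERIOD:
            return name
    return None  # nobody eligible: a manual schedule override is needed


def should_escalate(alert_sent_at: datetime, acknowledged: bool, now: datetime) -> bool:
    """Escalate once the alert has gone unacknowledged for the full time limit."""
    return not acknowledged and now - alert_sent_at >= ESCALATION_TIMEOUT
```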

    Keeping Everyone in the Loop

    We hold weekly meetings with operations and product leads. The intention is to keep everyone on top of the overall product health, and to help our product team prioritise their development efforts.

    While I don’t have on-call duties anymore (client meetings and frequent travel while on call just don’t make sense), I can still monitor our alerts on the Server Density mobile app (Android and iOS), which has a prominent place on my home screen together with apps from the other monitoring tools we use.

    What about you? How do you handle your devops on-call schedules?

  2. Escaping Rabbit Holes with Rubber Duck Debugging and more

    In 1947, a team of Harvard University researchers discovered a moth stuck in a relay of their Mark II computer. A tiny bug was blocking the operation of a supercomputer.

    Theories abound as to where the term “debugging” originates. Humans have been debugging since the beginning of time. At its core, debugging is problem solving. Solving the problem of catching fish, for example, when the climate changed and freshwater streams froze. Or figuring out how to rescue the Apollo 13 crew using near-zero electricity.

    It’s not easy

    Debugging, i.e. problem solving, is the most complex of all intellectual functions. Yes, humans have been doing it for a long time but that doesn’t mean we’re good at it, or that we enjoy doing it.

    In fact we often avoid it.

    Problem solving doesn’t come free. The human brain represents no less than 20% of total body energy expenditure. And since humans are hardwired for survival and energy conservation, we tend to relegate thinking to a last resort activity.

    Getting stuck in a rabbit hole

    Even when we think we’re problem solving, chances are we’re not. We naturally default to automatic activities that don’t require actual problem-solving, and don’t overtax our grey matter in any way (see functional fixedness).

    In other words it’s easier to dig than to think. Digging is a manual, repetitive activity. Do enough of that without thinking and we find ourselves in a hole that gets deeper and deeper. The deeper it gets, the more invested we become.

    Rabbit hole activity is when we spend a disproportionate amount of time on a task. The importance of what we’re doing (our sunk costs) is then warped and our logic distorted.

    Last week we poured several development hours down such a rabbit hole, trying to fix a wait/repeat issue with one of the alerts of our monitoring software.

    We started troubleshooting the core logic of our code, assuming there was something wrong with internal state migrations. The actual bug turned out elsewhere (the UI was updating a field that it should not update). So our initial hunch turned out to be misplaced. There is nothing wrong with that.

    What is not ideal is the amount of time we spent under the wrong assumption. The misapplication of our resources began when we stopped questioning our hypothesis, i.e. when we invested ourselves in it.

    We’re not impervious to getting stuck. We do however have systems in place to help us deal with it when it happens. Here are five of them:

    1. Restore Flow – Context Switch

    We’ve all felt it. The feeling of being in the right place. When you can’t type fast enough. When inspiration “flows” and you forget about time.

    When the opposite happens, i.e. when we’re stuck, it’s because the solution or breakthrough we’re after is not there, spatially or mentally. We’re going nowhere. We’re not moving.

    When that happens it’s often best to get up and go. A walk in nearby Chiswick Commons often does the trick for us. We leave our thought patterns behind and let our brain roam farther.

    Context switching helps. Every 6 weeks everyone stops scheduled work and spends a whole week working on a side project of their choice. This purposeful distraction is all about leaving our problems for a period of time. The difficulty has often evaporated by the time we’re back.

    Finally, we’d be remiss not to mention sleep, even if it sounds obvious. Sleep not only rejuvenates our brain but also triggers what psychologists call the incubation effect. It’s almost as if our brain debugs for us while we sleep.

    2. Explain the problem

    “Simplicity is the ultimate sophistication.”

    Leonardo Da Vinci

    When we explain things to people we tend to slow down. Why? Because the person we’re talking to is removed from our situation. They haven’t caught up yet.

    For them to understand our problem, we set it forth in the simplest possible terms: What is it we’re trying to solve? We state our qualifiers: Why are we spending time on this problem? Why is it important? We also tell them what we’ve tried so far.

    If it sounds like hard work it’s because it is. The rigour behind good questions is often enough for solutions to magically present themselves. Articulating good questions pushes our brain into problem solving mode.

    And that paves the way to the next technique. . .

    3. Rubber Duck Debugging – Ask the Duck!

    This infamous rubber duck came to life in 1999 along with the publication of The Pragmatic Programmer.

    Rubber ducks are cute. Problem is, they’re not very smart. In order for them to understand the nature of the bug we’re dealing with, our explanation needs to be extra thorough. As with the previous method, we need to slow down and simplify things for them.

    Staring into the guileless smiling innocence of our duck, we often find ourselves wondering: is there an easier way? Does this have to be done at all?

    This is one of our rubber ducks, here at Server Density.

    Our very own rubber duck

    4. Peer Reviews and Pair Programming

    There are times when talking to an inanimate object is not feasible (open plan office?). Or the solution to the problem might lie outside our domain or scope.

    Nothing happens in isolation, and there is something to be said about teamwork. Having a coding buddy can alleviate some of the horrors of getting stuck. Heaven forbid, it might even make debugging fun.

    Peer reviews (many eyes on the code) are a great way to work with fellow developers and make sure your code is bug free. They’re also a nice way to learn and develop professionally.

    5. Plan Ahead

    Decide how much individual effort (time) you intend to invest in debugging a particular issue before you move on to something else.

    Sometimes we’re dealing with special types of bugs. Like the elusive heisenbugs, or fractal bugs that point to ever more bugs. It’s easy for a bug to turn into a productivity black hole.

    When you reach the end of the allotted time, it’s best to move on and tackle something else. Don’t let one task (bug) swallow other priorities and jeopardise the progress of your project.

    Once you’ve context-switched, worked on something else, and had a break, you’re better equipped to revisit the bug and determine your next steps (and priorities) with a clear mind.

    By the way, it’s worth mentioning debugging tools here. Things like central logging, error monitoring et cetera, can make a huge difference to how quickly you solve a bug. We will discuss this topic in a future post.


    Debugging efforts often go awry and we find ourselves lost in productivity rabbit holes. It happens when our mind trades hard problem solving for easier, repetitive activities that lead nowhere.

    Learn to spot when you’re getting stuck. Know the signs and get better at climbing out of those rabbit holes. Context switching, slowing down, rubber duck debugging and planning ahead are proven methods that get you back on course.

  3. Building Your own Chaos Monkey


    In 2012 Netflix introduced one of the coolest sounding names into the Cloud vernacular.

    What Chaos Monkey does is simple. It runs on Amazon Web Services and its sole purpose is to wipe out production instances in a random manner.

    The rationale behind those deliberate failures is a solid one.

    Setting Chaos Monkey loose on your infrastructure—and dealing with the aftermath—helps strengthen your app. As you recover, learn and improve on a regular basis, you’re better equipped to face real failures without significant, if any, customer impact.

    Our monkey

    Since we don’t use AWS or Java, we decided to build our own lightweight simian in the form of a simple Python script. The end-result is the same. We set it loose on our systems and watch as it randomly seeks and destroys production instances.

    What follows are our observations from those self-inflicted incidents, followed by some notes on what to consider when using a Chaos Monkey on your infrastructure.

    Monkey Island - 3 headed monkey

    Design Considerations

    1. Trigger chaos events during business hours

    It’s never nice to wake up your engineers with unnecessary on-call events in the middle of the night. Real failures can and do happen 24/7. When it comes to Chaos Monkey, however, it’s best to trigger failures when people are around to respond and fix them.

    2. Decide how much mystery you want

    When our Python script triggers a chaos event, we get a message in our HipChat room and everyone is on the lookout for strange things.

    The message doesn’t specify what the failure is. We still need to triage the alerts and determine where the failures lie, just as we would in the event of a real outage. All this “soft” warning does is lessen the chance of failures going unnoticed.

    3. Have several failure modes

    Killing instances is a good way to simulate failures but it doesn’t cover all possible contingencies. At Server Density we use the SoftLayer API to trigger full and partial failures alike.

    A server power-down, for example, causes a full failure. Disabling networking interfaces, on the other hand, causes partial failures where the host may continue to run (and perhaps even send reports to our monitoring service).

    4. Don’t trigger sequential events

    If there’s ever a bad time to set your Chaos Monkey loose, it’s during the aftermath of a previous chaos event. Especially if the bugs you discovered are yet to be fixed.

    We recommend you wait a few hours before introducing the next failure. Unless you want your team firefighting all day long.

    5. Play around with event probability

    Real world incidents have a tendency to transpire when you least expect them. So should your chaos events. Make them infrequent. Make them random. Space them out, by days even. That’s the best way to test your on-call readiness.
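
    The five considerations above can be combined into a small scheduler. What follows is a sketch under stated assumptions, not the script we run: the business-hours window, the four-hour gap, the per-tick probability, the weekend exclusion and the failure-mode names are all illustrative values, and `notify` and `inject` are hypothetical hooks you would wire to your chat room and your hosting provider’s API.

```python
import random
from datetime import datetime, timedelta
from typing import Optional

# All values below are illustrative assumptions.
BUSINESS_HOURS = range(9, 17)     # only start events between 09:00 and 16:59
MIN_GAP = timedelta(hours=4)      # wait a few hours between events
EVENT_PROBABILITY = 0.05          # chance of an event per scheduler tick
FAILURE_MODES = ["power_down", "disable_network_interface"]


def should_trigger(now: datetime, last_event: Optional[datetime], rng) -> bool:
    """Decide whether this tick should start a chaos event."""
    if now.weekday() >= 5 or now.hour not in BUSINESS_HOURS:
        return False              # 1. business hours only (weekends excluded)
    if last_event is not None and now - last_event < MIN_GAP:
        return False              # 4. no sequential events
    return rng.random() < EVENT_PROBABILITY  # 5. infrequent and random


def run_tick(now, last_event, hosts, rng, notify, inject):
    """Run one scheduler tick and return the time of the most recent event."""
    if not hosts or not should_trigger(now, last_event, rng):
        return last_event
    host = rng.choice(hosts)
    mode = rng.choice(FAILURE_MODES)  # 3. several failure modes
    notify("Chaos event started")     # 2. warn the team, keep the details a mystery
    inject(host, mode)                # hypothetical hook that causes the failure
    return now
```

    Passing in `rng` (for example `random.Random()`) and the two hooks keeps the decision logic testable without touching a real server.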

    Initial findings

    We’ve been triggering chaos events for some time now. None of the issues we’ve discovered so far were caused by the server software. In fact, scenarios like failovers in load balancers (Nginx) and databases (MongoDB) worked very well.

    Every single bug we found was in our own code. Most had to do with how our app interacts with databases in failover mode, and with libraries we’ve not written.

    In our most recent Chaos run we experienced some inexplicable performance delays during two consecutive MongoDB server failovers. Rebooting the servers was not a viable long term fix as it results in a long downtime (>5 minutes).

    It took us several days of investigation until we realised we were not invoking the MongoDB drivers properly.

    The app delays caused by the Chaos run happened during work hours. We were able to look at the issue immediately, rather than waiting until an on-call engineer was notified and able to respond, in which case the investigation would’ve been harder.

    Such discoveries help us report bugs and improve the resiliency of our software. Of course, it also means additional engineering hours and effort to get things right.


    The Chaos Monkey is an excellent tool to test how your infrastructure behaves under unknown failure conditions. By triggering and dealing with random system failures, you help your product and service harden up and become resilient. This has obvious benefits to your uptime metrics and overall quality of service.

    And if the whole exercise has such a cool name attached to it, then all the better.

    Editor’s note: This post was originally published on 21st November, 2013 and has been completely revamped for accuracy and comprehensiveness.


  4. 10 Ways to Secure Your Webapp


    While there is no such thing as 100% secure, you can take specific measures to mitigate against a wide range of attacks and secure your webapp as much as possible.

    In this post we discuss some of the steps we’ve taken as part of our efforts to secure our server monitoring tool.

    1. Cover the Basics

    Before considering any of the suggestions listed here, make sure you’ve covered the basics. Those include industry best practices like protecting against SQL injection and XSRF attacks, filtering input, and handling sessions securely.

    Also check out the OWASP cheat sheets and top 10 lists to ensure you’re covered.

    2. Use SSL only

    When we launched Server Density in 2009, we offered HTTPS for monitoring agent postbacks but didn’t go as far as to block standard HTTP altogether.

    Later on, when we made the switch to HTTPS-only, the change was nowhere near as onerous as we thought it would be.

    SSL is often viewed as a performance bottleneck but that isn’t really true. In most situations, we see no reason not to force SSL for all connections right from the start.

    Server Density v2 uses a new URL. As part of this, we can force SSL for new agent deployments and access to the web UI alike. We still support the old domain endpoint under non-SSL but will eventually be retiring it.

    To get an excellent report on how good your implementation is, run your URL against the Qualys SSL server test. Here is ours:

    SSL scan for our webapp

    3. Support SSL with Perfect Forward Secrecy

    Every connection to an SSL URL is encrypted using a single private key. If someone obtains that key they can decrypt and access the traffic of that URL.

    Perfect forward secrecy addresses this risk by negotiating a new key with every session. A compromise of one key would therefore only affect the data in that one session.

    To do this, you need to allow certain cipher suites in your web server configuration.

    ECDHE-RSA-AES128-SHA:AES128-SHA:RC4-SHA is compatible with most browsers (for more background and implementation details check out this post).

    We terminate SSL at our nginx load balancers and implement SSL using these settings:

    You can easily tell if you’re connected using perfect forward secrecy. In Chrome, just click on the lock icon preceding the URL and look for ECDHE_RSA under the Connection tab:

    TLS security
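
    In the cipher list quoted in this section, only the first suite provides forward secrecy; the other two are fallbacks that use static RSA key exchange. A quick way to check a cipher string is to look at the key-exchange prefix of each suite. The sketch below assumes OpenSSL-style suite names as used with TLS 1.2 and earlier, and does not handle OpenSSL keywords such as `HIGH` or `!aNULL`; the helper name is made up for this example.

```python
# Key-exchange prefixes that negotiate an ephemeral key per session
# (OpenSSL-style suite names, TLS 1.2 and earlier).
FORWARD_SECRET_PREFIXES = ("ECDHE-", "DHE-", "EDH-")


def split_by_forward_secrecy(cipher_string):
    """Split a colon-separated cipher list into (forward-secret, other) suites."""
    suites = [s for s in cipher_string.split(":") if s]
    fs = [s for s in suites if s.startswith(FORWARD_SECRET_PREFIXES)]
    other = [s for s in suites if not s.startswith(FORWARD_SECRET_PREFIXES)]
    return fs, other


fs, other = split_by_forward_secrecy("ECDHE-RSA-AES128-SHA:AES128-SHA:RC4-SHA")
# fs    -> ["ECDHE-RSA-AES128-SHA"]
# other -> ["AES128-SHA", "RC4-SHA"]
```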

    4. Use Strict Transport Security

    Forcing SSL should be combined with HTTP Strict Transport Security. Otherwise you run a risk of users entering your domain without specifying a protocol.

    For example, typing the bare domain rather than the full https:// address, and then being redirected to HTTPS. This redirect opens a security hole because there’s a short time when communication is still over HTTP.

    You can address this by sending an STS header with your response. This forces the browser to do the HTTP to HTTPS conversion without issuing a request at all. Instead, it sends the header together with a time setting that the browser stores, before checking again:

    Our header is set for 10 years and includes all subdomains because each account gets their own URL, for example:
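
    A header of the kind described in this section (10 years, all subdomains) can be built as follows. This is a minimal sketch: the `hsts_header` helper is hypothetical, and how you attach the pair to a response depends on your web server or framework.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def hsts_header(years=10, include_subdomains=True):
    """Build a (name, value) pair for the Strict-Transport-Security header."""
    value = "max-age=%d" % (years * SECONDS_PER_YEAR)
    if include_subdomains:
        value += "; includeSubDomains"
    return ("Strict-Transport-Security", value)


name, value = hsts_header()
# value -> "max-age=315360000; includeSubDomains"
```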

    5. Submit STS Settings to Browser Vendors

    Even with STS headers in place there’s still a potential hole, because those headers are only sent after the first request.

    One way to address this is by submitting your URL to browser vendors so they can force the browser to only ever access your URL over SSL.

    You can read more about how this works and submit your URL for inclusion in Chrome. Firefox seeds from the Chrome list.

    6. Enforce a Content Security Policy

    Of the top 10 most common security vulnerabilities, cross site scripting (XSS) is number 3. This is where remote code is injected and executed on your site, usually through incorrect (or non-existing) filtering.

    A good way to combat this is to whitelist the specific remote resources you want to allow. If a script URL is not matched by this list then browsers will block it.

    It’s much easier to implement this on a new product because you can start out by blocking everything. You then open specific URLs as and when you add functionality.

    Using browser developer tools you can easily see which remote hosts are being called. The CSP we use is:

    We have to specifically allow unsafe-eval here, as a number of third party libraries require this. You might not use any third party libraries—or the libraries you do use may not require unsafe eval—in which case you should not allow unsafe-eval.

    script-src is a directive that controls a set of script-related privileges for a specific page. For more information on connect-src, script-src and frame-src this is a good introduction on CSP.

    Be careful with wildcarding on domains which can have any content hosted on them. For example wildcarding * would allow anyone to host any script. This is Amazon’s CDN which everyone can upload files to!

    Also note that Content-Security-Policy is the standard header but Firefox and IE only support X-Content-Security-Policy. See the OWASP documentation for more information about the header names and directives.
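
    The whitelist approach can be sketched as a small mapping from directive to allowed sources, serialised into one header value. The hosts and the policy below are placeholders rather than the policy we ship; start with `'self'` only and add sources as you add functionality.

```python
def build_csp(policy):
    """Serialise {directive: [sources]} into a Content-Security-Policy value."""
    parts = []
    for directive, sources in policy.items():
        parts.append("%s %s" % (directive, " ".join(sources)))
    return "; ".join(parts)


# Hypothetical whitelist: block everything by default, then open specific hosts.
POLICY = {
    "default-src": ["'self'"],
    "script-src": ["'self'", "https://cdn.example.com"],
    "connect-src": ["'self'"],
    "frame-src": ["'none'"],
}

header = ("Content-Security-Policy", build_csp(POLICY))
```

    Keeping the policy as data makes each newly allowed host an explicit, reviewable change.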

    7. Enable HTTP security headers

    You can enable some additional security features in certain browsers by setting the appropriate response headers. While not widely supported, they are still worth considering:

    8. Set up passwords, “remember me” and login resets properly

    This is the main gateway to your webapp, so make sure you implement all stages of logging-in properly. It only takes a short amount of time to research and design a secure process:

    • Registration and login should use salting and cryptographic functions (such as bcrypt) to store passwords, not plain text or MD5 hashing.
    • Password reset should use an out-of-band method to trigger resets, for example: requiring a username then emailing a one-time, expiring link to the on-record email address where the user can then choose a new password. Here is more guidance and a checklist.
    • “Remember me” functionality should use secure tokens to recognise the user, and should not store their credentials in cookies.

    You can review your authentication process against this OWASP cheat sheet.
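
    As an illustration of the first and third points, here is a sketch that uses only the Python standard library. bcrypt is a third-party package, so this example swaps in PBKDF2-HMAC-SHA256 from `hashlib` instead; the iteration count and function names are assumptions, and a production system should follow the OWASP guidance on algorithm and work factor.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 200_000  # assumed work factor; tune for your hardware


def hash_password(password, salt=None):
    """Return (salt, digest) for storage; the password itself is never stored."""
    salt = secrets.token_bytes(16) if salt is None else salt
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, expected_digest)


def new_remember_me_token():
    """Return (random token for the cookie, hash of it to store server-side)."""
    token = secrets.token_urlsafe(32)
    return token, hashlib.sha256(token.encode("ascii")).hexdigest()
```

    The cookie carries only a random token, and the server keeps only its hash, so neither side holds the user’s credentials.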

    9. Offer Multi Factor Authentication

    If your webapp is anything more than a trivial consumer product, you should implement—and encourage your users to use—multi factor authentication.

    This requires them to authenticate using something they carry with them (token), before they can log in. An attacker would therefore need both this token (phone, RSA SecurID etc) and user credentials before they obtain access.

    We use the Google Authenticator standard because it has authentication apps available for all platforms, and has libraries for pretty much every platform.

    It is quite onerous to install a custom, proprietary MFA app so we don’t recommend you implement your own system.

    Be sure to re-authenticate for things like adding/removing MFA tokens. We require re-authentication for all user profile changes.

    We do however have a timeout in place during which users won’t have to re-authenticate. This timeout applies for simple actions like changing passwords (adding or removing tokens requires authentication even during the timeout).

    To sum up, MFA is crucial for any serious application as it’s the only way to protect against account hijacking.
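
    Google Authenticator codes follow the time-based one-time password scheme (TOTP, RFC 6238), which is small enough to sketch with the standard library. This is for illustration only; the function names are made up here, and in production a maintained library is the safer choice.

```python
import base64
import hashlib
import hmac
import struct
import time


def totp(secret_b32, at=None, step=30, digits=6):
    """Compute the TOTP code for a base32 secret at Unix time `at`."""
    key = base64.b32decode(secret_b32, casefold=True)
    counter = int((time.time() if at is None else at) // step)
    digest = hmac.new(key, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = digest[-1] & 0x0F  # dynamic truncation, as in RFC 4226
    code = struct.unpack(">I", digest[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def verify_totp(secret_b32, submitted, at=None, window=1, step=30):
    """Accept codes from the current step and `window` steps either side."""
    now = time.time() if at is None else at
    return any(
        hmac.compare_digest(totp(secret_b32, now + drift * step, step), submitted)
        for drift in range(-window, window + 1)
    )
```

    The `window` parameter tolerates small clock drift between the server and the user’s phone.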

    10. Schedule Security Audits

    We inspect security as part of our code review and deployment process (many eyes on the code). We also have regular reviews from external security consultants.

    We recommend having one firm do an audit, implement their fixes, and then have another firm audit those changes.


    Security is all about identifying and mitigating possible risks of attack. The operative word here is mitigation, since new threats are always emerging.

    This is an ongoing exercise. Be sure to conduct regular reviews of all existing measures, check for new defence mechanisms and keep abreast of security announcements.

  5. Is Security a Growth Catalyst for DevOps?

    Security comes from the Latin root sēcūrus. It means free from care. Some adjectives associated with this word are untroubled, fearless, and composed.

    Security provides a safe space for humans to stretch their imagination and be as creative as they can. It allows for growth.

    It also allows for focus. For small companies like ours, security unfetters our potential to improve our product and serve our customers.

    Good security is not an add-on, a feature or a separate effort. It is an essential building block of our work. And that should be reflected in everything we do, including our people, our infrastructure, our technologies and our product.

    Let’s start with people.

    The Role of People

    “If you think technology can solve your security problems, then you don’t understand the problems and you don’t understand the technology.”

    Bruce Schneier.

    All fourteen collisions with Google’s self driving cars were caused by human error, according to Google. The drivers involved in those accidents were all distracted. It turns out that humans are the weakest link when it comes to safe systems.

    There are a number of ways we approach (and mitigate) this risk. To begin with, we try and have as many “eyes on the code” as possible.

    As part of our code review and deployment process we test each other’s code and try to break it. We are a small and tightly knit team, which is great. But we don’t know it all.

    To reduce the risk of blind spots and confirmation bias (we are only human!), we work with independent security consultants who inspect our product (and code) on a regular basis.

    Another resource we are looking into (but haven’t leveraged yet) is the specialised skillsets of the crowd. There are some compelling platforms for bug bounty and bug reporting out there. Large companies, like Google and Tesla, and smaller ones, like LastPass and Drupal, have used these for a while.

    Now let’s turn our attention to technology, and how we can secure it.

    Multi Factor Authentication

    Multi Factor Authentication (MFA) requires the user to authenticate using something they physically have with them before they can log in. It’s the only way to protect against account hijacking.

    We use MFA internally as much as we can. For example, we enforce Google Authenticator for Gmail, Google Drive and all our Google Apps.

    We also encourage all our customers to activate MFA for their Server Density account.


    Our computers are full-disk encrypted (we use Filevault, PGP Full Disk Encryption or Espionage, depending on the OS). We also encrypt some of our email communications with GnuPG, one of the tools that Edward Snowden used to protect his communications about the NSA.

    Up to Date Software

    We make sure we are always running the latest bug fixed versions of all installed software we use. This includes web browsers, messaging clients, OS components and the OS itself.

    Web Browser

    We like Google Chrome for its tight integration with Google Apps but also for its auto-update feature which keeps the browser secure.

    We are not big on browser add-ons. Click-to-play is an exception as it helps us prevent browser plugin vulnerabilities (Flash and Java in particular). We also use this Chrome extension to protect against phishing on our Google accounts.

    We also recommend Fluffify, our very own Chrome extension. It won’t make you any more secure, but it will keep you sane.


    The second law of thermodynamics states that entropy always increases with time. When it comes to guessing passwords however, time always increases with entropy.

    Password entropy is a measurement of how unpredictable a password is.

    Our passwords are at least 20 characters long. They comprise a mix of upper and lower case letters, numbers and symbols. They are also unique for each system, which means if one system is compromised, others will not follow suit.
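
    To put a number on it: if every character is chosen uniformly at random, entropy is the password length multiplied by log2 of the pool size. The sketch below assumes the 94 printable ASCII letters, digits and symbols as the pool; the helper names are illustrative, and the formula does not apply to human-chosen passwords, which are far more predictable.

```python
import math
import secrets
import string

# 52 letters + 10 digits + 32 symbols = 94 possible characters
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def entropy_bits(length, pool_size):
    """Entropy of a password whose characters are drawn uniformly at random."""
    return length * math.log2(pool_size)


def generate_password(length=20, alphabet=ALPHABET):
    """Generate a random password from the operating system's secure source."""
    return "".join(secrets.choice(alphabet) for _ in range(length))


bits = entropy_bits(20, len(ALPHABET))  # roughly 131 bits for 20 characters
```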

    We keep offsite and easily accessible backups of all our passwords (using tools like 1Password) to allow for easy reset of all account passwords in the event of a breach.

    We never share passwords. Each of us has our very own set of credentials. This helps us deal with red-flag scenarios. Like revoking employee privileges when they leave. Or auditing who accessed a particular server or database.

    Least Privilege

    According to the principle of least privilege, every process or user should only be able to access the resources they need. User administration is a key component of our product.

    Secure Data Flows

    For Server Density to work we ask our customers to install a lightweight agent on their server. All this does is collect various system metrics and constantly report back.

    A deliberate restriction is that data can only travel one way: from the client server to ours. That rules out any possibility of remote execution.

    From that point everything is encrypted. In fact, encrypted post backs are the only option.

    We use ports that are usually already open (HTTPS port 443) which means there is no need to configure anything new. No root access required either. And because our agent is open source, our customers have full visibility of what is running at all times.
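
    The shape of such a one-way, HTTPS-only postback can be sketched as below. This is not the actual agent code: the URL, the payload field names and the `build_postback` helper are hypothetical, and the single metric collected is only a placeholder.

```python
import json
import os
import time
import urllib.request
from urllib.parse import urlparse


def build_postback(url, agent_key):
    """Build (but do not send) an HTTPS POST carrying a metrics snapshot."""
    if urlparse(url).scheme != "https":
        raise ValueError("postbacks must use HTTPS")  # encrypted is the only option
    payload = {
        "agent_key": agent_key,          # hypothetical field names
        "timestamp": int(time.time()),
        "cpu_count": os.cpu_count(),     # placeholder metric
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending is one outbound call on port 443; nothing listens for inbound commands:
# urllib.request.urlopen(build_postback("https://example.com/postback", "KEY"), timeout=10)
```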


    Amateurs hack systems, professionals hack people.

    Bruce Schneier

    We don’t think security is a mere feature, and it shouldn’t be treated as such. At its best, security is an essential building block of the product, the team, and everything a company does.

    From sending data and provisioning access to their systems to storing internal passwords, DevOps teams should take all reasonable precautions to keep confidential data safe and available.

    Having secure systems affords companies the stability and peace of mind they need to be creative, grow, and serve their customers.

    What about you? What industry best practices do you follow?

  6. How and why we use DevOps checklists


    In his book The Checklist Manifesto, Atul Gawande tells the story of the first pre-flight checklist, created by Boeing following the fatal crash of a B-17 in 1935.

    According to the investigation, the pilots forgot to disengage a critical wing adjustment mechanism before take off. It was evidence that even veteran pilots could miss key steps or do things in the wrong order. With hundreds of lives at stake it was necessary to design around this constraint.

    The checklist does exactly that. It compensates for the “limits of human memory and attention.”

    Indeed, Gawande — a doctor himself — writes how key steps in medical procedures were routinely missed, resulting in infections and preventable fatalities. The adoption of checklists reduced those occurrences, and they are now used in a wide range of healthcare settings.

    Checklists for DevOps

    Not unlike healthcare and aviation, sysadmins are often tasked with systems that touch many lives. Here at Server Density we appreciate the complexities of the systems we run. We also recognise the limits of the people who run them — us. That’s why we use checklists for much of what we do.


    Checklists are particularly effective in situations where there is:


    There is only so much that human memory can remember, reproduce and execute upon, in a reliable manner.

    Stress and Fatigue

    Incidents may happen at awkward times, like early in the morning when mistakes are more likely. Sysadmins are vulnerable to stress and fatigue like everyone else.


    You’d expect a seasoned engineer to intuitively know how to deal with a wide range of contingencies. That is a good thing. Experience and tenure, however, could also encourage people to rely on “gut instincts”, to “wing it” and “shoot from the hip.” In complex situations those attitudes could prove hazardous.

    A checklist is a good way to mitigate those problems because it helps us define our response in advance and make it available to everyone. We therefore ensure that every member of our team is taking the right steps in the right order, each and every time.
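
    The "right steps in the right order" idea can be sketched in a few lines of Python. This is only an illustration, not how we store or run our checklists: the `Checklist` class, its method names and the example steps are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Checklist:
    """An ordered list of steps that must be completed in sequence."""

    name: str
    steps: List[str]
    completed: List[str] = field(default_factory=list)

    def next_step(self) -> Optional[str]:
        """Return the next outstanding step, or None when all are done."""
        if len(self.completed) < len(self.steps):
            return self.steps[len(self.completed)]
        return None

    def complete(self, step: str) -> None:
        """Mark a step as done, refusing anything that is out of order."""
        expected = self.next_step()
        if expected is None:
            raise ValueError("checklist already finished")
        if step != expected:
            raise ValueError(f"out of order: expected {expected!r}, got {step!r}")
        self.completed.append(step)

    def is_done(self) -> bool:
        """True once every step has been completed."""
        return len(self.completed) == len(self.steps)


# Hypothetical steps, for illustration only.
incident = Checklist(
    name="example incident response",
    steps=["acknowledge the alert", "open an incident log", "notify the team"],
)
incident.complete("acknowledge the alert")
print(incident.next_step())  # prints "open an incident log"
```

    Rejecting out-of-order steps is the design choice that matters here: the checklist, rather than the tired person following it, carries the sequence.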

    Checklists at Server Density

    Here is one of our own checklists. It defines what our on-call first responders do when a critical incident occurs (we also wrote a guide on how we handle incidents, outages and downtime):

    Incident Response Checklist

    As “common sense” and obvious as the steps may be, they carry great importance for the health of our infrastructure. So we spell them out.

    As you would expect, we take our uptime metrics seriously. We’ve got some pretty capable folks taking care of our servers. And we use checklists. Very prescriptive ones like the one below. This one details the steps our on-call people follow when faced with a server failover:

    Load Balancer Checklist

    We sed/awk/grep all day long. Even so, our checklists assume we are doing everything for the very first time. At 2:00 in the morning we might have trouble finding the lights, let alone the Puppet master for our configuration.

    Here is another example. We use this checklist when a server we monitor stops sending data:

    No Data Checklist

    DevOps checklists are as unique as the teams that use them. Each team has their own recipe for doing things and as technology stacks evolve, so do the checklists required to run them.

    We aim to have a checklist for every scenario: restoring a backup to production, deploying fixes, switching primary data centres, running database consistency checks, and responding to traffic spikes, security breaches, critical alerts and a long list of other contingencies.

    Google Docs is a key part of our on-call playbook, so that’s where we store our checklists for everyone to access and update as needed.

    Checklists are not static

    Relying on checklists does not mean we are inflexible about how we do things. For us, creating checklists is an excellent opportunity to take a step back and review the entirety of our stack.

    DevOps checklists work best when we schedule time to update and improve on them.


    When it comes to server monitoring, we believe checklists are an important step towards reliable systems. They help our team respond to issues in a consistent and timely manner. This translates to increased uptime and a better capacity to serve our customers.

    What about you? Are you using checklists?

  7. What’s new in Server Density – July 2015


    This is our regular post to keep you up to date with the latest releases to our server monitoring product, Server Density.

    Tags as a recipient

    We launched tags several months ago to allow you to set permissions for different users, but they are the foundation for many more features we’ll be releasing over the coming months. The first of these is tags as a recipient. This allows you to have alerts delivered to all members of a tag, rather than having to set up each of your users on every alert configuration individually.

    For example, if you have an “on call” team, you can add all the users to that tag, then set the tag as the recipient for an alert. Each user can have different notification options and any changes you make will apply to all alerts the tag is a recipient for. This is particularly useful if your team changes e.g. new members or staff leaving – you only have to make the change once on the tag and it’ll apply to all alerts.

    Tags as a recipient

    This is available now on device and service level alerts. It’s not available for group level alerts because our next release will be replacing those with alerts on a tag (so servers and services can have multiple tags, with inheritance across multiple tags).

    Learn how to set these up in our support guide.
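
    As a rough illustration of why a tag recipient only needs changing in one place, here is a minimal Python sketch. It is not our implementation: the data structures, names and the `resolve_recipients` function are hypothetical, and it simply assumes tag membership is looked up each time an alert is delivered.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical data: tag name -> users in that tag.
TAG_MEMBERS: Dict[str, Set[str]] = {"on call": {"alice", "bob"}}

# Hypothetical data: each user keeps their own notification options.
USER_CHANNELS: Dict[str, List[str]] = {
    "alice": ["email", "sms"],
    "bob": ["push"],
}


def resolve_recipients(
    recipient_tags: List[str],
    tag_members: Dict[str, Set[str]],
    user_channels: Dict[str, List[str]],
) -> List[Tuple[str, str]]:
    """Expand the tags on an alert into (user, channel) deliveries."""
    deliveries: List[Tuple[str, str]] = []
    seen: Set[str] = set()
    for tag in recipient_tags:
        # Sort so the result is deterministic; unknown tags expand to nothing.
        for user in sorted(tag_members.get(tag, set())):
            if user in seen:
                continue  # a user in several tags is only notified once
            seen.add(user)
            for channel in user_channels.get(user, []):
                deliveries.append((user, channel))
    return deliveries


# The alert stores only the tag, never the individual users.
print(resolve_recipients(["on call"], TAG_MEMBERS, USER_CHANNELS))
```

    Because the alert refers to the tag rather than to a list of users, adding or removing someone from the tag changes who is notified without touching any alert configuration.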

    Service monitoring error details

    We often get reports that our availability monitoring shows “false positives” when compared to competing products. These often turn out to be real errors that we’ve detected where others have not! To back up our claims, we have now exposed full details of any errors we see, along with their history, for all of your service checks.

    You can browse recent errors, search and filter by location and see errors as they are detected. This will help debug any problems we detect with our availability monitoring.

    Service monitoring errors

    Ongoing fixes

    We’re always working on improvements and fixes and often deploy code 5-10 times a day! So if you find any problems or have ideas for improvements, please get in touch so we can continue to improve.

  8. What’s new in Server Density – May 2015


    This is our regular post to keep you up to date with the latest releases to our server monitoring product, Server Density.

    Process statistics

    The main release for this month has been our in-depth process level statistics. Server Density has had the ability to alert on process existence and resource usage for some time but now that data is visualised for each server.

    Each server overview has a top processes widget that gives you a breakdown of the most intensive processes and how many running instances there are:

    Top processes

    This is also extended to the snapshot view which you can reach by clicking on a data point on any graph or from the Snapshot tab when viewing a particular server.

    Processes snapshot


    sd-agent 1.14.0

    A new version of the monitoring agent for Linux, FreeBSD and Mac has been released with a range of bug fixes. This is intended to be the final release of the v1 agent. We’ll soon be releasing sd-agent v2 which will include features such as SNMP, statsd and second by second monitoring.

    Ongoing fixes

    We’re always working on improvements and fixes and often deploy code 5-10 times a day! So if you find any problems or have ideas for improvements, please get in touch so we can continue to improve.

  9. What’s new in Server Density – Apr 2015


    This is our regular post to keep you up to date with the latest releases to our server monitoring product, Server Density.

    New alert config UI

    We released a new configuration interface for managing alerts, the result of several months of work involving design and usability tests. Try it out on your account now and read about the work behind the scenes.

    The new UI

    New support site

    Our support website has been redesigned, all the articles have been updated and you can now log in to submit/view old tickets. We provide live chat, email and phone support to all customers Monday to Friday, 10am to 6pm UK time.

    Set tags from the installer script

    Our quick agent installer shell script now allows you to specify tags when deploying the agent.

    Updated iPhone app

    A new release of the iPhone app for alerting improves various elements of the interface and fully supports fluid layouts, including iPad, iPhone 6 and iPhone 6 Plus.

    Default alerts

    All new devices and services added via the web UI will now get default alerts configured: “no data received” for devices, and “service is down” and “HTTP status code is not 200” for services.

    Ongoing fixes

    We’re always working on improvements and fixes and often deploy code 5-10 times a day! So if you find any problems or have ideas for improvements, please get in touch so we can continue to improve.

  10. What’s new in Server Density – Jan 2015


    This is our regular monthly post to keep you up to date with the latest releases to our server monitoring product, Server Density.

    Latest value widget

    A new widget is available on the dashboard which will show you the latest, current value for any metric. It will also display the average value over the time period the dashboard is configured for e.g. the 24 hour average or 1 hour average, with a sparkline graph in the background.

    Latest value widget
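
    The two numbers the widget shows can be sketched as follows. This is an illustration under stated assumptions rather than the widget's actual code: the `latest_and_average` function is hypothetical, and it assumes the average is a plain arithmetic mean of the samples that fall inside the dashboard's time period.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

Sample = Tuple[datetime, float]  # (timestamp, metric value)


def latest_and_average(
    samples: List[Sample], now: datetime, window: timedelta
) -> Tuple[Optional[float], Optional[float]]:
    """Return (latest value, mean value) for samples inside the window."""
    in_window = [(ts, value) for ts, value in samples if now - window <= ts <= now]
    if not in_window:
        return None, None
    # The latest value is the one with the most recent timestamp.
    latest = max(in_window, key=lambda sample: sample[0])[1]
    average = sum(value for _, value in in_window) / len(in_window)
    return latest, average


now = datetime(2015, 1, 1, 12, 0)
samples = [
    (now - timedelta(hours=2), 10.0),    # outside a 1 hour window
    (now - timedelta(minutes=30), 20.0),
    (now - timedelta(minutes=5), 40.0),
]
print(latest_and_average(samples, now, timedelta(hours=1)))  # prints (40.0, 30.0)
```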

    New official plugins for entropy, inodes, ProFTP, Zombies and Zookeeper

    We’re in the process of retiring our old plugin directory and rewriting many old community plugins into officially supported and updated versions. These are available on GitHub and we’re accepting pull requests for improvements and changes, as well as brand new plugins.

    The goal is to make these plugins easier to install (just drop the file into your agent plugin directory) and to ensure they are kept up to date and fully supported by us.

    New API documentation + Dashboard API

    We’ve updated and expanded our API documentation with a new template and example calls for Python, Ruby and curl.

    In addition, there’s now an API endpoint for managing dashboards and the widgets on them. This allows you to programmatically create new dashboards and new widgets, perhaps as part of your provisioning process for new environments.

    Server Density v1 shutdown in March

    Last month we announced the shutdown date of 24th March for Server Density v1. Users still on v1 are being sent regular reminders to migrate, which only takes a few minutes and does not cost anything extra.

    What’s coming next?

    Over the last few months we have been working on moving our alerts processing backend from Celery + MongoDB to Storm + Kafka, which sets the foundations for a range of new alerting functionality we’ll be releasing from March. Tagging, which was released in December, is a key part of this functionality. Before March, we’ll be releasing more plugins and full process lists within the UI.