Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. Life on Call: Productivity and Wellbeing around the Clock


    Picture this:

    They hand you a portable gadget that makes Gordon Gekko’s cellphone look hip. You must carry it 24/7. Every hour—day and night—you have to switch it on and listen to a series of three-digit code numbers. If you hear your number you need to race to the nearest telephone to find out where you’re needed. How does that sound?

    Launched in 1950 by physicians in New York, the beeper marked a seminal point. Work didn’t quite end when you left the office anymore.

    SaaS on-call

    In the early noughties, you needed north of a million dollars before your startup wrote its first line of code.

    All that money was used to buy things that we now get for free. That’s because much of our infrastructure has since become a commodity we don’t have to worry about.

    Doesn’t that mean fewer moving parts to fix? And less need for on-call work? Probably not. Near-zero infrastructure costs mean near-zero barriers to entry. It means more competition. Less friction also means greater speed. Some call it streamlined, agile, elastic. Others call it frantic. The point is, features are often tested in production these days.

    On-call work has always been about reacting and mitigating. A physician will not invent a cure while on-call. They can only treat symptoms. Same goes with DevOps teams. An engineer will not fix code while on-call. They will do something more blunt, like restart a server. To restart a server they don’t walk to the back-of-house. Instead, they start an SSH session. If it still doesn’t work they raise a ticket. All from the (dis)comfort of their living room sofa.


    DevOps and sysadmin teams have not exactly cornered the market for on-call work. Far from it. One in five EU employees are actually working on-call.

    Even if you think your work does not require on-call duties, think again. When was the last time you checked your work email? The occasional notification may sound harmless. It’s not unheard of, however, for emails to arrive past midnight, followed by text messages asking why they were not answered.

    The anxiety of on-call work stems from the perceived lack of control. It doesn’t matter if the phone rings or not. Being on-call and not being called is, in fact, more stressful than a “busy” shift, according to this article. It is this non-stop vigilance, having to keep checking for possible “threats” that is unhealthy.

    We are not here to demonize on-call. As engineers in an industry that requires this type of work, we just think it pays to be well informed of potential pitfalls.

    Goodwill is Currency

    When the alert strikes at stupid o’clock, chances are you’ll be fixing someone else’s problem. It’s broken. It’s not your fault. And yet here you are, in a dark living room, squatting on a hard futon like Gollum, cleaning up somebody else’s mess.

    On-call work is a barometer of goodwill in a team. There is nothing revelatory about this. Teamwork is essential.

    What happens when you (or a family member) is not feeling well? What if you need to take two hours off? Who is going to cover for you if everyone is “unreachable” and “offline”?

    The absence of goodwill makes on-call duty exponentially harder.

    Focus on Quality

    Better quality means lower incident rates. Over time, it nurtures confidence in our own systems: we stop expecting things to break as often. The fear of impending incidents subsides, and our on-call shifts become less fraught.

    To encourage code resilience, it makes sense to expose everyone—including devs and designers—to production issues. In light of that, here at Server Density we run regular war games with everyone who participates in our on-call rotation. So when it comes to solving production alerts, our ops and dev teams have a similar knowledge base.

    We also write and use simple checklists that spell things out for us. Every single step. As if we’ve never done this before. At 2:00 in the morning we might have trouble finding the lights, let alone the Puppet master for our configuration.

    Our “Life On Call” Setup

    Each engineer at Server Density picks their own gear. In general we use a phone to get the alerts and a tablet or laptop to resolve them.

    For alerting we use PagerDuty. For multi-factor authentication we run Duo and for collaboration we have HipChat. We also have a Twitter client to notify us when PagerDuty is not available (doesn’t happen often).

    Upon receiving an alert we switch to a larger display in order to investigate and resolve it. 80% of our incidents can be dealt with using a tablet. All we need is an SSH client and a full browser. The tablet form factor is easier on the back and can be taken to more places than a laptop.


    An overnight alert is like an auditory punch in the face. At least you’ve signed up for this, right? What about your partner? What have they done to deserve this?

    To avoid straining relationships it pays to be proactive. Where do you plan to be when on-call? Will you have reception? If an alert strikes, will you have 4G—preferably Wi-Fi—in order to resolve it? What about family obligations? Will you be that parent at the school event, who sits in a corner hiding behind a laptop for two hours?


    At best, working on-call is nothing to write home about. At worst, well, it kind of sucks.

    Since it’s part of what we do though, it pays to be well informed and prepared. Focusing on code resilience, nurturing teamwork, and setting the right expectations with colleagues and family are some of the ways we try to take the edge off it.

    What about you? How do you approach on-call? What methodologies do you have in place?

  2. Server Naming Conventions and Best Practices


    Early last year we wrote about the server naming conventions and best practices we use here at Server Density. The post was inspired in part by an issue in one of our failover datacenters in San Jose.

    Our workaround involved using Puppet to modify the internal DNS resolvers of the affected servers. Targeting that automated change at the right machines (our San Jose datacenter servers) turned out to be painless. Why? Because our naming convention includes a functional abbreviation for each server’s location.

    Impressed with how easy deploying the fix was, we decided to talk about it in a post that has since proved quite popular. Frankly, we didn’t expect server naming conventions to be such a gripping topic. So we set out to understand why.

    As it turns out, there are two main reasons why server names strike a chord with people.

    1. Naming Things is Hard

    “There are only two hard problems in Computer Science: cache invalidation and naming things.”

    Phil Karlton

    Naming servers can get very tough, very quickly. That’s partly because deploying them has become incredibly easy. You can have a new server up and running in as little as 55 seconds.


    It’s not unheard of for sysadmins to be responsible for dozens, hundreds, perhaps even thousands of servers these days. The cognitive burden involved with naming and managing rapidly escalating swarms of devices is beyond what humans are accustomed to.

    Most people only ever get to name a few things in their life.


    And yet, that’s what sysadmins find themselves doing for much of their day.

    2. Servers Are not What They Used to Be

    Not long ago, servers occupied physical space. We could see and touch them.

    That’s no longer the case. The elastic nature of modern infrastructure gives us an order of magnitude more servers, most of which we will never see or touch.

    Attempting to stretch old-school naming conventions (planet names and Greek Gods) to the limitless scope of this brave new world is proving to be difficult, if not impossible.

    Our habits and practices hail from a time when caring for each individual box was part of our job. When that doesn’t work, or suffice, we experience cognitive dissonance and confusion.

    “We’ve had flame wars on internal team mailing lists arguing how it should be done to no result,” wrote one sysadmin on Reddit.

    There is no such thing as a golden rule—much less a standard—on how to name servers.

    A sysadmin said they name servers “randomly, based on whatever whim we are on that day”. Other methods were even more state of the art: “I just roll my face on the keyboard. That way there’s sure to be no duplicates.”

    We spoke to Matt Simmons from Standalone Sysadmin to get his expert opinion on this transition.

    “A computer infrastructure largely exists in one of two worlds,” he says. “Either you have so few machines that you individually deal with them, and they’re treated as pets, or you have so many that you can’t individually deal with them, and you treat them like cattle.”

    Servers as Pets – our Old Scheme

    Names give meaning. They allow us to understand and communicate. When we are battling with the same few boxes day in day out, it makes sense to give them personable, endearing names.

    From Aztec Gods and painkillers, to Sopranos and Babar the Elephant, there is no shortage of charming candidates.

    When dealing with a limited number of servers, “pet” naming conventions work really well. They are short, memorable, and cute. They’re also completely decoupled from the server’s role, which makes it harder for attackers to work out what a server does from its name alone (security through obscurity).

    Back when we had a smaller number of servers we based our naming scheme on characters from His Dark Materials by Philip Pullman. A master database server was Lyra and the slave was Pan.

    Much of the formal guidance we found online caters for similar scenarios, i.e. finite numbers of servers. An old RFC offers some astute guidance:

    Words like “moron” or “twit” are good names if no one else is going to see them. But if you ever give someone a demo on your machine, you may find that they are distracted by seeing a nasty word on your screen. (Maybe their spouse called them that this morning.)

    Servers as Cattle – our New Scheme

    “The engineer that came up with our naming scheme has left the company. Nobody knows what is hosted on Prometheus anymore.” Sound familiar?

    There are only so many heroes in The A-Team and Arrested Development. Dog breeds will only get you so far. There comes a point when naming servers has to be more about function than form.

    We moved to our current naming structure a few years ago. This allows us to quickly identify key information about our servers. It also helps us filter by role, provider or specific locations. Here is an example:

    hcluster3 : this describes what the server is used for. In this case, it’s part of cluster 3, which hosts our alerting and notification service (our monitoring app is built using a service-oriented architecture). Other examples could be mtx2 (our time series metrics storage cluster) or sdcom (servers which power our website).

    web1 : this is a web server (Apache or nginx) and is number 1 in the cluster. We have multiple load balanced web servers.

    sjc : this is the datacenter location code, San Jose in this case. We also have locations like wdc (Washington DC) or tyo (Tokyo).

    sl : this is the facility vendor name, Softlayer in this case. We also have vendors like rax (Rackspace) and aws (Amazon Web Services).
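
    To make the filtering concrete, here is a minimal Python sketch that splits a name of this shape into its parts. The exact separators and the combined form (e.g. "hcluster3-web1.sjc.sl") are an illustration of the idea rather than a definitive spec of our convention.

    import re

    # Illustrative format only: <cluster>-<role+number>.<location>.<vendor>
    # e.g. "hcluster3-web1.sjc.sl" -- the separators are an assumption.
    NAME_RE = re.compile(
        r"^(?P<cluster>[a-z]+\d+)-(?P<node>[a-z]+\d+)\."
        r"(?P<location>[a-z]+)\.(?P<vendor>[a-z]+)$"
    )

    def parse_server_name(name):
        """Return the parts of a structured server name, or None if it doesn't match."""
        match = NAME_RE.match(name)
        return match.groupdict() if match else None

    print(parse_server_name("hcluster3-web1.sjc.sl"))
    # {'cluster': 'hcluster3', 'node': 'web1', 'location': 'sjc', 'vendor': 'sl'}

    Being able to pull names apart like this is exactly what makes filtering by role, location or provider trivial in configuration management.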

    The advantage of this naming convention is that it scales as we grow. We can append and increment the numbers as needed.

    As per above, it’s also easy to modify servers based on criteria (role, provider, location). In our Puppet /etc/resolv.conf template we can do things like:

    <% if (domain =~ /sl/) -%>
    <% if (domain =~ /sjc/) -%>
    # google DNS - temp until SL fixed
    <% else %>
    # Internal Softlayer DNS
    <% end -%>
    <% end -%>

    One disadvantage with long server names is that they can be unwieldy.

    When compared to “pet” names, “cattle” server names are hard to remember and even harder to type in CLIs. They also need to be updated when servers are moved to different geographies or their roles change.

    Security-wise they’re often seen as less robust than their “pet” name equivalents. That’s because they make it just one step easier for hackers, by helping them deduce the names of servers they want to access.


    The transition to cloud computing has caused a dramatic increase in the number of servers that sysadmins are tasked with administering (and naming).

    A good naming convention should make it easy to deploy, identify and filter through your server pool. If you’re only planning to have a handful of servers, then coming up with real names (servers as pets) might suffice.

    For anything remotely scalable, where identifying your servers quickly is key, consider something more practical and functional (servers as cattle).

  3. Security Principles and Practices: How to Approach Security


    October is Security Month here at Server Density. To mark the occasion we’ve partnered with our friends at Detectify to create a short series of security dispatches for you.

    In our previous three articles we looked at some essential security checks for your web applications, APIs and servers. But once the obvious vulnerabilities are considered, what happens next? How can we stay proactive and, most importantly, how do we become security conscious?

    What follows is a set of underlying security principles and practices you should look into.

    Minimise your Attack Surface

    An attack surface is the sum of the different points (attack vectors) from where an unauthorized user can inject or steal data from a given environment. Eliminating possible attack vectors is the first place to start when securing your systems.

    This means closing down every possible interface you’re not using. Let’s take web apps for example. Ports 80 and 443 should be the only ones open to the outside world. SSH port 22 (preferably changed to something else) should be accessible to a restricted subset of permitted IPs and only developers / administrators should have access. The obvious idea is to limit the scope for outside attackers to creep in.
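
    If you’re unsure what is actually reachable, a few lines of Python are enough for a quick spot check from an outside machine. This is a minimal sketch; the host and port list are placeholders for your own, and a dedicated scanner such as nmap will do a far more thorough job.

    import socket

    # Placeholders: substitute your own host and the ports you expect to be open.
    HOST = "example.com"
    PORTS = [22, 80, 443, 3306, 27017]

    def is_open(host, port, timeout=2.0):
        """Return True if a TCP connection to host:port succeeds within the timeout."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    for port in PORTS:
        state = "open" if is_open(HOST, port) else "closed/filtered"
        print(f"{HOST}:{port} is {state}")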

    Here’s an example scenario: You run a website with the following two restrictions: i) only developers have admin access, and ii) SSH access is only available through a VPN connection. For a break-in to happen, an intruder would therefore need to compromise the credentials of one of your developers, and they would also need access to your VPN and SSH keys. The attack would have to be highly coordinated.

    What’s more, any potential intrusion might not yield that much (internal systems may employ “defense in depth” and “least privilege” practices). It’s unlikely an attacker would spend the time and resources to jump through all those hoops (for uncertain gain), purely because there are easier targets out there.

    Most attacks are opportunistic. Which is why layers of security are important. Breaching one layer just gets you down to the next one rather than compromising the whole system. The rule of thumb is, attackers go after the easiest targets first. Your systems should, therefore, be as locked down as possible. This includes servers, workstations, phones, portables, et cetera. As the attack surface diminishes, so does the likelihood of hacking attempts.

    If you don’t know what to look out for, third party services can help you determine how breachable your systems are. For example:

    • Detectify can evaluate your web applications
    • Nessus can scope your network-layer security
    • Penetration testers (pentesters) can assess your end-to-end security profile

    You then need to put the effort in and plug the issues that come up.

    Internal Practices and Company Culture

    The strongest of perimeters can’t protect against internal human error. Take “errors of commission,” for example. An employee quits their job, goes to a competitor and leaks intel. How do you anticipate and prevent that?

    Then there is a long list of “errors of omission”. People have businesses to run, busy lives to lead, important things to do. Staying secure is not always top-of-mind and we let things slide.  For example, are employees reminded to encrypt their laptops and portables? When was the last time you monitored your server activity? What systems do you have in place to negate the need to “remember”? Who handles security in your team? Who is accountable?

    Humans are the weakest link when it comes to safe systems. Your internal systems (and practices) need to account for that. Security needs to be a fundamental part of how you work and collaborate on projects.

    “Given enough eyeballs, all bugs are shallow”

    Linus’s Law

    Your internal practices should facilitate as many “eyes on the code” as possible. This can be done with peer reviews and code buddy schemes. To complement your team efforts, there are some compelling platforms for bug bounty and bug reporting you can tap into. [NB: Crowd skillsets are not—strictly speaking—an internal constituent of company culture. Admitting we don’t know it all and asking for help, however, is.]

    What Motivates Hackers?

    Some of them are out to prove a point. Others are criminal gangs looking for financial gains such as extortion and credit card theft. Then there is industrial espionage, botnets and a whole host of ugly stuff. The threat landscape is highly diverse. Ultimately all it takes is a single misstep for an attacker to get the keys to the kingdom.

    It therefore pays to think like a hacker. Why would someone want to hack your server? What data lives there? What is the easiest way in? What could the attacker do once inside?

    “The Enemy Knows the System”

    According to Kerckhoffs’s principle every secret creates a potential failure point. If you’re relying on “security through obscurity” to stay safe, then your systems are as safe as your secrets (see human factor above).

    A secure authentication policy, for example, does not depend on secrecy. Even if a password was compromised (how easy is it to impart a 20 character randomised password?) an attacker would still need a separate token to gain access (MFA).

    Further Reading

    If there is one underlying theme in our security dispatches so far, it is this: security is an incredibly fast-moving field, with plenty of complexity and trade-offs involved.

    Getting up to speed and staying on top of the latest security trends and threats is a key requirement in maintaining secure systems and infrastructure.

    Reddit’s /r/netsec is a great starting point. Hacker News tends to highlight the most evil vulnerabilities. There’s a bunch of very skilled security researchers on Twitter. Some indicative profiles are @SophosLabs, @TheHackersNews and @mikko.

    Some blogs we like are:

  4. Server Security Checklist: Are you at risk?


    October is Security Month here at Server Density. To mark the occasion we’ve partnered with our friends at Detectify to create a short series of security dispatches for you.

    Last week we covered some essential API security checks. In this third installment we turn our focus to server security.

    Securing your server is at least as important as securing your website and API. You can’t have a secure website (or API) if the infrastructure they rely on is not solid. What follows is a server security checklist with 5 risks you need to consider.

    1. Security Updates

    Many vulnerabilities have zero-day status, i.e. the vulnerability is discovered (and disclosed) before the software vendor gets a chance to plug the hole. A race to a patch then ensues (see the Shellshock example). It’s often just a matter of hours before a public vulnerability turns into a malicious automated exploit. Which means it pays to be quick off the mark when it comes to applying your security updates.

    You may want to consider automatic security updates (here is how to do this for Ubuntu, Fedora, Red Hat & CentOS 6, and CentOS 7). Note, however, that automatic updates can cause problems if they happen when you’re not expecting them or if they introduce compatibility issues. For example, an automatic update of MySQL will cause it to restart, killing all open connections.

    We recommend you configure your package manager to only download the upgrades—i.e. without auto-installation—and then send regular notifications for your review. Our policy is to monitor security lists and apply updates at a set time every week. Unless the vulnerability is critical and we have to react immediately.

    2. Access Rights

    Rationalising access rights is a key security step. It prevents users and services from performing unintended actions. This includes everything from preventing the “root” account from logging in over SSH, to disabling the shells of default accounts that are never logged into interactively. For example:

    • Does PostgreSQL really need /bin/bash?
    • Can privileged actions be accomplished through sudo?
    • Are cron jobs locked down so that only specific users may access them?

    3. SSH Brute force

    A very common point of attack is for bots to brute force accounts through SSH. Some things to look at:

    • As per previous section, it’s essential to disable remote login for the root account as it’s the most common account to be attacked.
    • As these bots focus on passwords specifically, you can reduce the attack surface by employing public/private keys instead of passwords.
    • You can go a step further by changing the SSH port from the default 22 to something else. The new port can, of course, be revealed with a port scanner (port knocking can mitigate this), however internet-wide sweeping bots are opportunistic and rarely go that far.
    • A more drastic measure is to block all traffic and whitelist specific IPs. Ask yourself, does the entire internet need access to your servers?

    As a general note, security through obscurity is never a good goal, so be conscious of introducing unwarranted complexity.

    4. File System Permissions

    Consider this scenario. Someone finds a remote code execution vulnerability in a PHP script of a web app. The script runs as the www-data user. Consequently, any code injected by the attacker will also be executed as www-data. If they decide to plant a backdoor for persistence, the easiest thing to do is write another PHP file with malicious code and place it in the webroot.

    This could never happen if www-data had no write access. By restricting what each user and service can do (least privilege principle) you limit any potential damage from a compromised account (defense in depth).

    File system permissions need to be granular. Some examples to consider:

    • Does www-data need to write files in the webroot?
    • Do you use a separate user for pulling files from your git repository? (We strongly advise against running your website from a github checkout.)
    • Does www-data need to list files in the webroot?

    We suggest you spend some time reviewing your file system permissions on a regular basis.
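
    If you want to automate part of that review, the sketch below walks a webroot and flags anything the web server account could plausibly write to. The path and username are assumptions; adjust them for your own setup.

    import os
    import pwd
    import stat

    # Assumptions: adjust the webroot path and the account your web server runs as.
    WEBROOT = "/var/www/html"
    WEB_USER = "www-data"

    web_uid = pwd.getpwnam(WEB_USER).pw_uid

    for root, _dirs, files in os.walk(WEBROOT):
        for name in files:
            path = os.path.join(root, name)
            st = os.stat(path)
            # Flag files owned by the web user, or world-writable files, for review.
            if st.st_uid == web_uid or st.st_mode & stat.S_IWOTH:
                print(f"review permissions: {path}")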

    5. Server Monitoring

    Any anomalous server activity can indicate a breach. For example:

    • A spike in error_log entries can be the result of an attacker trying to fuzz a system for vulnerabilities.
    • A sudden but sustained increase in network traffic can be the result of an ongoing DDoS (Distributed Denial of Service) attack.
    • An increase in CPU usage or disk IO can indicate the exfiltration of data, as happened when Logica got hacked, affecting the tax authorities of Sweden and Denmark. Ouch.
    • An increase in disk usage is yet another sign. Once a server is compromised, hackers often use it as an IRC or torrent server (adult content and pirated movies), et cetera.

    When something does go wrong and your server is affected, time is of the essence. That’s where reliable alerting and server monitoring (that’s us!) come in handy.

    What’s Next

    In our next security dispatch we’re going to take a broader look at how to become proactive about security, and discuss ways to instill a security-conscious culture in your organisation. You should also read the other articles from our security month, including the website security checks you should be considering, and how to secure your API.

  5. 5 API Security Risks and How to Mitigate them


    Update 15th Oct 2015: Part 3 is here.

    October is Security Month here at Server Density. To mark the occasion we’ve partnered with our friends at Detectify to create a short series of security dispatches for you.

    Last week we covered some essential website security checks. In this second instalment, we turn our focus to API security risks.

    Best of Both Worlds

    Openness and security are two opposing priorities. Intelligent API design is a balancing act between the two. How do you open up your application and integrate with the outside world without presenting an attack surface that jeopardizes your security?

    A good API creates possibilities, but it also creates boundaries. What follows are 5 design pitfalls you need to be aware of when securing your API.

    1. Lack of TLS/SSL

    Encryption at the transport layer is the first step towards secure APIs. Without the use of proper transport security, an eavesdropper will be able to read and tamper with your data (Man In The Middle attack).

    Acquiring a TLS certificate is inexpensive and straightforward. We wrote about transport layer security (HTTPS) in last week’s dispatch, and we’ve also touched on it here.

    2. Encryption does not imply Trust

    In order for encrypted communication to commence, the web client needs to validate the SSL certificate presented by the server. This validation process is not always straightforward, and if not implemented properly it creates potential certificate validation loopholes.

    If exploited, this vulnerability allows hackers to use fake certificates and traffic interception tools to obtain usernames, passwords, API keys and—most crucially—steal user data.

    Here is how it works. An attacker forges a malicious certificate—anyone with an internet connection can issue ”self-signed” SSL certificates—and gets the clients to trust it. For example, a bogus certificate could have a name that closely resembles a trusted name, making it harder for an unsuspecting web client to tell the difference. Once this “weak validation” takes place the attacker gains read / write access to user data, in what is otherwise an encrypted connection. Instapaper, for example, recently discovered a certificate validation vulnerability in their app.

    Make sure the clients are properly validating certificates (pertaining to the specific sites they access) with a trusted certification authority. You can also look at key pinning as an additional measure (we do this for our iOS app). This process associates a host with a particular certificate or key, so any change in those—when the client is attempting to connect—will trigger a red flag.
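
    To illustrate the pinning idea, the sketch below fetches a server’s certificate and compares its SHA-256 fingerprint against a value pinned ahead of time. The host and fingerprint are placeholders, and a real client (mobile or otherwise) would do this as part of the TLS handshake rather than as a separate check.

    import hashlib
    import ssl

    # Placeholders: pin the fingerprint you obtained out-of-band for your own host.
    HOST = "api.example.com"
    PORT = 443
    PINNED_SHA256 = "replace-with-the-expected-certificate-fingerprint"

    def certificate_fingerprint(host, port):
        """Fetch the server certificate and return its SHA-256 fingerprint (hex)."""
        pem = ssl.get_server_certificate((host, port))
        der = ssl.PEM_cert_to_DER_cert(pem)
        return hashlib.sha256(der).hexdigest()

    if certificate_fingerprint(HOST, PORT) != PINNED_SHA256:
        raise SystemExit("Certificate fingerprint mismatch - refusing to connect")
    print("Pinned certificate verified")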

    3. SOAP and XML

    SOAP is a messaging protocol that relies on XML as its underlying data format.

    The main problem with SOAP is that it’s been around for far too long. It’s also based on a complex data layer, XML. Taken together, it is a complex stack mired by numerous attack vectors including XML encryption issues, external entity attacks (XXE), and denial of service (Billion Laughs), among others.

    Part of the problem is that SOAP tends to stay in production for a long time because numerous systems rely on it, and little to no effort is spent investigating the security implications of such arrangements.

    The good news is, server-side vulnerabilities are just as easily spotted in a SOAP endpoint as in any other part of a web app.

    So make sure you don’t overlook SOAP when auditing your security. A professional 3rd party can search for vulnerable endpoints throughout your stack and advise on how to patch them.
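
    If your endpoints do have to parse XML in Python, one partial mitigation for entity expansion attacks like Billion Laughs is to swap the standard library parser for a hardened one. A minimal sketch, assuming the third-party defusedxml package is installed:

    from defusedxml import EntitiesForbidden
    from defusedxml.ElementTree import fromstring

    # A classic "Billion Laughs" style payload: nested entity expansion.
    MALICIOUS = """<?xml version="1.0"?>
    <!DOCTYPE lolz [
      <!ENTITY lol "lol">
      <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
    ]>
    <lolz>&lol2;</lolz>"""

    try:
        fromstring(MALICIOUS)
    except EntitiesForbidden:
        # defusedxml refuses to expand entities instead of exhausting memory.
        print("Rejected: entity definitions are not allowed")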

    If you’re starting out now, you may also want to consider JSON/REST as an alternative. Over the last few years, this protocol has prevailed over the more complicated SOAP/XML for most scenarios, except perhaps legacy systems and corporate environments. We chose JSON for our server monitoring app.

    4. Business Logic Flaws

    Official API calls are designed to provide access to a subset of endpoints, i.e. data is supposed to be touched in a very specific manner. That’s the raison d’être of APIs: to create structure and boundaries.

    Attackers, however, can try alternative routes and calls to obtain data outside those boundaries. They do this by exploiting Business Logic Flaws in the design of the API. A few noteworthy organisations that fell victim to business logic flaws attacks are Facebook, Nokia, and Vimeo.

    The best way to prevent such unintended loopholes is to manually audit your APIs. A good general practice is to expose the minimum amount of data possible (principle of least privilege).

    When mission-critical information is at stake you may need the help of 3rd party experts that can help spot any loopholes. An affordable solution is to crowdsource the pentesting of APIs to companies such as BugCrowd, HackerOne, Synack or Cobalt.

    5. Insecure Endpoints

    API endpoints are often overlooked from a security standpoint. They live on for a long time after deployment, which makes developers and sysadmins less inclined to tinker with them for fear of breaking the legacy systems that rely on those APIs (think enterprises, banks, etc.). Endpoint hardening measures (hashes, key signing and shared secrets, to name a few) are therefore easier to incorporate at the early stages of API development.
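
    To illustrate the shared secret idea, here is a minimal sketch of HMAC request signing. The header name, secret handling and payload are assumptions for illustration, not a drop-in scheme.

    import hashlib
    import hmac

    # Assumption: client and endpoint share this secret; in practice it would come
    # from a secrets store, never from source code.
    SHARED_SECRET = b"replace-me"

    def sign(payload: bytes) -> str:
        """Return a hex HMAC-SHA256 signature for the request body."""
        return hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()

    def verify(payload: bytes, signature: str) -> bool:
        """Constant-time comparison of the presented signature with the expected one."""
        return hmac.compare_digest(sign(payload), signature)

    body = b'{"metric": "cpu", "value": 0.42}'
    signature = sign(body)          # the client sends this, e.g. in an X-Signature header
    print(verify(body, signature))  # the endpoint recomputes and compares: True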

    What’s Next

    Our next security dispatch will look at some of the top server security checks you need to be aware of. You should also read the other articles from our security month, including the website security checks you should be considering, and how to secure your servers.

  6. 5 Website Security Checks: Are you at risk?


    October is Security Month here at Server Density, and to mark the occasion we’ve partnered with our friends at Detectify to create a short series of security dispatches for you.

    As we’ve written before, humans are the weakest link when it comes to safe systems, and there are a number of best practices that help us mitigate that risk. In this series, however, we will focus purely on technology.

    Website Security Checks

    According to the 2015 WhiteHat Security Statistics Report, an overwhelming majority of websites suffered from at least one serious vulnerability. Even when those vulnerabilities were addressed, their time-to-fix was unacceptably long. A great number of website owners are not aware of—let alone equipped to deal with—online security threats.

    What follows is a collection of website security checks you can start with. It covers a number of known threats you need to prepare for, and secure your website against.

    1. Lack of HTTPS

    Traditional HTTP is not encrypted, and therefore, it is not secure. It allows an attacker to perform a man-in-the-middle attack, placing user credentials, cookies and other sensitive data at risk.

    DigitalOcean have written a great set of instructions on how to acquire and install an SSL certificate, and you can read our suggestions here. SSL certificates are fairly inexpensive and can be issued within minutes. And if you’re an experienced sysadmin you can harden your website with stronger ciphers too.

    2. Cross-Site Scripting

    XSS is the most frequently occurring security threat in web applications. It allows attackers to inject malicious scripts into web pages, affecting all subsequent visitors.

    Modern frameworks do a fairly good job at preventing XSS. This means legacy applications are the ones most exposed to this risk. You can mitigate XSS using libraries like DOMPurify. OWASP offers some comprehensive instructions on how to deal with XSS.
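
    For code you maintain yourself, the core defence is the same on any stack: escape untrusted data before it reaches the page. A minimal Python sketch of the idea (most templating engines will do this for you when autoescaping is enabled):

    import html

    # Untrusted input, e.g. a comment submitted by a visitor.
    user_input = '<script>alert("xss")</script>'

    # Escaping turns markup into inert text before it is rendered.
    print(html.escape(user_input))
    # &lt;script&gt;alert(&quot;xss&quot;)&lt;/script&gt;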

    3. SQL Injection

    This is a critical vulnerability affecting database-driven applications. Attackers exploit any chinks in data entry mechanisms (e.g. username and password boxes) to tamper with SQL queries and break into the backend database of a website. This opens the possibility of data exfiltration and remote code execution.

    While it may not be as prevalent as it used to be, some very high profile leaks (Bell Canada, Wall Street Journal, SAP among others) happened through SQL injection attacks. Check Point recently compiled a list of SQL injection trends.

    Using parameterized queries and stored procedures can reduce the likelihood of attack. They work by helping the database to distinguish between user data and SQL code.
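
    Here is a minimal sketch of the difference using Python’s built-in sqlite3 driver; the same placeholder approach applies to the MySQL and PostgreSQL drivers.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, password TEXT)")
    conn.execute("INSERT INTO users VALUES ('alice', 's3cret')")

    username = "alice' OR '1'='1"  # hostile input

    # Vulnerable: concatenating user data straight into the SQL string, e.g.
    #   "SELECT * FROM users WHERE name = '" + username + "'"

    # Safe: the driver keeps data and SQL separate via a bound parameter.
    rows = conn.execute("SELECT * FROM users WHERE name = ?", (username,)).fetchall()
    print(rows)  # [] -- the injection attempt matches nothing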

    Web Application Firewalls (WAFs) can further reduce the risk of SQL injections by applying blacklists to known attack patterns using regex or similar techniques. This blocks most automated scanners (bots) and provides some low hanging fruit protection against opportunist attacks. Targeted attacks, however, will probably bypass these filters.

    Finally, the presence of an ORM layer may help with SQL injection as it negates the need (or opportunity) to write actual SQL. However, this extra layer between code and database carries a CPU overhead. ORM is also known to generate complex and unoptimized SQL queries.

    4. Cross-Site Request Forgery

    A CSRF attack forces a user’s web browser to perform an unwanted action on a site the user is authenticated to. HTML forms that don’t offer integrity validation are most at risk.

    You can prevent such vulnerabilities by applying CSRF tokens to forms that undertake authenticated actions (such as updating a user profile or a password change). All modern frameworks have settings for mitigating the risk of CSRF.
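
    If you do have to roll your own, the underlying mechanism is small: issue a random token with the form and reject any submission that does not echo it back. A minimal sketch, with a plain dict standing in for a real session store:

    import hmac
    import secrets

    def issue_csrf_token(session):
        """Store a random token in the user's session and embed it in the form."""
        token = secrets.token_hex(32)
        session["csrf_token"] = token
        return token

    def is_valid_csrf_token(session, submitted):
        """Accept the POST only if the submitted token matches the session's copy."""
        return hmac.compare_digest(session.get("csrf_token", ""), submitted or "")

    session = {}
    form_token = issue_csrf_token(session)
    print(is_valid_csrf_token(session, form_token))  # True
    print(is_valid_csrf_token(session, "forged"))    # False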

    5. Outdated Software

    Updating your software is one of the most straightforward ways to protect your website and your users. A single unpatched vulnerability may be all it takes for an attacker to compromise your server. Make sure your WordPress stack—including themes and plugins—is entirely up to date.

    Upgrading to the latest bug-fixed versions of all your software should be a scheduled recurring activity.

    Next Steps

    In our next security dispatch we will look at some of the top API security checks you need to be aware of. You should also read the other articles from our security month, including the API security holes you should be considering, and how to secure your servers.

  7. Using MongoDB as a Time Series Database


    We’ve used MongoDB as a time series database since 2009.

    MongoDB helps us scale for the expanding volumes of data we collect in our server monitoring service. Over the years, we went from processing 40GB per month, to more than 250TB.

    By the way, while it has served us well over the years, I’m not necessarily advocating MongoDB as the best possible database for time-series data. It’s what we’ve used so far (and we’re always evaluating alternatives).

    On that basis, I’ve written a few times about how we use MongoDB (here is a recent look at the tech behind our time series graphs). As part of our upcoming Story of the Payload campaign (stay tuned!) I thought I’d revisit our cluster setup with a detailed look at its inner workings.

    The hardware

    Over the years, we’ve experimented with a range of infrastructure choices. Those include Google Compute Engine, AWS, SSDs versus spinning disks, VMWare, and our transition from managed cloud to Softlayer where we are today. We standardize on Ubuntu Linux and the cluster is configured as follows:

    Ubuntu Linux 12.04 LTS

    We run every server on an LTS release and upgrade to the next LTS on a fixed schedule. We can speed up specific upgrades if our team needs access to newer features or bundled libraries.

    Bare metal servers

    We experimented with VMs in the past and found host contention to be an issue, even with guaranteed disk I/O performance (products like AWS EBS or Compute Engine SSDs offer this option; SoftLayer doesn’t).

    Solid State Disks

    We have multiple SSDs, and house each database—including the journal—on its own disk.

    Everything is managed with Puppet

    We used to write our own manifests. The official Forge MongoDB module has since become a better option so we are migrating to it.

    As for the servers themselves, they have the following specs:

    • x2 2GHz Intel Xeon-SandyBridge (E5-2650-OctoCore) (16 cores total)
    • x16 16GB Kingston DDR3 (256GB total)
    • x1 100GB Micron RealSSD P300 (for the MongoDB journal)
    • x2 800GB Intel S3700 Series (one per database)

    The MongoDB cluster

    Our current environment has been in use for 18 months. During this time we scaled both vertically (adding more RAM) and horizontally (adding more shards). Here are some details and specs:

    • x3 data node replica sets plus 1 arbiter per shard.
    • x2 nodes in the primary data centre at Washington DC, and a failover node at San Jose, CA. The arbiter is housed in a third data centre in Dallas, TX.
    • x5 shards with distribution based on item ID (server, for example) and metric. This splits up a customer’s record across multiple shards for maximum availability. MongoDB deals with balancing using hash-based sharding.
    • The average workload is around 6000 writes/sec which equates to about 500,000,000 new documents per day.
    • We use the MongoDB Cloud Backup service which offers real-time offsite backups. It acts as a replica node for each replica set. It receives a (compacted and compressed) copy of every write operation. Current throughput sits at a sustained 42 Mbps.
    • We use the Google Compute Engine and MongoDB Cloud Backup service API to restore our backups and verify them against our production cluster, twice per day.
    • We keep a copy of the backup in Google’s Cloud Storage in Europe as a final disaster recovery option. We store copies twice per day, going back for 10 days.

    The Data

    The write workload of the cluster consists of inserts and updates, for the most part.

    For the lowest granularity level of data, we use an append-only schema where new data is inserted and never updated. These writes take approx 2-3ms.

    For the hourly average type of metrics (we keep those forever – check out our monitoring graphs) we allocate a document per day, per metric, per item. This document gets updated with the sum and count. From that, we calculate the mean average when we query the data. These writes typically complete within 500 ms.

    We optimise for writes in place, use field modifiers and avoid growing documents by pre-allocation. Even so, there is a large overhead associated with updating documents.
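
    As an illustration of the sum/count pattern, here is a minimal pymongo sketch. The collection and field names are made up for the example, and it skips the pre-allocation step mentioned above.

    from datetime import datetime, timezone

    from pymongo import MongoClient

    client = MongoClient("mongodb://localhost:27017")
    metrics = client["monitoring"]["hourly_metrics"]  # illustrative names only

    def record_value(item_id, metric, value, when=None):
        """Add a data point to the per-day document for this item and metric."""
        when = when or datetime.now(timezone.utc)
        day = when.strftime("%Y-%m-%d")
        hour = str(when.hour)
        metrics.update_one(
            {"item": item_id, "metric": metric, "day": day},
            {"$inc": {f"hours.{hour}.sum": value, f"hours.{hour}.count": 1}},
            upsert=True,  # create the day document on first write
        )

    def hourly_average(doc, hour):
        """The mean is computed at read time from the stored sum and count."""
        bucket = doc["hours"][str(hour)]
        return bucket["sum"] / bucket["count"]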

    When querying the data (drawing graphs for example) the average response time as experienced by the user is 189 ms. Median response is 39 ms, the 95th percentile is 532 ms, and 99th percentile is 1400 ms. The majority of that time is used by application code as it constructs a response for the API. The slowest of queries are the ones comprising multiple items and metrics over a wide time range. If we were to exclude our application code, the average MongoDB query time is 0.0067ms.


    So that’s the MongoDB cluster setup we have here at Server Density. We would like to hear from you now. How do you use MongoDB? What does your cluster look like, and how has it evolved over time?

  8. DevOps On-Call: How we Handle our Schedules


    I was on call 24/7/365.

    That’s right. When we first launched our server monitoring service in 2009, it was just me that knew how the systems worked. I was on-call all the time.

    You know the drill. Code with one hand. Respond to production alerts with the other.

    In time, and as more engineers joined the team, we became a bit more deliberate about how we handle production incidents. This post outlines the thinking behind our current escalation process and some details on how we distribute our on-call workload between team members.

    DevOps On-Call

    Developing our product and responding to production alerts when the product falters, are two distinct yet very much intertwined activities. And so they should be.

    If we were to insulate our development efforts from how things perform in production, our priorities would get disconnected. Code resilience is at least as important as building new features. In that respect, it makes great sense to expose our developers and designers to production issues. As proponents of DevOps we think this is a good idea.

    Wait, what?

    But isn’t this counterproductive? The state of mind involved in writing code couldn’t be more different from that of responding to production alerts.

    When an engineer (and anyone for that matter) is working on a complex task the worst thing you can do is expose them to random alerts. It takes more than 15 minutes, on average, to regain intense focus after being interrupted.

    How we do it

    It takes consideration and significant planning to get this productivity balance right. In fact, it’s an ongoing journey. Especially in small and growing teams like ours.

    With that in mind, here is our current process for dealing with production alerts:

    First Tier

    During work hours, all alerts go to an operations engineer. This provides a much needed quiet time for our product team to do what they do best.

    Outside work hours, alerts could go to anyone (ops or product alike). We rotate between team members every seven days. At our current size, each engineer gets one week on call and eight weeks off. Everyone gets a fair crack of the whip.

    Second Tier

    An escalated issue will most probably involve production circumstances unrelated to code. For that reason, second level on-call duty rotates between operation engineers only, as they have a deeper knowledge of our hardware and network infrastructure.

    Our PagerDuty Setup

    For escalations and scheduling we use PagerDuty. If an engineer doesn’t respond within the time limit (15 minutes of increasingly frequent notifications via SMS and phone) there will always be someone else available to take the call.

    Our ops engineers are responsible for dealing with any manual schedule overrides. If someone is ill, on holiday or is traveling then we ask for volunteers to rearrange the on-call slots.


    After an out-of-hours alert the responder gets the following 24 hours off from on-call. This helps with the social/health implications of being woken up multiple nights in a row.

    Keeping Everyone in the Loop

    We hold weekly meetings with operations and product leads. The intention is to keep everyone on top of the overall product health, and to help our product team prioritise their development efforts.

    While I don’t have on-call duties anymore (client meetings and frequent travel while on call just doesn’t make sense any more) I can still monitor our alerts on the Server Density mobile app (Android and iOS) which has a prominent place on my home screen together with apps from the other monitoring tools we use.

    What about you? How do you handle your devops on-call schedules?

  9. Escaping Rabbit Holes with Rubber Duck Debugging and more


    In 1947, a team of Harvard University researchers discovered a moth stuck in a relay of their Mark II computer. A tiny bug was blocking the operation of a supercomputer.

    Theories abound as to where the term “debugging” originates. Humans have been debugging since the beginning of time. At its core, debugging is problem solving. Solving the problem of catching fish, for example, when the climate changed and freshwater streams froze. Or figuring out how to rescue the Apollo 13 crew using near-zero electricity.

    It’s not easy

    Debugging, i.e. problem solving, is the most complex of all intellectual functions. Yes, humans have been doing it for a long time but that doesn’t mean we’re good at it, or that we enjoy doing it.

    In fact we often avoid it.

    Problem solving doesn’t come free. The human brain represents no less than 20% of total body energy expenditure. And since humans are hardwired for survival and energy conservation, we tend to relegate thinking to a last resort activity.

    Getting stuck in a rabbit hole

    Even when we think we’re problem solving, chances are we’re not. We naturally default to automatic activities that don’t require actual problem solving and don’t overtax our grey matter in any way (see functional fixedness).

    In other words, it’s easier to dig than to think. Digging is a manual, repetitive activity. Do enough of that without thinking and we find ourselves in a hole that gets deeper and deeper. The deeper it gets, the more invested we become.

    Rabbit hole activity is when we spend a disproportionate amount of time on a task. The importance of what we’re doing (our sunk costs) is then warped and our logic distorted.

    Last week we poured several development hours down such a rabbit hole, trying to fix a wait/repeat issue with one of the alerts of our monitoring software.

    We started troubleshooting the core logic of our code, assuming there was something wrong with internal state migrations. The actual bug turned out to be elsewhere (the UI was updating a field it should not have been updating). So our initial hunch was misplaced. There is nothing wrong with that.

    What is not ideal is the amount of time we spent under the wrong assumption. The misapplication of our resources began when we stopped questioning our hypothesis, i.e. when we invested ourselves in it.

    We’re not impervious to getting stuck. We do, however, have systems in place to help us deal with it when it happens. Here are five of them:

    1. Restore Flow – Context Switch

    We’ve all felt it. The feeling of being in the right place. When you can’t type fast enough. When inspiration “flows” and you forget about time.

    When the opposite happens, i.e. when we’re stuck, it’s because the solution or breakthrough we’re after is not there, spatially or mentally. We’re going nowhere. We’re not moving.

    When that happens it’s often best to get up and go. A walk in nearby Chiswick Commons often does the trick for us. We leave our thought patterns behind and let our brain roam farther.

    Context switching helps. Every 6 weeks everyone stops scheduled work and spends a whole week working on a side project of their choice. This purposeful distraction is all about leaving our problems for a period of time. The difficulty has often evaporated by the time we’re back.

    Finally, we’d be remiss not to mention sleep, even if it sounds obvious. Sleep not only rejuvenates our brain but it also causes what neuroscientists call incubation effect. It’s almost as if our brain debugs for us while we sleep.

    2. Explain the problem

    “Simplicity is the ultimate sophistication.”

    Leonardo Da Vinci

    When we explain things to people we tend to slow down. Why? Because the person we’re talking to is removed from our situation. They haven’t caught up yet.

    For them to understand our problem, we set it forth in the simplest possible terms: What is it we’re trying to solve? We state our qualifiers: Why are we spending time on this problem? Why is it important? We also tell them what we’ve tried so far.

    If it sounds like hard work it’s because it is. The rigour behind good questions is often enough for solutions to magically present themselves. Articulating good questions pushes our brain into problem solving mode.

    And that paves the way to the next technique. . .

    3. Rubber Duck Debugging – Ask the Duck!

    This famous rubber duck came to life in 1999 with the publication of The Pragmatic Programmer.

    Rubber ducks are cute. Problem is, they’re not very smart. In order for them to understand the nature of the bug we’re dealing with, our explanation needs to be extra thorough. As per previous method, we need to slow down and simplify things for them.

    Staring into the guileless smiling innocence of our duck, we often find ourselves wondering: is there an easier way? Does this have to be done at all?

    This is one of our rubber ducks, here at Server Density.


    4. Peer Reviews and Pair Programming

    There are times when talking to an inanimate object is not feasible (open plan office?). Or the solution to the problem might lie outside our domain or scope.

    Nothing happens in isolation, and there is something to be said about teamwork. Having a coding buddy can alleviate some of the horrors of getting stuck. Heaven forbid, it might even make debugging fun.

    Peer reviews (many eyes on the code) are a great way to work with fellow developers and make sure your code is bug-free. They’re also a nice way to learn and develop professionally.

    5. Plan Ahead

    Decide how much individual effort (time) you intend to invest on debugging a particular issue, before you move on to something else.

    Sometimes we’re dealing with special types of bugs. Like the elusive heisenbugs, or fractal bugs that point to ever more bugs. It’s easy for a bug to turn into a productivity black hole.

    When you reach the end of the allotted time, it’s best to move on and tackle something else. Don’t let one task (bug) swallow other priorities and jeopardise the progress of your project.

    Once you’ve context-switched, worked on something else, and had a break, you’re better equipped to revisit the bug and determine your next steps (and priorities) with a clear mind.

    By the way, it’s worth mentioning debugging tools here. Things like central logging and error monitoring can make a huge difference to how quickly you solve a bug. We will discuss this topic in a future post.


    Debugging efforts often go awry and we find ourselves lost in productivity rabbit holes. It happens when our mind trades hard problem solving for easier, repetitive activities that lead nowhere.

    Learn to spot when you’re getting stuck. Know the signs and get better at climbing out of those rabbit holes. Context switching, slowing down, rubber duck debugging, and planning ahead, are proven methods that get you back on course.

  10. Building Your own Chaos Monkey


    In 2012 Netflix introduced one of the coolest sounding names into the Cloud vernacular.

    What Chaos Monkey does is simple. It runs on Amazon Web Services and its sole purpose is to wipe out production instances in a random manner.

    The rationale behind those deliberate failures is a solid one.

    Setting Chaos Monkey loose on your infrastructure—and dealing with the aftermath—helps strengthen your app. As you recover, learn and improve on a regular basis, you’re better equipped to face real failures without significant, if any, customer impact.

    Our monkey

    Since we don’t use AWS or Java, we decided to build our own lightweight simian in the form of a simple Python script. The end-result is the same. We set it loose on our systems and watch as it randomly seeks and destroys production instances.

    What follows are our observations from those self-inflicted incidents, along with some notes on what to consider when using a Chaos Monkey on your own infrastructure.


    Design Considerations

    1. Trigger chaos events during business hours

    It’s never nice to wake up your engineers with unnecessary on-call events in the middle of the night. Real failures can and do happen 24/7. When it comes to Chaos Monkey, however, it’s best to trigger failures when people are around to respond and fix them.

    2. Decide how much mystery you want

    When our Python script triggers a chaos event, we get a message in our HipChat room and everyone is on the look out for strange things.

    The message doesn’t specify what the failure is. We still need to triage the alerts and determine where the failures lie, just as we would in the event of a real outage. All this “soft” warning does is lessen the chance of failures going unnoticed.

    3. Have several failure modes

    Killing instances is a good way to simulate failures but it doesn’t cover all possible contingencies. At Server Density we use the SoftLayer API to trigger full and partial failures alike.

    A server power-down, for example, causes a full failure. Disabling networking interfaces, on the other hand, causes partial failures where the host may continue to run (and perhaps even send reports to our monitoring service).

    4. Don’t trigger sequential events

    If there’s ever a bad time to set your Chaos Monkey loose, it’s during the aftermath of a previous chaos event. Especially if the bugs you discovered are yet to be fixed.

    We recommend you wait a few hours before introducing the next failure. Unless you want your team firefighting all day long.

    5. Play around with event probability

    Real world incidents have a tendency to transpire when you least expect them. So should your chaos events. Make them infrequent. Make them random. Space them out, by days even. That’s the best way to test your on-call readiness.
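
    Pulling those considerations together, below is a stripped-down sketch of what such a script can look like. The host list, probability and power_off stub are placeholders; our real script drives the SoftLayer API and announces each event in our chat room.

    import random
    import time
    from datetime import datetime

    # Placeholders: your own inventory, and a function that actually triggers the
    # failure via your provider's API. Here it just prints.
    HOSTS = ["web1.example", "web2.example", "db1.example"]
    CHAOS_PROBABILITY = 0.05        # chance per check that we break something
    BUSINESS_HOURS = range(10, 17)  # only trigger events when people are around

    def power_off(host):
        print(f"[chaos] powering off {host} -- go and find out what broke")

    def maybe_cause_chaos():
        now = datetime.now()
        if now.weekday() >= 5 or now.hour not in BUSINESS_HOURS:
            return  # never at weekends or outside business hours
        if random.random() < CHAOS_PROBABILITY:
            power_off(random.choice(HOSTS))

    if __name__ == "__main__":
        while True:
            maybe_cause_chaos()
            time.sleep(60 * 60)  # check hourly; keeps events infrequent and spaced out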

    Initial findings

    We’ve been triggering chaos events for some time now. None of the issues we’ve discovered so far were caused by the server software. In fact, scenarios like failovers in load balancers (Nginx) and databases (MongoDB) worked very well.

    Every single bug we found was in our own code. Most had to do with how our app interacts with databases in failover mode, and with libraries we’ve not written.

    In our most recent Chaos run we experienced some inexplicable performance delays during two consecutive MongoDB server failovers. Rebooting the servers was not a viable long term fix as it results in a long downtime (>5 minutes).

    It took us several days of investigation until we realised we were not invoking the MongoDB drivers properly.

    The app delays caused by the Chaos run happened during work hours. We were able to look at the issue immediately, rather than waiting for an on-call engineer to be notified and respond, in which case the investigation would have been harder.

    Such discoveries help us report bugs and improve the resiliency of our software. Of course, it also means additional engineering hours and effort to get things right.


    The Chaos Monkey is an excellent tool to test how your infrastructure behaves under unknown failure conditions. By triggering and dealing with random system failures, you help your product and service harden up and become resilient. This has obvious benefits to your uptime metrics and overall quality of service.

    And if the whole exercise has such a cool name attached to it, then all the better.

    Editor’s note: This post was originally published on 21st November, 2013 and has been completely revamped for accuracy and comprehensiveness.

