Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. 5 Website Security Checks: Are you at risk?

    October is Security Month here at Server Density, and to mark the occasion we’ve partnered with our friends at Detectify to create a short series of security dispatches for you.

    As we’ve written before, humans are the weakest link when it comes to safe systems, and there are a number of best practices that help us mitigate that risk. In this series, however, we will focus purely on technology.

    Website Security Checks

    According to the 2015 WhiteHat Security Statistics Report, an overwhelming majority of websites suffered from at least one serious vulnerability. Even when those vulnerabilities were addressed, their time-to-fix was unacceptably long. A great number of website owners are not aware of—let alone equipped to deal with—online security threats.

    What follows is a collection of website security checks you can start with. It covers a number of known threats you need to prepare for, and secure your website against.

    1. Lack of HTTPS

    Traditional HTTP is not encrypted, and therefore, it is not secure. It allows an attacker to perform a man-in-the-middle attack, placing user credentials, cookies and other sensitive data at risk.

    DigitalOcean have written a great set of instructions on how to acquire and install an SSL certificate, and you can read our suggestions here. SSL certificates are fairly inexpensive and can be issued within minutes. And if you’re an experienced sysadmin you can harden your website with stronger ciphers too.

    2. Cross-Site Scripting

    XSS is the most frequently occurring security threat in web applications. It allows attackers to inject malicious scripts into web pages, which then affect all subsequent visitors.

    Modern frameworks do a fairly good job at preventing XSS. This means legacy applications are the ones most exposed to this risk. You can mitigate XSS using libraries like DOMPurify. OWASP offers some comprehensive instructions on how to deal with XSS.
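
    As a minimal illustration of output escaping, one of the defences the OWASP guidance covers, the sketch below uses Python’s standard html.escape. The render_comment helper is hypothetical; a real application would normally rely on its template engine’s auto-escaping rather than hand-built strings.

```python
import html


def render_comment(comment: str) -> str:
    """Wrap untrusted text in a paragraph, escaping HTML metacharacters."""
    # html.escape turns &, < and > (plus both quote characters by default)
    # into entities, so the browser displays the text instead of running it.
    return "<p>" + html.escape(comment) + "</p>"
```

    Passing a script tag through render_comment yields inert text: the angle brackets arrive in the page as &lt; and &gt; rather than as markup.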

    3. SQL Injection

    This is a critical vulnerability affecting database servers. Attackers exploit any chinks in data entry mechanisms (e.g. username and password boxes) to tamper with SQL queries and break into the backend database of a website. This opens the possibility for data exfiltration and remote code execution.

    While it may not be as prevalent as it used to be, some very high profile leaks (Bell Canada, Wall Street Journal, SAP among others) happened through SQL injection attacks. Check Point recently compiled a list of SQL injection trends.

    Using parameterized queries and stored procedures can reduce the likelihood of attack. They work by helping the database to distinguish between user data and SQL code.
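
    To make the distinction between user data and SQL code concrete, here is a small sketch using Python’s built-in sqlite3 module. The users table, its contents and the helper names are assumptions made purely for illustration.

```python
import sqlite3


def make_demo_db() -> sqlite3.Connection:
    """Create a throwaway in-memory database with two hypothetical users."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT)")
    conn.executemany(
        "INSERT INTO users (username) VALUES (?)", [("alice",), ("bob",)]
    )
    return conn


def find_user(conn: sqlite3.Connection, username: str) -> list:
    """Look up a user with a parameterized query."""
    # The ? placeholder passes the value separately from the SQL text,
    # so the input is always treated as data and never parsed as SQL.
    cur = conn.execute(
        "SELECT id, username FROM users WHERE username = ?", (username,)
    )
    return cur.fetchall()
```

    With this approach a classic injection string such as ' OR '1'='1 is simply compared against the username column and matches no rows, instead of rewriting the WHERE clause.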

    Web Application Firewalls (WAFs) can further reduce the risk of SQL injections by applying blacklists to known attack patterns using regex or similar techniques. This blocks most automated scanners (bots) and provides some low hanging fruit protection against opportunist attacks. Targeted attacks, however, will probably bypass these filters.

    Finally, the presence of an ORM layer may help with SQL injection as it negates the need (or opportunity) to write actual SQL. However, this extra layer between code and database carries a CPU overhead. ORMs are also known to generate complex and unoptimized SQL queries.

    4. Cross-Site Request Forgery

    A CSRF attack forces a user’s web browser to perform an unwanted action on a site the user is authenticated on. HTML forms that don’t offer integrity validation are most at risk.

    You can prevent such vulnerabilities by applying CSRF tokens to forms that undertake authenticated actions (such as updating a user profile or a password change). All modern frameworks have settings for mitigating the risk of CSRF.
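
    Most frameworks handle this for you, but the underlying idea can be sketched in a few lines of standard-library Python. This is one possible scheme (a token derived from the session ID with an HMAC), not the method of any particular framework; SECRET_KEY and both function names are hypothetical.

```python
import hashlib
import hmac
import secrets

# Hypothetical server-side secret. A real deployment would load this from
# configuration rather than generate a new one on every start.
SECRET_KEY = secrets.token_bytes(32)


def issue_csrf_token(session_id: str) -> str:
    """Derive a token bound to the session, to embed in a hidden form field."""
    return hmac.new(SECRET_KEY, session_id.encode(), hashlib.sha256).hexdigest()


def is_valid_csrf_token(session_id: str, submitted: str) -> bool:
    """Check the submitted token using a constant-time comparison."""
    expected = issue_csrf_token(session_id)
    return hmac.compare_digest(expected.encode(), submitted.encode())
```

    A forged form posted from another site cannot include the right token, because the attacker knows neither the secret nor the victim’s session ID.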

    5. Outdated Software

    Updating your software is one of the most straightforward ways to protect your website and your users. A single unpatched vulnerability may be all it takes for an attacker to compromise your server. Make sure your WordPress stack—including themes and plugins—is entirely up to date.

    Upgrading to the latest bug-fixed versions of all your software should be a scheduled recurring activity.

    Next Steps

    In our next security dispatch we will look at some of the top API security checks you need to be aware of. You should also read the other articles from our security month, including the API security holes you should be considering, and how to secure your servers.

  2. Using MongoDB as a Time Series Database


    We’ve used MongoDB as a time series database since 2009.

    MongoDB helps us scale for the expanding volumes of data we collect in our server monitoring service. Over the years, we went from processing 40GB per month, to more than 250TB.

    By the way, while it has served us well over the years, I’m not necessarily advocating MongoDB as the best possible database for time-series data. It’s what we’ve used so far (and we’re always evaluating alternatives).

    On that basis, I’ve written a few times about how we use MongoDB (here is a recent look at the tech behind our time series graphs). As part of our upcoming Story of the Payload campaign (stay tuned!) I thought I’d revisit our cluster setup with a detailed look at its inner workings.

    The hardware

    Over the years, we’ve experimented with a range of infrastructure choices. Those include Google Compute Engine, AWS, SSDs versus spinning disks, VMware, and our transition from managed cloud to SoftLayer where we are today. We standardize on Ubuntu Linux and the cluster is configured as follows:

    Ubuntu Linux 12.04 LTS

    We run every server on an LTS release and upgrade to the next LTS on a fixed schedule. We can speed up specific upgrades if our team needs access to newer features or bundled libraries.

    Bare metal servers

    We experimented with VMs in the past and found host contention to be an issue, even with guaranteed disk I/O performance (products like AWS EBS or Compute Engine SSDs offer this option, unlike SoftLayer, which doesn’t).

    Solid State Disks

    We have multiple SSDs, and house each database—including the journal—on its own disk.

    Everything is managed with Puppet

    We used to write our own manifests. The official Forge MongoDB module has since become a better option so we are migrating to it.

    As for the servers themselves, they have the following specs:

    • x2 2GHz Intel Xeon-SandyBridge (E5-2650-OctoCore) (16 cores total)
    • x16 16GB Kingston DDR3 (256GB total)
    • x1 100GB Micron RealSSD P300 (for the MongoDB journal)
    • x2 800GB Intel S3700 Series (one per database)

    The MongoDB cluster

    Our current environment has been in use for 18 months. During this time we scaled both vertically (adding more RAM) and horizontally (adding more shards). Here are some details and specs:

    • x3 data node replica sets plus 1 arbiter per shard.
    • x2 nodes in the primary data centre at Washington DC, and a failover node at San Jose, CA. The arbiter is housed in a third data centre in Dallas, TX.
    • x5 shards with distribution based on item ID (server, for example) and metric. This splits up a customer’s record across multiple shards for maximum availability. MongoDB deals with balancing using hash-based sharding.
    • The average workload is around 6000 writes/sec which equates to about 500,000,000 new documents per day.
    • We use the MongoDB Cloud Backup service which offers real-time offsite backups. It acts as a replica node for each replica set. It receives a (compacted and compressed) copy of every write operation. Current throughput sits at a sustained 42 Mbps.
    • We use the Google Compute Engine and MongoDB Cloud Backup service API to restore our backups and verify them against our production cluster, twice per day.
    • We keep a copy of the backup in Google’s Cloud Storage in Europe as a final disaster recovery option. We store copies twice per day, going back for 10 days.

    The Data

    The write workload of the cluster consists of inserts and updates, for the most part.

    For the lowest granularity level of data, we use an append-only schema where new data is inserted and never updated. These writes take approx 2-3ms.

    For the hourly average type of metrics (we keep those forever – check out our monitoring graphs) we allocate a document per day, per metric, per item. This document gets updated with the sum and count. From that, we calculate the mean average when we query the data. These writes typically complete within 500 ms.

    We optimise for writes in place, use field modifiers and avoid growing documents by pre-allocation. Even so, there is a large overhead associated with updating documents.
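
    The daily document pattern described above can be sketched as follows. This is an illustration under stated assumptions, not our production schema: the field names (item, metric, day, hours) and all function names are hypothetical, and apply_inc only simulates on a local dict what MongoDB’s $inc field modifier does on the server.

```python
from datetime import datetime, timezone


def new_day_document(item_id: str, metric: str, day: datetime) -> dict:
    """Pre-allocate all 24 hourly slots so later updates never grow the document."""
    return {
        "item": item_id,
        "metric": metric,
        "day": day.strftime("%Y-%m-%d"),
        "hours": {str(h): {"sum": 0.0, "count": 0} for h in range(24)},
    }


def hourly_update(item_id: str, metric: str, ts: datetime, value: float) -> tuple:
    """Build a (filter, update) pair that increments one hour's sum and count."""
    query = {"item": item_id, "metric": metric, "day": ts.strftime("%Y-%m-%d")}
    prefix = f"hours.{ts.hour}"
    # $inc is MongoDB's in-place increment field modifier.
    update = {"$inc": {f"{prefix}.sum": value, f"{prefix}.count": 1}}
    return query, update


def apply_inc(doc: dict, update: dict) -> None:
    """Apply a $inc update to a local dict (a stand-in for the database)."""
    for path, amount in update["$inc"].items():
        *parents, leaf = path.split(".")
        target = doc
        for key in parents:
            target = target[key]
        target[leaf] += amount


def hourly_mean(doc: dict, hour: int):
    """Compute the mean at query time; None if the hour has no samples."""
    slot = doc["hours"][str(hour)]
    return slot["sum"] / slot["count"] if slot["count"] else None
```

    With a driver such as PyMongo, the pair returned by hourly_update would typically be passed to a collection’s update_one method. Storing only the sum and count keeps each write a fixed-size, in-place change, and the mean is derived when the graph is drawn.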

    When querying the data (drawing graphs for example) the average response time as experienced by the user is 189 ms. Median response is 39 ms, the 95th percentile is 532 ms, and 99th percentile is 1400 ms. The majority of that time is used by application code as it constructs a response for the API. The slowest of queries are the ones comprising multiple items and metrics over a wide time range. If we were to exclude our application code, the average MongoDB query time is 0.0067ms.


    So that’s the MongoDB cluster setup we have here at Server Density. We would like to hear from you now. How do you use MongoDB? What does your cluster look like, and how has it evolved over time?

  3. DevOps On-Call: How we Handle our Schedules


    I was on call 24/7/365.

    That’s right. When we first launched our server monitoring service in 2009, it was just me that knew how the systems worked. I was on-call all the time.

    You know the drill. Code with one hand. Respond to production alerts with the other.

    In time, and as more engineers joined the team, we became a bit more deliberate about how we handle production incidents. This post outlines the thinking behind our current escalation process and some details on how we distribute our on-call workload between team members.

    DevOps On-Call

    Developing our product and responding to production alerts when the product falters are two distinct yet very much intertwined activities. And so they should be.

    If we were to insulate our development efforts from how things perform in production, our priorities would get disconnected. Code resilience is at least as important as building new features. In that respect, it makes great sense to expose our developers and designers to production issues. As proponents of DevOps we think this is a good idea.

    Wait, what?

    But isn’t this counterproductive? The state of mind involved in writing code couldn’t be more different from that of responding to production alerts.

    When an engineer (and anyone for that matter) is working on a complex task the worst thing you can do is expose them to random alerts. It takes more than 15 minutes, on average, to regain intense focus after being interrupted.

    How we do it

    It takes consideration and significant planning to get this productivity balance right. In fact, it’s an ongoing journey. Especially in small and growing teams like ours.

    With that in mind, here is our current process for dealing with production alerts:

    First Tier

    During work hours, all alerts go to an operations engineer. This provides a much needed quiet time for our product team to do what they do best.

    Outside work hours, alerts could go to anyone (ops or product alike). We rotate between team members every seven days. At our current size, each engineer gets one week on call and eight weeks off. Everyone gets a fair crack of the whip.
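
    A weekly round-robin like this is simple to compute. The sketch below assumes a nine-person list (which is what one week on and eight weeks off implies) and a hypothetical rotation start date; in practice the schedule lives in PagerDuty rather than in code like this.

```python
from datetime import date


def on_call_for(day: date, engineers: list, rotation_start: date) -> str:
    """Return who is on call on `day` in a seven-day round-robin rotation."""
    weeks_elapsed = (day - rotation_start).days // 7
    # Wrap around the list so the rotation repeats indefinitely.
    return engineers[weeks_elapsed % len(engineers)]
```

    With nine names in the list, the same engineer comes up again exactly nine weeks after their previous shift started.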

    Second Tier

    An escalated issue will most probably involve production circumstances unrelated to code. For that reason, second level on-call duty rotates between operations engineers only, as they have a deeper knowledge of our hardware and network infrastructure.

    Our PagerDuty Setup

    For escalations and scheduling we use PagerDuty. If an engineer doesn’t respond within the time limit (15 minutes of increasingly frequent notifications via SMS and phone) there will always be someone else available to take the call.

    Our ops engineers are responsible for dealing with any manual schedule overrides. If someone is ill, on holiday or is traveling then we ask for volunteers to rearrange the on-call slots.


    After an out-of-hours alert the responder gets the following 24 hours off from on-call. This helps with the social/health implications of being woken up multiple nights in a row.

    Keeping Everyone in the Loop

    We hold weekly meetings with operations and product leads. The intention is to keep everyone on top of the overall product health, and to help our product team prioritise their development efforts.

    While I don’t have on-call duties anymore (client meetings and frequent travel just don’t mix with being on call) I can still monitor our alerts on the Server Density mobile app (Android and iOS) which has a prominent place on my home screen together with apps from the other monitoring tools we use.

    What about you? How do you handle your devops on-call schedules?

  4. Escaping Rabbit Holes with Rubber Duck Debugging and more

    In 1947, a team of Harvard University researchers discovered a moth stuck in a relay of their Mark II computer. A tiny bug was blocking the operation of a supercomputer.

    Theories abound as to where the term “debugging” originates from. Humans have been debugging since the beginning of time. At its core, debugging is problem solving. Solving the problem of catching fish, for example, when the climate changed and freshwater streams froze. Or figuring out how to rescue the Apollo 13 crew using near-zero electricity.

    It’s not easy

    Debugging, i.e. problem solving, is the most complex of all intellectual functions. Yes, humans have been doing it for a long time but that doesn’t mean we’re good at it, or that we enjoy doing it.

    In fact we often avoid it.

    Problem solving doesn’t come free. The human brain represents no less than 20% of total body energy expenditure. And since humans are hardwired for survival and energy conservation, we tend to relegate thinking to a last resort activity.

    Getting stuck in a rabbit hole

    Even when we think we’re problem solving, chances are we’re not. We naturally default to automatic activities that don’t require actual problem-solving, and don’t overtax our grey matter in any way (see functional fixedness).

    In other words it’s easier to dig than to think. Digging is a manual, repetitive activity. Do enough of that without thinking and we find ourselves in a hole that gets deeper and deeper. The deeper it gets, the more invested we become.

    Rabbit hole activity is when we spend a disproportionate amount of time on a task. The importance of what we’re doing (our sunk costs) is then warped and our logic distorted.

    Last week we poured several development hours down such a rabbit hole, trying to fix a wait/repeat issue with one of the alerts of our monitoring software.

    We started troubleshooting the core logic of our code, assuming there was something wrong with internal state migrations. The actual bug turned out to be elsewhere (the UI was updating a field that it should not update). So our initial hunch turned out to be misplaced. There is nothing wrong with that.

    What is not ideal is the amount of time we spent under the wrong assumption. The misapplication of our resources began when we stopped questioning our hypothesis, i.e. when we invested ourselves in it.

    We’re not impervious to getting stuck. We do however have systems in place to help us deal with it when it happens. Here are five of them:

    1. Restore Flow – Context Switch

    We’ve all felt it. The feeling of being in the right place. When you can’t type fast enough. When inspiration “flows” and you forget about time.

    When the opposite happens, i.e. when we’re stuck, it’s because the solution or breakthrough we’re after is not there, spatially or mentally. We’re going nowhere. We’re not moving.

    When that happens it’s often best to get up and go. A walk in nearby Chiswick Commons often does the trick for us. We leave our thought patterns behind and let our brain roam farther.

    Context switching helps. Every 6 weeks everyone stops scheduled work and spends a whole week working on a side project of their choice. This purposeful distraction is all about leaving our problems for a period of time. The difficulty has often evaporated by the time we’re back.

    Finally, we’d be remiss not to mention sleep, even if it sounds obvious. Sleep not only rejuvenates our brain but it also causes what neuroscientists call the incubation effect. It’s almost as if our brain debugs for us while we sleep.

    2. Explain the problem

    “Simplicity is the ultimate sophistication.”

    Leonardo Da Vinci

    When we explain things to people we tend to slow down. Why? Because the person we’re talking to is removed from our situation. They haven’t caught up yet.

    For them to understand our problem, we set it forth in the simplest possible terms: What is it we’re trying to solve? We state our qualifiers: Why are we spending time on this problem? Why is it important? We also tell them what we’ve tried so far.

    If it sounds like hard work it’s because it is. The rigour behind good questions is often enough for solutions to magically present themselves. Articulating good questions pushes our brain into problem solving mode.

    And that paves the way to the next technique. . .

    3. Rubber Duck Debugging – Ask the Duck!

    This infamous rubber duck came to life in 1999 along with the publication of The Pragmatic Programmer.

    Rubber ducks are cute. Problem is, they’re not very smart. In order for them to understand the nature of the bug we’re dealing with, our explanation needs to be extra thorough. As with the previous method, we need to slow down and simplify things for them.

    Staring into the guileless smiling innocence of our duck, we often find ourselves wondering: is there an easier way? Does this have to be done at all?

    This is one of our rubber ducks, here at Server Density.

    Our very own rubber duck

    4. Peer Reviews and Pair Programming

    There are times when talking to an inanimate object is not feasible (open plan office?). Or the solution to the problem might lie outside our domain or scope.

    Nothing happens in isolation, and there is something to be said about teamwork. Having a coding buddy can alleviate some of the horrors of getting stuck. Heaven forbid, it might even make debugging fun.

    Peer reviews (many eyes on the code) are a great way to work with fellow developers and make sure your code is bug free. They’re also a nice way to learn and develop professionally.

    5. Plan Ahead

    Decide how much individual effort (time) you intend to invest in debugging a particular issue, before you move on to something else.

    Sometimes we’re dealing with special types of bugs. Like the elusive heisenbugs, or fractal bugs that point to ever more bugs. It’s easy for a bug to turn into a productivity black hole.

    When you reach the end of the allotted time, it’s best to move on and tackle something else. Don’t let one task (bug) swallow other priorities and jeopardise the progress of your project.

    Once you’ve context-switched, worked on something else, and had a break, you’re better equipped to revisit the bug and determine your next steps (and priorities) with a clear mind.

    By the way, it’s worth mentioning debugging tools here. Things like central logging, error monitoring et cetera, can make a huge difference to how quickly you solve a bug. We will discuss this topic in a future post.


    Debugging efforts often go awry and we find ourselves lost in productivity rabbit holes. It happens when our mind trades hard problem solving for easier, repetitive activities that lead nowhere.

    Learn to spot when you’re getting stuck. Know the signs and get better at climbing out of those rabbit holes. Context switching, slowing down, rubber duck debugging, and planning ahead are proven methods that get you back on course.

  5. Building Your own Chaos Monkey


    In 2012 Netflix introduced one of the coolest sounding names into the Cloud vernacular.

    What Chaos Monkey does is simple. It runs on Amazon Web Services and its sole purpose is to wipe out production instances in a random manner.

    The rationale behind those deliberate failures is a solid one.

    Setting Chaos Monkey loose on your infrastructure—and dealing with the aftermath—helps strengthen your app. As you recover, learn and improve on a regular basis, you’re better equipped to face real failures without significant, if any, customer impact.

    Our monkey

    Since we don’t use AWS or Java, we decided to build our own lightweight simian in the form of a simple Python script. The end-result is the same. We set it loose on our systems and watch as it randomly seeks and destroys production instances.

    What follows are our observations from those self-inflicted incidents, followed by some notes on what to consider when using a Chaos Monkey on your infrastructure.

    Design Considerations

    1. Trigger chaos events during business hours

    It’s never nice to wake up your engineers with unnecessary on-call events in the middle of the night. Real failures can and do happen 24/7. When it comes to Chaos Monkey, however, it’s best to trigger failures when people are around to respond and fix them.

    2. Decide how much mystery you want

    When our Python script triggers a chaos event, we get a message in our HipChat room and everyone is on the lookout for strange things.

    The message doesn’t specify what the failure is. We still need to triage the alerts and determine where the failures lie, just as we would in the event of a real outage. All this “soft” warning does is lessen the chance of failures going unnoticed.

    3. Have several failure modes

    Killing instances is a good way to simulate failures but it doesn’t cover all possible contingencies. At Server Density we use the SoftLayer API to trigger full and partial failures alike.

    A server power-down, for example, causes a full failure. Disabling networking interfaces, on the other hand, causes partial failures where the host may continue to run (and perhaps even send reports to our monitoring service).

    4. Don’t trigger sequential events

    If there’s ever a bad time to set your Chaos Monkey loose, that’s during the aftermath of a previous chaos event. Especially if the bugs you discovered are yet to be fixed.

    We recommend you wait a few hours before introducing the next failure. Unless you want your team firefighting all day long.

    5. Play around with event probability

    Real world incidents have a tendency to transpire when you least expect them. So should your chaos events. Make them infrequent. Make them random. Space them out, by days even. That’s the best way to test your on-call readiness.
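
    Taken together, the five considerations above can be expressed as a small decision function. This is a sketch of the idea, not our actual script: every constant and name below is a hypothetical placeholder, and the step that really triggers the failure (for example through your hosting provider’s API) is deliberately left out.

```python
import random
from datetime import datetime, timedelta

# Hypothetical settings; tune them to your own team and infrastructure.
BUSINESS_HOURS = range(9, 17)   # 09:00 to 16:59 local time
MIN_GAP = timedelta(hours=4)    # quiet period after the previous event
EVENT_PROBABILITY = 0.05        # chance of chaos on each scheduled check
FAILURE_MODES = ["power_off", "disable_network_interface"]


def pick_chaos_event(now: datetime, last_event, rng: random.Random):
    """Return a failure mode to trigger now, or None to do nothing."""
    if now.weekday() >= 5 or now.hour not in BUSINESS_HOURS:
        return None  # only when people are around to respond
    if last_event is not None and now - last_event < MIN_GAP:
        return None  # no back-to-back events
    if rng.random() >= EVENT_PROBABILITY:
        return None  # keep events infrequent and unpredictable
    return rng.choice(FAILURE_MODES)  # full or partial failure
```

    A scheduler could call pick_chaos_event(datetime.now(), last_event, random.Random()) every few minutes. The chat notification would be sent only when a mode is returned, and without naming which mode was chosen.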

    Initial findings

    We’ve been triggering chaos events for some time now. None of the issues we’ve discovered so far were caused by the server software. In fact, scenarios like failovers in load balancers (Nginx) and databases (MongoDB) worked very well.

    Every single bug we found was in our own code. Most had to do with how our app interacts with databases in failover mode, and with libraries we’ve not written.

    In our most recent Chaos run we experienced some inexplicable performance delays during two consecutive MongoDB server failovers. Rebooting the servers was not a viable long term fix as it results in a long downtime (>5 minutes).

    It took us several days of investigation until we realised we were not invoking the MongoDB drivers properly.

    The app delays caused by the Chaos run happened during work hours. We were able to look at the issue immediately, rather than wait until an on-call engineer gets notified and is able to respond, in which case the investigation would’ve been harder.

    Such discoveries help us report bugs and improve the resiliency of our software. Of course, it also means additional engineering hours and effort to get things right.


    The Chaos Monkey is an excellent tool to test how your infrastructure behaves under unknown failure conditions. By triggering and dealing with random system failures, you help your product and service harden up and become resilient. This has obvious benefits to your uptime metrics and overall quality of service.

    And if the whole exercise has such a cool name attached to it, then all the better.

    Editor’s note: This post was originally published on 21st November, 2013 and has been completely revamped for accuracy and comprehensiveness.


  6. 10 Ways to Secure Your Webapp


    While there is no such thing as 100% secure, you can take specific measures to mitigate a wide range of attacks and secure your webapp as much as possible.

    In this post we discuss some of the steps we’ve taken as part of our efforts to secure our server monitoring tool.

    1. Cover the Basics

    Before considering any of the suggestions listed here, make sure you’ve covered the basics. Those include industry best practices like input filtering, secure session handling, and protecting against SQL injection and XSRF attacks.

    Also check out the OWASP cheat sheets and top 10 lists to ensure you’re covered.

    2. Use SSL only

    When we launched Server Density in 2009, we offered HTTPS for monitoring agent postbacks but didn’t go as far as to block standard HTTP altogether.

    Later on, when we made the switch to HTTPS-only, the change was nowhere near as onerous as we thought it would be.

    SSL is often viewed as a performance bottleneck but that isn’t really true. In most situations, we see no reason not to force SSL for all connections right from the start.

    Server Density v2 uses a new URL. As part of this, we can force SSL for new agent deployments and access to the web UI alike. We still support the old domain endpoint under non-SSL but will eventually be retiring it.

    To get an excellent report on how good your implementation is, run your URL against the Qualys SSL server test. Here is ours:

    SSL scan for our webapp

    3. Support SSL with Perfect Forward Secrecy

    Every connection to an SSL URL is encrypted using a single private key. If someone obtains that key they can decrypt and access the traffic of that URL.

    Perfect forward secrecy addresses this risk by negotiating a new key with every session. A compromise of one key would therefore only affect the data in that one session.

    To do this, you need to allow certain cipher suites in your web server configuration.

    ECDHE-RSA-AES128-SHA:AES128-SHA:RC4-SHA is compatible with most browsers (for more background and implementation details check out this post).
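
    It is worth noting which suites in a cipher string actually provide forward secrecy. The helper below is a simple name-based heuristic for OpenSSL-style suite names (an assumption; it is not an official API and does not cover every naming variant): suites whose key exchange is ephemeral Diffie-Hellman start with ECDHE- or DHE-.

```python
# Key-exchange prefixes that indicate ephemeral Diffie-Hellman
# in OpenSSL-style cipher suite names.
FORWARD_SECRET_PREFIXES = ("ECDHE-", "DHE-")


def split_by_forward_secrecy(cipher_string: str) -> tuple:
    """Split a colon-separated cipher list into (forward-secret, other) suites."""
    suites = [s for s in cipher_string.split(":") if s]
    forward_secret = [s for s in suites if s.startswith(FORWARD_SECRET_PREFIXES)]
    other = [s for s in suites if not s.startswith(FORWARD_SECRET_PREFIXES)]
    return forward_secret, other
```

    Applied to the string above, only ECDHE-RSA-AES128-SHA lands in the first list. The other two suites use plain RSA key exchange and act as fallbacks for clients that cannot negotiate an ephemeral key.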

    We terminate SSL at our nginx load balancers and implement SSL using these settings:

    You can easily tell if you’re connected using perfect forward secrecy. In Chrome, just click on the lock icon preceding the URL and look for ECDHE_RSA under the Connection tab:

    TLS security

    4. Use Strict Transport Security

    Forcing SSL should be combined with HTTP Strict Transport Security. Otherwise you run a risk of users entering your domain without specifying a protocol.

    For example, typing the address with http:// (or with no protocol at all) rather than https://, and then being redirected to HTTPS. This redirect opens a security hole because there’s a short time when communication is still over HTTP.

    You can address this by sending an STS header with your response. The header carries a time setting that the browser stores; until it expires, the browser performs the HTTP to HTTPS conversion itself, without issuing an insecure request at all.
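
    As a rough sketch of what such a header looks like, the helper below builds the Strict-Transport-Security name and value. The function name and its parameters are hypothetical; in practice the header is usually set in the web server or load balancer configuration rather than in application code.

```python
def hsts_header(max_age_days: int, include_subdomains: bool = True) -> tuple:
    """Build the Strict-Transport-Security header as a (name, value) pair."""
    # max-age is expressed in seconds.
    value = f"max-age={max_age_days * 24 * 60 * 60}"
    if include_subdomains:
        value += "; includeSubDomains"
    return "Strict-Transport-Security", value
```

    For a ten-year lifetime covering all subdomains, hsts_header(3650) produces the value max-age=315360000; includeSubDomains.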

    Our header is set for 10 years and includes all subdomains because each account gets their own URL, for example:

    5. Submit STS Settings to Browser Vendors

    Even with STS headers in place there’s still a potential hole, because those headers are only sent after the first request.

    One way to address this is by submitting your URL to browser vendors so they can force the browser to only ever access your URL over SSL.

    You can read more about how this works and submit your URL for inclusion in Chrome. Firefox seeds from the Chrome list.

    6. Enforce a Content Security Policy

    Of the top 10 most common security vulnerabilities, cross site scripting (XSS) is number 3. This is where remote code is injected and executed on your site, usually through incorrect (or non-existing) filtering.

    A good way to combat this is to whitelist the specific remote resources you want to allow. If a script URL is not matched by this list then browsers will block it.

    It’s much easier to implement this on a new product because you can start out by blocking everything. You then open specific URLs as and when you add functionality.

    Using browser developer tools you can easily see which remote hosts are being called. The CSP we use is:

    We have to specifically allow unsafe-eval here, as a number of third party libraries require this. You might not use any third party libraries, or the libraries you do use may not require unsafe-eval, in which case you should not allow it.

    script-src is a directive that controls a set of script-related privileges for a specific page. For more information on connect-src, script-src and frame-src this is a good introduction on CSP.
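
    To show how directives and source lists combine into a single header value, here is a minimal sketch. The policy below is a made-up example and not the policy we use: the CDN host is a placeholder, and build_csp is a hypothetical helper.

```python
def build_csp(directives: dict) -> str:
    """Serialise {directive: [sources]} into a Content-Security-Policy value."""
    return "; ".join(
        f"{name} {' '.join(sources)}" for name, sources in directives.items()
    )


# Hypothetical policy: start by allowing only your own origin, then open
# specific hosts as and when you add functionality.
EXAMPLE_POLICY = {
    "default-src": ["'self'"],
    "script-src": ["'self'", "https://cdn.example.com", "'unsafe-eval'"],
    "connect-src": ["'self'"],
    "frame-src": ["'none'"],
}
```

    Calling build_csp(EXAMPLE_POLICY) gives the string to send as the header value. Keeping the policy as a data structure makes each newly whitelisted host an explicit, reviewable change.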

    Be careful with wildcarding on domains which can have any content hosted on them. For example wildcarding * would allow anyone to host any script. This is Amazon’s CDN which everyone can upload files to!

    Also note that Content-Security-Policy is the standard header but Firefox and IE only support X-Content-Security-Policy. See the OWASP documentation for more information about the header names and directives.

    7. Enable HTTP security headers

    You can enable some additional security features in certain browsers by setting the appropriate response headers. While not widely supported, they are still worth considering:

    8. Set up passwords, “remember me” and login resets properly

    This is the main gateway to your webapp, so make sure you implement all stages of logging-in properly. It only takes a short amount of time to research and design a secure process:

    • Registration and login should use salting and cryptographic functions (such as bcrypt) to store passwords, not plain text or MD5 hashing.
    • Password reset should use an out-of-band method to trigger resets, for example: requiring a username then emailing a one-time, expiring link to the on-record email address where the user can then choose a new password. Here is more guidance and a checklist.
    • “Remember me” functionality should use secure tokens to recognise the user, rather than storing their credentials in cookies.
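
    The first point can be sketched with the Python standard library alone. Note the substitution: this example uses PBKDF2 (hashlib.pbkdf2_hmac) rather than bcrypt, because bcrypt requires a third-party package. The iteration count and function names are assumptions for illustration.

```python
import hashlib
import hmac
import secrets

# Assumption: pick an iteration count that suits your hardware and
# acceptable login latency.
ITERATIONS = 200_000


def hash_password(password: str) -> tuple:
    """Return (salt, derived_key) for storage; never store the password itself."""
    salt = secrets.token_bytes(16)  # unique random salt per user
    key = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return salt, key


def verify_password(password: str, salt: bytes, stored_key: bytes) -> bool:
    """Re-derive the key from the login attempt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, ITERATIONS)
    return hmac.compare_digest(candidate, stored_key)
```

    The per-user salt means two users with the same password end up with different stored values, and the deliberately slow derivation makes offline guessing far more expensive than with a single MD5 pass.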

    You can review your authentication process against this OWASP cheat sheet.

    9. Offer Multi Factor Authentication

    If your webapp is anything more than a trivial consumer product, you should implement—and encourage your users to use—multi factor authentication.

    This requires them to authenticate using something they carry with them (token), before they can log in. An attacker would therefore need both this token (phone, RSA SecurID etc) and user credentials before they obtain access.

    We use the Google Authenticator standard because authentication apps are available for all platforms, and libraries exist for pretty much every language.
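    Google Authenticator codes follow the time-based one-time password algorithm published as RFC 6238, which builds on the counter-based scheme in RFC 4226. As a rough sketch of what happens on the server side, the following uses only the Python standard library. The function names and the one-step drift allowance are assumptions for illustration; in practice you would normally reach for an existing, well-tested library.

```python
import base64
import hashlib
import hmac
import struct
import time
from typing import Optional


def hotp(secret: bytes, counter: int, digits: int = 6) -> str:
    """HMAC-based one-time password (RFC 4226) for a given counter value."""
    mac = hmac.new(secret, struct.pack(">Q", counter), hashlib.sha1).digest()
    offset = mac[-1] & 0x0F  # dynamic truncation: low nibble picks the offset
    code = struct.unpack(">I", mac[offset:offset + 4])[0] & 0x7FFFFFFF
    return str(code % 10 ** digits).zfill(digits)


def totp(secret_b32: str, at: Optional[float] = None, step: int = 30) -> str:
    """Time-based one-time password (RFC 6238) from a base32 shared secret."""
    secret = base64.b32decode(secret_b32.upper())
    now = time.time() if at is None else at
    return hotp(secret, int(now // step))


def verify_totp(secret_b32: str, submitted: str, at: Optional[float] = None) -> bool:
    """Accept the current 30-second window and its neighbours to allow clock drift."""
    now = time.time() if at is None else at
    return any(
        hmac.compare_digest(
            totp(secret_b32, now + drift * 30).encode("ascii"),
            submitted.encode("utf-8"),
        )
        for drift in (-1, 0, 1)
    )
```

    With the RFC test secret (the ASCII string 12345678901234567890, base32-encoded) and a timestamp of 59 seconds, totp returns 287082, which matches the last six digits of the published test vector.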

    It is quite onerous to install a custom, proprietary MFA app so we don’t recommend you implement your own system.

    Be sure to re-authenticate for things like adding/removing MFA tokens. We require re-authentication for all user profile changes.

    We do however have a timeout in place during which users won’t have to re-authenticate. This timeout applies for simple actions like changing passwords (adding or removing tokens requires authentication even during the timeout).
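    The re-authentication rule described above could be expressed roughly as follows. The window length and the action names are hypothetical; the article does not state the real values.

```python
import time
from typing import Optional

# Hypothetical values for illustration only.
REAUTH_WINDOW_SECONDS = 15 * 60
ALWAYS_REAUTH = {"add_mfa_token", "remove_mfa_token"}


def needs_reauth(action: str, last_auth_at: float, now: Optional[float] = None) -> bool:
    """Decide whether the user must re-enter credentials before `action`."""
    if action in ALWAYS_REAUTH:
        return True  # token changes are never covered by the timeout
    now = time.time() if now is None else now
    return (now - last_auth_at) > REAUTH_WINDOW_SECONDS
```

    Keeping the always-re-authenticate actions in an explicit set makes the exception to the timeout easy to audit.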

    To sum up, MFA is crucial for any serious application as it’s the only way to protect against account hijacking.

    10. Schedule Security Audits

    We inspect security as part of our code review and deployment process (many eyes on the code). We also have regular reviews from external security consultants.

    We recommend having one firm do an audit, implementing their fixes, and then having another firm audit those changes.


    Security is all about identifying and mitigating possible risks of attack. The operative word here is mitigation, since new threats are always emerging.

    This is an ongoing exercise. Be sure to conduct regular reviews of all existing measures, check for new defence mechanisms and keep abreast of security announcements.

  7. Is Security a Growth Catalyst for DevOps?


    Security comes from the Latin root sēcūrus. It means free from care. Some adjectives associated with this word are untroubled, fearless, and composed.

    Security provides a safe space for humans to stretch their imagination and be as creative as they can. It allows for growth.

    It also allows for focus. For small companies like ours, security unfetters our potential to improve our product and serve our customers.

    Good security is not an add-on, a feature or a separate effort. It is an essential building block of our work. And that should be reflected in everything we do, including our people, our infrastructure, our technologies and our product.

    Let’s start with people.

    The Role of People

    “If you think technology can solve your security problems, then you don’t understand the problems and you don’t understand the technology.”

    Bruce Schneier.

    All fourteen collisions with Google’s self driving cars were caused by human error, according to Google. The drivers involved in those accidents were all distracted. It turns out that humans are the weakest link when it comes to safe systems.

    There are a number of ways we approach (and mitigate) this risk. To begin with, we try and have as many “eyes on the code” as possible.

    As part of our code review and deployment process we test each other’s code and try to break it. We are a small and tightly knit team, which is great. But we don’t know it all.

    To reduce the risk of blind spots and confirmation bias (we are only human!), we work with independent security consultants who inspect our product (and code) on a regular basis.

    Another resource we are looking into (but haven’t leveraged yet) is the specialised skillsets of the crowd. There are some compelling platforms for bug bounty and bug reporting out there. Large companies, like Google and Tesla, and smaller ones, like LastPass and Drupal, have used these for a while.

    Now let’s turn our attention to technology, and how we can secure it.

    Multi Factor Authentication

    Multi Factor Authentication (MFA) requires the user to authenticate using something they physically have with them before they can log in. It’s the only way to protect against account hijacking.

    We use MFA internally as much as we can. For example, we enforce Google Authenticator for Gmail, Google Drive and all our Google Apps.

    We also encourage all our customers to activate MFA for their Server Density account.


    Our computers are full-disk encrypted (we use FileVault, PGP Full Disk Encryption or Espionage, depending on the OS). We also encrypt some of our email communications with GnuPG, one of the tools that Edward Snowden used to protect his communications about the NSA.

    Up to Date Software

    We make sure we are always running the latest bug-fixed versions of all the software we use. This includes web browsers, messaging clients, OS components and the OS itself.

    Web Browser

    We like Google Chrome for its tight integration with Google Apps but also for its auto-update feature which keeps the browser secure.

    We are not big on browser add-ons. Click-to-play is an exception as it helps us prevent browser plugin vulnerabilities (Flash and Java in particular). We also use this Chrome extension to protect against phishing on our Google accounts.

    We also recommend Fluffify, our very own Chrome extension. It won’t make you any more secure, but it will keep you sane.


    The second law of thermodynamics states that entropy always increases with time. When it comes to guessing passwords however, time always increases with entropy.

    Password entropy is a measurement of how unpredictable a password is.

    Our passwords are at least 20 characters long. They comprise a mix of upper and lower case letters, numbers and symbols. They are also unique for each system, which means if one system is compromised, others will not follow suit.
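    To put a number on this: for a password whose characters are chosen uniformly at random, entropy in bits is the length multiplied by log2 of the alphabet size. The sketch below assumes the 94 printable ASCII letters, digits and symbols as the alphabet. That alphabet and the function names are choices made for this example, not a description of any particular password manager.

```python
import math
import secrets
import string

# 52 letters + 10 digits + 32 punctuation characters = 94 symbols
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def entropy_bits(length: int, alphabet_size: int) -> float:
    """Entropy of a uniformly random password: length * log2(alphabet size)."""
    return length * math.log2(alphabet_size)


def generate_password(length: int = 20) -> str:
    """Draw each character independently from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))
```

    A random 20-character password over this alphabet carries roughly 131 bits of entropy, while eight random lower case letters carry under 38. The figure only holds for randomly generated passwords. Human-chosen ones are far more predictable than the formula suggests. Note that this generator does not force every character class to appear, so a site with strict composition rules may occasionally reject its output.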

    We keep offsite and easily accessible backups of all our passwords (using tools like 1Password) to allow for easy reset of all account passwords in the event of a breach.

    We never share passwords. Each of us has our very own set of credentials. This helps us deal with red-flag scenarios. Like revoking employee privileges when they leave. Or auditing who accessed a particular server or database.

    Least Privilege

    According to the principle of least privilege, every process or user should only be able to access the resources they need. User administration is a key component of our product.

    Secure Data Flows

    For Server Density to work we ask our customers to install a lightweight agent on their server. All this does is collect various system metrics and constantly report back.

    A deliberate restriction is that data can only travel one way: from the client server to ours. That rules out any possibility for remote execution.

    From that point everything is encrypted. In fact, encrypted post backs are the only option.

    We use ports that are usually already open (HTTPS port 443) which means there is no need to configure anything new. No root access required either. And because our agent is open source, our customers have full visibility of what is running at all times.
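    The general shape of such a one-way, encrypted post back can be sketched as below. This is not the actual agent code or protocol: the endpoint URL, the payload fields and the function names are all hypothetical, chosen only to show an outbound-only HTTPS POST on port 443.

```python
import json
import urllib.request

# Hypothetical endpoint; not a real Server Density URL.
DEFAULT_URL = "https://metrics.example.com/postback"


def build_postback(agent_key: str, metrics: dict, url: str = DEFAULT_URL) -> urllib.request.Request:
    """Build an outbound HTTPS POST carrying one metrics snapshot."""
    if not url.startswith("https://"):
        raise ValueError("refusing to send metrics over an unencrypted channel")
    body = json.dumps({"agentKey": agent_key, "metrics": metrics}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 10.0) -> int:
    """Send the snapshot and return the HTTP status; the response body is ignored."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

    Because the client only ever initiates requests and never acts on anything in the response, there is no channel through which the receiving side could push commands back to the monitored server.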


    Amateurs hack systems, professionals hack people.

    Bruce Schneier

    We don’t think security is a mere feature, and it shouldn’t be treated as such. At its best, security is an essential building block of the product, the team, and everything a company does.

    From sending data and provisioning access to their systems to storing internal passwords, DevOps teams should take all reasonable precautions to keep confidential data safe and available.

    Having secure systems affords companies the stability and peace of mind they need to be creative, grow, and serve their customers.

    What about you? What industry best practices do you follow?

  8. How and why we use DevOps checklists


    In his book The Checklist Manifesto, Atul Gawande tells the story of the first pre-flight checklist, created by Boeing following the fatal crash of a B-17 in 1935.

    According to the investigation, the pilots forgot to disengage a critical wing adjustment mechanism before take off. This was evidence that even veteran pilots could miss key steps or do things in the wrong order. With hundreds of lives at stake it was necessary to design around this constraint.

    The checklist does exactly that. It compensates for the “limits of human memory and attention.”

    Indeed, Gawande — a doctor himself — writes how key steps in medical procedures were routinely missed, resulting in infections and preventable fatalities. The adoption of checklists reduced those occurrences, and they are now used in a wide range of healthcare settings.

    Checklists for DevOps

    Not unlike healthcare and aviation, sysadmins are often tasked with systems that touch many lives. Here at Server Density we appreciate the complexities of the systems we run. We also recognise the limits of the people who run them — us. That’s why we use checklists for much of what we do.

    Checklists are particularly effective in situations where there is:


    There is only so much that humans can remember, reproduce and execute reliably.

    Stress and Fatigue

    Incidents may happen at awkward times, like early in the morning when mistakes are more likely. Sysadmins are vulnerable to stress and fatigue like everyone else.


    You’d expect a seasoned engineer to intuitively know how to deal with a wide range of contingencies. That is a good thing. Experience and tenure, however, could also encourage people to rely on “gut instincts”, to “wing it” and “shoot from the hip.” In complex situations those attitudes could prove hazardous.

    A checklist is a good way to mitigate those problems because it helps us define our response in advance and make it available to everyone. We therefore ensure that every member of our team is taking the right steps in the right order, each and every time.

    Checklists at Server Density

    Here is one of our own checklists. It defines what our on-call first responders do when a critical incident occurs (we also wrote a guide on how we handle incidents, outages and downtime):

    Incident Response Checklist

    As “common sense” and obvious as the steps may be, they carry great importance for the health of our infrastructure. So we spell them out.

    As you would expect, we take our uptime metrics seriously. We’ve got some pretty capable folks taking care of our servers. And we use checklists. Very prescriptive ones like the one below. This one details the steps our on-call people follow when faced with a server failover:

    Load Balancer Checklist

    We sed/awk/grep all day long. Our checklists assume we do it for the very first time. At 2:00 in the morning we might have trouble finding the lights, let alone the Puppet master for our configuration.

    Here is another example. We use this checklist when a server we monitor stops sending data:

    No Data Checklist

    DevOps checklists are as unique as the teams that use them. Each team has their own recipe for doing things and as technology stacks evolve, so do the checklists required to run them.

    We aim to have a checklist for every scenario. From restoring a backup to production, deploying fixes, switching primary data centres, and database consistency checks, to responding to traffic spikes, security breaches, critical alerts, and a long list of other contingencies.

    Google Docs is a key part of our on-call playbook, so that’s where we store our checklists for everyone to access and update as needed.

    Checklists are not static

    Relying on checklists does not mean we are inflexible about how we do things. For us, creating checklists is an excellent opportunity to take a step back and review the entirety of our stack.

    DevOps checklists work best when we schedule time to update and improve on them.


    When it comes to server monitoring, we believe checklists are an important step towards reliable systems. They help our team respond to issues in a consistent and timely manner. This translates to increased uptime and a better capacity to serve our customers.

    What about you? Are you using checklists?

  9. What’s new in Server Density – July 2015


    This is our regular post to keep you up to date with the latest releases to our server monitoring product, Server Density.

    Tags as a recipient

    We launched tags several months ago to allow you to set permissions for different users, but they are the foundation for many more features we’ll be releasing over the coming months. The first of these is tags as a recipient. This allows you to have alerts delivered to all members of a tag, rather than having to set up each of your users on every alert configuration individually.

    For example, if you have an “on call” team, you can add all the users to that tag, then set the tag as the recipient for an alert. Each user can have different notification options and any changes you make will apply to all alerts the tag is a recipient for. This is particularly useful if your team changes e.g. new members or staff leaving – you only have to make the change once on the tag and it’ll apply to all alerts.

    Tags as a recipient

    This is available now on device and service level alerts. It’s not available for group level alerts because our next release will be replacing those with alerts on a tag (so servers and services can have multiple tags, with inheritance across multiple tags).

    Learn how to set these up in our support guide.
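    Conceptually, using a tag as a recipient means the alert stores the tag rather than a list of users, and membership is looked up when a notification is sent. The sketch below illustrates that idea with hypothetical class and field names. It is not the product's data model or API.

```python
from dataclasses import dataclass, field


@dataclass
class User:
    name: str
    channels: list[str]  # each user keeps their own notification options


@dataclass
class Alert:
    name: str
    recipient_tags: list[str] = field(default_factory=list)


def resolve_recipients(alert: Alert, tag_members: dict[str, list[User]]) -> list[User]:
    """Expand an alert's tags into users at notification time, without duplicates."""
    seen: set[str] = set()
    recipients: list[User] = []
    for tag in alert.recipient_tags:
        for user in tag_members.get(tag, []):
            if user.name not in seen:
                seen.add(user.name)
                recipients.append(user)
    return recipients
```

    Since alerts hold only the tag name, adding someone to the “on call” tag changes the result for every alert that references it, with no change to the alerts themselves.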

    Service monitoring error details

    We often get reports that our availability monitoring is showing “false positives” when compared to competing products. These usually turn out to be real errors that we’ve detected where others have not! To back up our claims, we have now exposed full details of any errors we see along with their history for all of your service checks.

    You can browse recent errors, search and filter by location and see errors as they are detected. This will help debug any problems we detect with our availability monitoring.

    Service monitoring errors

    Ongoing fixes

    We’re always working on improvements and fixes and often deploy code 5-10 times a day! So if you find any problems or have ideas for improvements, please get in touch so we can continue to improve.

  10. What’s new in Server Density – May 2015


    This is our regular post to keep you up to date with the latest releases to our server monitoring product, Server Density.

    Process statistics

    The main release for this month has been our in-depth process level statistics. Server Density has had the ability to alert on process existence and resource usage for some time but now that data is visualised for each server.

    Each server overview has a top processes widget that gives you a breakdown of the most intensive processes and how many running instances there are:

    Top processes

    This is also extended to the snapshot view which you can reach by clicking on a data point on any graph or from the Snapshot tab when viewing a particular server.

    Processes snapshot

    sd-agent 1.14.0

    A new version of the monitoring agent for Linux, FreeBSD and Mac has been released with a range of bug fixes. This is intended to be the final release of the v1 agent. We’ll soon be releasing sd-agent v2 which will include features such as SNMP, statsd and second by second monitoring.

    Ongoing fixes

    We’re always working on improvements and fixes and often deploy code 5-10 times a day! So if you find any problems or have ideas for improvements, please get in touch so we can continue to improve.