That’s right. When we first launched our server monitoring service in 2009, I was the only one who knew how the systems worked. I was on-call all the time.
You know the drill. Code with one hand. Respond to production alerts with the other.
In time, and as more engineers joined the team, we became a bit more deliberate about how we handle production incidents. This post outlines the thinking behind our current escalation process and some details on how we distribute our on-call workload between team members.
Developing our product and responding to production alerts when the product falters are two distinct yet very much intertwined activities. And so they should be.
If we were to insulate our development efforts from how things perform in production, our priorities would get disconnected. Code resilience is at least as important as building new features. In that respect, it makes great sense to expose our developers and designers to production issues. As proponents of DevOps we think this is a good idea.
But isn’t this counterproductive? The state of mind involved in writing code couldn’t be more different from that of responding to production alerts.
When an engineer (or anyone else, for that matter) is working on a complex task, the worst thing you can do is expose them to random alerts. It takes more than 15 minutes, on average, to regain intense focus after being interrupted.
How we do it
It takes consideration and significant planning to get this productivity balance right. In fact, it’s an ongoing journey. Especially in small and growing teams like ours.
With that in mind, here is our current process for dealing with production alerts:
During work hours, all alerts go to an operations engineer. This provides a much needed quiet time for our product team to do what they do best.
Outside work hours, alerts could go to anyone (ops or product alike). We rotate between team members every seven days. At our current size, each engineer gets one week on call and eight weeks off. Everyone gets a fair crack of the whip.
An escalated issue will most probably involve production circumstances unrelated to code. For that reason, second level on-call duty rotates between operations engineers only, as they have a deeper knowledge of our hardware and network infrastructure.
Our PagerDuty Setup
For escalations and scheduling we use PagerDuty. If an engineer doesn’t respond within the time limit (15 minutes of increasingly frequent notifications via SMS and phone) there will always be someone else available to take the call.
Our ops engineers are responsible for dealing with any manual schedule overrides. If someone is ill, on holiday or is traveling then we ask for volunteers to rearrange the on-call slots.
After an out-of-hours alert the responder gets the following 24 hours off from on-call. This helps with the social/health implications of being woken up multiple nights in a row.
Keeping Everyone in the Loop
We hold weekly meetings with operations and product leads. The intention is to keep everyone on top of the overall product health, and to help our product team prioritise their development efforts.
While I don’t have on-call duties any more (client meetings and frequent travel while on call just don’t mix), I can still monitor our alerts on the Server Density mobile app (Android and iOS), which has a prominent place on my home screen together with apps from the other monitoring tools we use.
What about you? How do you handle your devops on-call schedules?
In 1947, a team of Harvard University researchers discovered a moth stuck in a relay of their Mark II computer. A tiny bug was blocking the operation of an entire machine.
Theories abound as to where the term “debugging” originates. Humans have been debugging since the beginning of time. At its core, debugging is problem solving. Solving the problem of catching fish, for example, when the climate changed and freshwater streams froze. Or figuring out how to rescue the Apollo 13 crew using near-zero electricity.
Even when we think we’re problem solving, chances are we’re not. We naturally default to automatic activities that don’t require actual problem solving and don’t overtax our grey matter in any way (see functional fixedness).
In other words, it’s easier to dig than to think. Digging is a manual, repetitive activity. Do enough of it without thinking and we find ourselves in a hole that gets deeper and deeper. The deeper it gets, the more invested we become.
Rabbit hole activity is when we spend a disproportionate amount of time on a task. The importance of what we’re doing (our sunk costs) is then warped and our logic distorted.
Last week we poured several development hours down such a rabbit hole, trying to fix a wait/repeat issue with one of the alerts of our monitoring software.
We started troubleshooting the core logic of our code, assuming there was something wrong with internal state migrations. The actual bug turned out to be elsewhere (the UI was updating a field that it should not update). So our initial hunch turned out to be misplaced. There is nothing wrong with that.
What is not ideal is the amount of time we spent under the wrong assumption. The misapplication of our resources began when we stopped questioning our hypothesis, i.e. when we invested ourselves in it.
We’re not impervious to getting stuck. We do however have systems in place to help us deal with it when it happens. Here are five of them:
1. Restore Flow – Context Switch
We’ve all felt it. The feeling of being in the right place. When you can’t type fast enough. When inspiration “flows” and you forget about time.
When the opposite happens, i.e. when we’re stuck, it’s because the solution or breakthrough we’re after is not there, spatially or mentally. We’re going nowhere. We’re not moving.
When that happens it’s often best to get up and go. A walk in nearby Chiswick Commons often does the trick for us. We leave our thought patterns behind and let our brain roam farther.
Finally, we’d be remiss not to mention sleep, even if it sounds obvious. Sleep not only rejuvenates our brain but also causes what neuroscientists call the incubation effect. It’s almost as if our brain debugs for us while we sleep.
2. Explain the problem
“Simplicity is the ultimate sophistication.”
Leonardo Da Vinci
When we explain things to people we tend to slow down. Why? Because the person we’re talking to is removed from our situation. They haven’t caught up yet.
For them to understand our problem, we set it forth in the simplest possible terms: What is it we’re trying to solve? We state our qualifiers: Why are we spending time on this problem? Why is it important? We also tell them what we’ve tried so far.
If it sounds like hard work it’s because it is. The rigour behind good questions is often enough for solutions to magically present themselves. Articulating good questions pushes our brain into problem solving mode.
3. Rubber Duck Debugging
Rubber ducks are cute. The problem is, they’re not very smart. For them to understand the nature of the bug we’re dealing with, our explanation needs to be extra thorough. As with the previous method, we need to slow down and simplify things for them.
Staring into the guileless smiling innocence of our duck, we often find ourselves wondering: is there an easier way? Does this have to be done at all?
4. Ask a Colleague
There are times when talking to an inanimate object is not feasible (open plan office?). Or the solution to the problem might lie outside our domain or scope.
Nothing happens in isolation, and there is something to be said about teamwork. Having a coding buddy can alleviate some of the horrors of getting stuck. Heaven forbid, it might even make debugging fun.
Peer review (many eyes on the code) is a great way to work with fellow developers and catch bugs before they ship. It’s also a nice way to learn and develop professionally.
5. Plan Ahead
Decide how much individual effort (time) you intend to invest on debugging a particular issue, before you move on to something else.
Sometimes we’re dealing with special types of bugs. Like the elusive heisenbugs, or fractal bugs that point to ever more bugs. It’s easy for a bug to turn into a productivity black hole.
When you reach the end of the allotted time, it’s best to move on and tackle something else. Don’t let one task (bug) swallow other priorities and jeopardise the progress of your project.
Once you’ve context-switched, worked on something else, and had a break, you’re better equipped to revisit the bug and determine your next steps (and priorities) with a clear mind.
By the way, it’s worth mentioning debugging tools here. Things like central logging, error monitoring et cetera can make a huge difference to how quickly you solve a bug. We will discuss this topic in a future post.
Debugging efforts often go awry and we find ourselves lost in productivity rabbit holes. It happens when our mind trades hard problem solving for easier, repetitive activities that lead nowhere.
Learn to spot when you’re getting stuck. Know the signs and get better at climbing out of those rabbit holes. Context switching, slowing down, rubber duck debugging, and planning ahead, are proven methods that get you back on course.
What Chaos Monkey does is simple. It runs on Amazon Web Services and its sole purpose is to wipe out production instances in a random manner.
The rationale behind those deliberate failures is a solid one.
Setting Chaos Monkey loose on your infrastructure—and dealing with the aftermath—helps strengthen your app. As you recover, learn and improve on a regular basis, you’re better equipped to face real failures without significant, if any, customer impact.
Since we don’t use AWS or Java, we decided to build our own lightweight simian in the form of a simple Python script. The end-result is the same. We set it loose on our systems and watch as it randomly seeks and destroys production instances.
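For illustration, here is a minimal sketch of the shape such a script can take. The host list and the power_off / disable_network / notify_chat helpers are placeholders for whatever infrastructure API and chat integration you use (SoftLayer and HipChat in our case, as described below); it is not our actual script.

```python
import random

# Hypothetical inventory of production hosts eligible for chaos events.
PRODUCTION_HOSTS = ["web1", "web2", "mtx1", "mtx2", "lb1"]


def power_off(host):
    # Placeholder: call your provider's API to power the host down (full failure).
    print("powering off %s" % host)


def disable_network(host):
    # Placeholder: disable a network interface (partial failure - the host keeps running).
    print("disabling networking on %s" % host)


def notify_chat(message):
    # Placeholder: post a "soft" warning to your team chat room.
    print(message)


def chaos_event():
    victim = random.choice(PRODUCTION_HOSTS)
    failure = random.choice([power_off, disable_network])
    # Announce that something happened without saying what or where, so the
    # team still has to triage the alerts as they would in a real outage.
    notify_chat("Chaos event triggered - keep an eye out for strange things!")
    failure(victim)


if __name__ == "__main__":
    chaos_event()
```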
What follows are our observations from those self-inflicted incidents, followed by some notes on what to consider when using a Chaos Monkey on your infrastructure.
1. Trigger chaos events during business hours
It’s never nice to wake up your engineers with unnecessary on-call events in the middle of the night. Real failures can and do happen 24/7. When it comes to Chaos Monkey, however, it’s best to trigger failures when people are around to respond and fix them.
2. Decide how much mystery you want
When our Python script triggers a chaos event, we get a message in our HipChat room and everyone is on the lookout for strange things.
The message doesn’t specify what the failure is. We still need to triage the alerts and determine where the failures lie, just as we would in the event of a real outage. All this “soft” warning does is lessen the chance of failures going unnoticed.
3. Have several failure modes
Killing instances is a good way to simulate failures but it doesn’t cover all possible contingencies. At Server Density we use the SoftLayer API to trigger full and partial failures alike.
A server power-down, for example, causes a full failure. Disabling networking interfaces, on the other hand, causes partial failures where the host may continue to run (and perhaps even send reports to our monitoring service).
4. Don’t trigger sequential events
If there’s ever a bad time to set your Chaos Monkey loose, it’s during the aftermath of a previous chaos event. Especially if the bugs you discovered are yet to be fixed.
We recommend you wait a few hours before introducing the next failure. Unless you want your team firefighting all day long.
5. Play around with event probability
Real world incidents have a tendency to transpire when you least expect them. So should your chaos events. Make them infrequent. Make them random. Space them out, by days even. That’s the best way to test your on-call readiness.
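As a sketch of points 1 and 5 together, the gate in front of the script can be as simple as the function below; the hours, weekday range and probability are illustrative values rather than our actual settings:

```python
import random
from datetime import datetime


def should_trigger(now=None, probability=0.05):
    """Run from cron every hour; only occasionally let a chaos event through."""
    now = now or datetime.now()
    # Business hours only, Monday to Friday (see point 1 above).
    if now.weekday() >= 5 or not (10 <= now.hour < 17):
        return False
    # A low, random probability keeps events infrequent and unpredictable.
    return random.random() < probability
```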
We’ve been triggering chaos events for some time now. None of the issues we’ve discovered so far were caused by the server software. In fact, scenarios like failovers in load balancers (Nginx) and databases (MongoDB) worked very well.
Every single bug we found was in our own code. Most had to do with how our app interacts with databases in failover mode, and with libraries we’ve not written.
In our most recent chaos run we experienced some inexplicable performance delays during two consecutive MongoDB server failovers. Rebooting the servers was not a viable long-term fix as it resulted in a long downtime (>5 minutes).
It took us several days of investigation until we realised we were not invoking the MongoDB drivers properly.
The app delays caused by the chaos run happened during work hours, so we were able to look at the issue immediately rather than wait for an on-call engineer to be notified and respond, which would have made the investigation harder.
Such discoveries help us report bugs and improve the resiliency of our software. Of course, it also means additional engineering hours and effort to get things right.
The Chaos Monkey is an excellent tool to test how your infrastructure behaves under unknown failure conditions. By triggering and dealing with random system failures, you help your product and service harden up and become resilient. This has obvious benefits to your uptime metrics and overall quality of service.
And if the whole exercise has such a cool name attached to it, then all the better.
Editor’s note: This post was originally published on 21st November, 2013 and has been completely revamped for accuracy and comprehensiveness.
While there is no such thing as 100% secure, you can take specific measures to mitigate a wide range of attacks and secure your webapp as much as possible.
In this post we discuss some of the steps we’ve taken as part of our efforts to secure our server monitoring tool.
1. Cover the Basics
Before considering any of the suggestions listed here, make sure you’ve covered the basics. Those include industry best practices like protecting against SQL injection and XSRF attacks, filtering user input, and handling sessions securely.
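To pick one of those basics: SQL injection is avoided by never splicing user input into a query string; parameterised queries let the database driver treat input strictly as data. A minimal sketch, using Python’s built-in sqlite3 driver purely for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

user_supplied = "alice' OR '1'='1"  # a typical injection attempt

# Vulnerable: user input spliced straight into the SQL string.
# rows = conn.execute("SELECT * FROM users WHERE name = '%s'" % user_supplied)

# Safe: the driver treats the value as data, never as SQL.
rows = conn.execute("SELECT * FROM users WHERE name = ?", (user_supplied,))
print(rows.fetchall())  # [] - the injection attempt matches nothing
```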
2. Force SSL Everywhere
When we launched Server Density in 2009, we offered HTTPS for monitoring agent postbacks but didn’t go as far as to block standard HTTP altogether.
Later on, when we made the switch to HTTPS-only, the change was nowhere near as onerous as we thought it would be.
SSL is often viewed as a performance bottleneck but that isn’t really true. In most situations, we see no reason not to force SSL for all connections right from the start.
Server Density v2 uses a new URL. As part of this, we can force SSL for new agent deployments and access to the web UI alike. We still support the old domain endpoint under non-SSL but will eventually be retiring it.
To get an excellent report on how good your implementation is, run your URL against the Qualys SSL server test. Here is ours:
3. Support SSL with Perfect Forward Secrecy
Every connection to an SSL URL is encrypted using a single private key. If someone obtains that key they can decrypt and access the traffic of that URL.
Perfect forward secrecy addresses this risk by negotiating a new key with every session. A compromise of one key would therefore only affect the data in that one session.
To do this, you need to allow certain cipher suites in your web server configuration.
ECDHE-RSA-AES128-SHA:AES128-SHA:RC4-SHA is compatible with most browsers (for more background and implementation details check out this post).
We terminate SSL at our nginx load balancers and implement SSL using these settings:
You can easily tell if you’re connected using perfect forward secrecy. In Chrome, just click on the lock icon preceding the URL and look for ECDHE_RSA under the Connection tab.
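You can check the same thing from a script. This short Python snippet (standard library only; swap in your own domain) prints the negotiated cipher suite; an ECDHE or DHE prefix means ephemeral keys, i.e. forward secrecy:

```python
import socket
import ssl

host = "example.com"  # replace with your own domain

context = ssl.create_default_context()
with socket.create_connection((host, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=host) as tls:
        cipher, version, _bits = tls.cipher()
        print(cipher, version)
        print("Forward secrecy:", cipher.startswith(("ECDHE", "DHE")))
```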
4. Use Strict Transport Security
Forcing SSL should be combined with HTTP Strict Transport Security. Otherwise you run a risk of users entering your domain without specifying a protocol.
For example, typing example.com rather than https://example.com and then being redirected to HTTPS. This redirect opens a security hole because there’s a short time when communication is still over HTTP.
You can address this by sending an STS header with your responses. The header includes a max-age that the browser stores; for that period the browser converts HTTP to HTTPS locally, without ever issuing a plain HTTP request:
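The header itself is a single line. Here is a minimal sketch of setting it, shown with Flask purely for illustration; in practice you can set it in your framework or at whichever web server terminates your HTTPS traffic:

```python
from flask import Flask

app = Flask(__name__)


@app.after_request
def add_hsts(response):
    # Tell browsers to use HTTPS for the next year, subdomains included.
    response.headers["Strict-Transport-Security"] = "max-age=31536000; includeSubDomains"
    return response
```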
5. Set a Content Security Policy
Our Content Security Policy has to specifically allow unsafe-eval, as a number of third party libraries require it. You might not use any third party libraries, or the libraries you do use may not require unsafe-eval, in which case you should not allow it.
Be careful with wildcarding on domains which can have any content hosted on them. For example wildcarding *.cloudfront.net would allow anyone to host any script. This is Amazon’s CDN which everyone can upload files to!
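For illustration, a hypothetical policy along these lines (the CDN hostname below is a made-up placeholder, not our configuration) can be sent the same way as the STS header:

```python
from flask import Flask

app = Flask(__name__)


@app.after_request
def add_csp(response):
    # Scripts may only come from our own origin and one explicitly named CDN
    # host; unsafe-eval is granted only because a third party library needs it.
    response.headers["Content-Security-Policy"] = (
        "default-src 'self'; "
        "script-src 'self' 'unsafe-eval' https://d1234abcd.cloudfront.net"
    )
    return response
```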
6. Secure Password Reset and Remember Me
Password resets should be triggered out of band: for example, require a username, then email a one-time, expiring link to the on-record email address where the user can choose a new password. Here is more guidance and a checklist.
“Remember me” functionality should use secure tokens to recognise the user, rather than storing their credentials in cookies.
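A minimal sketch of the token side of both suggestions, using only the standard library; the storage notes in the comments and the one-hour expiry are illustrative rather than prescriptive:

```python
import secrets
from datetime import datetime, timedelta


def issue_reset_token(validity_hours=1):
    """Create a single-use, expiring token to embed in the emailed reset link."""
    token = secrets.token_urlsafe(32)  # unguessable and URL-safe
    expires_at = datetime.utcnow() + timedelta(hours=validity_hours)
    # Store (token, expires_at, used=False) against the user record, then email
    # a link containing the token to the on-record address.
    return token, expires_at


def issue_remember_me_token():
    """A random token for the 'remember me' cookie - never the credentials."""
    token = secrets.token_urlsafe(32)
    # Store a hash of the token server side and set the raw value in a
    # Secure, HttpOnly cookie; rotate it whenever it is used.
    return token
```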
7. Offer Multi Factor Authentication
If your webapp is anything more than a trivial consumer product, you should implement, and encourage your users to use, multi factor authentication.
This requires them to authenticate using something they carry with them (a token) before they can log in. An attacker would therefore need both the token (a phone, RSA SecurID etc.) and the user’s credentials before they can obtain access.
We use the Google Authenticator standard because authenticator apps and server-side libraries are available for pretty much every platform.
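The Google Authenticator standard is just TOTP (RFC 6238), so verifying a code server side takes a few lines with an existing library. A minimal sketch using the third party pyotp package (one option among many):

```python
import pyotp

# Generated once per user at enrolment time and stored with their account.
secret = pyotp.random_base32()

# Shown to the user as a QR code / otpauth URI for their authenticator app.
uri = pyotp.TOTP(secret).provisioning_uri(name="alice@example.com", issuer_name="Example App")


def verify_code(user_secret, submitted_code):
    # valid_window=1 tolerates one 30-second step of clock drift.
    return pyotp.TOTP(user_secret).verify(submitted_code, valid_window=1)
```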
It is quite onerous to install a custom, proprietary MFA app so we don’t recommend you implement your own system.
Be sure to require re-authentication for sensitive actions like adding or removing MFA tokens. We require re-authentication for all user profile changes.
We do however have a timeout in place during which users won’t have to re-authenticate. This timeout applies to simple actions like changing passwords (adding or removing tokens requires re-authentication even during the timeout).
To sum up, MFA is crucial for any serious application as it’s the only way to protect against account hijacking.
Security comes from the Latin root sēcūrus. It means free from care. Some adjectives associated with this word are untroubled, fearless, and composed.
Security provides a safe space for humans to stretch their imagination and be as creative as they can. It allows for growth.
It also allows for focus. For small companies like ours, security unfetters our potential to improve our product and serve our customers.
Good security is not an add-on, a feature or a separate effort. It is an essential building block of our work. And that should be reflected in everything we do, including our people, our infrastructure, our technologies and our product.
Let’s start with people.
The Role of People
“If you think technology can solve your security problems, then you don’t understand the problems and you don’t understand the technology.”
Bruce Schneier
All fourteen collisions involving Google’s self-driving cars were caused by human error, according to Google. The drivers involved in those accidents were all distracted. It turns out that humans are the weakest link when it comes to safe systems.
There are a number of ways we approach (and mitigate) this risk. To begin with, we try and have as many “eyes on the code” as possible.
As part of our code review and deployment process we test each other’s code and try to break it. We are a small and tightly knit team, which is great. But we don’t know it all.
To reduce the risk of blind spots and confirmation bias (we are only human!), we work with independent security consultants who inspect our product (and code) on a regular basis.
Another resource we are looking into (but haven’t leveraged yet) is the specialised skillsets of the crowd. There are some compelling platforms for bug bounties and bug reporting out there. Large companies like Google and Tesla, and smaller ones like LastPass and Drupal, have used them for a while.
Now let’s turn our attention to technology, and how we can secure it.
Multi Factor Authentication
Multi Factor Authentication (MFA) requires the user to authenticate using something they physically have with them before they can log in. It’s the only way to protect against account hijacking.
We use MFA internally as much as we can. For example, we enforce Google Authenticator for Gmail, Google Drive and all our Google Apps.
We also encourage all our customers to activate MFA for their Server Density account:
Our computers are full-disk encrypted (we use FileVault, PGP Full Disk Encryption or Espionage, depending on the OS). We also encrypt some of our email communications with GnuPG, one of the tools that Edward Snowden used to protect his communications about the NSA.
Up to Date Software
We make sure we are always running the latest bug-fix releases of all the software we use. This includes web browsers, messaging clients, OS components and the OS itself.
We like Google Chrome for its tight integration with Google Apps but also for its auto-update feature which keeps the browser secure.
We are not big on browser add-ons. Click-to-play is an exception as it helps us prevent browser plugin vulnerabilities (Flash and Java in particular). We also use this Chrome extension to protect against phishing on our Google accounts.
We also recommend Fluffify, our very own Chrome extension. It won’t make you any more secure, but it will keep you sane.
The second law of thermodynamics states that entropy always increases with time. When it comes to guessing passwords, however, time always increases with entropy.
Our passwords are at least 20 characters long. They comprise a mix of upper and lower case letters, numbers and symbols. They are also unique to each system, which means that if one system is compromised, others will not follow suit.
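Generating a password that meets those requirements takes a few lines of Python (standard library only); in practice a password manager does this for us:

```python
import secrets
import string


def generate_password(length=20):
    # Upper and lower case letters, digits and symbols, chosen with a CSPRNG.
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(secrets.choice(alphabet) for _ in range(length))


print(generate_password())  # unique per system, never reused
```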
We keep offsite and easily accessible backups of all our passwords (using tools like 1Password) to allow for easy reset of all account passwords in the event of a breach.
We never share passwords. Each of us has our very own set of credentials. This helps us deal with red-flag scenarios. Like revoking employee privileges when they leave. Or auditing who accessed a particular server or database.
According to the principle of least privilege, every process or user should only be able to access the resources they need. User administration is a key component of our product:
Secure Data Flows
For Server Density to work we ask our customers to install a lightweight agent on their server. All this does is collect various system metrics and constantly report back.
A deliberate restriction is that data can only travel one way: from the client’s server to ours. That rules out any possibility of remote execution.
From that point everything is encrypted. In fact, encrypted postbacks are the only option.
We use ports that are usually already open (HTTPS port 443) which means there is no need to configure anything new. No root access required either. And because our agent is open source, our customers have full visibility of what is running at all times.
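Conceptually, the agent’s job reduces to the sketch below. The URL, payload fields and agent key are placeholders rather than our actual protocol; the real agent is open source if you want the details:

```python
import json
import urllib.request


def collect_metrics():
    # Placeholder for the system metrics an agent gathers (CPU, memory, disk...).
    return {"load_average": 0.42, "memory_used_mb": 1024}


def postback(payload, agent_key="YOUR-AGENT-KEY"):
    # One-way, HTTPS-only: the agent pushes data out over port 443 and never
    # accepts commands back, so there is nothing to execute remotely.
    data = json.dumps({"agentKey": agent_key, "payload": payload}).encode()
    req = urllib.request.Request(
        "https://example.com/postback/",  # placeholder URL
        data=data,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req)


postback(collect_metrics())
```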
According to the investigation, the pilots forgot to disengage a critical wing adjustment mechanism before take-off. It was evidence that even veteran pilots can miss key steps or do things in the wrong order. With hundreds of lives at stake, it was necessary to design around this constraint.
Indeed, Gawande — a doctor himself — writes how key steps in medical procedures were routinely missed, resulting in infections and preventable fatalities. The adoption of checklists reduced those occurrences, and they are now used in a wide range of healthcare settings.
Checklists for DevOps
Not unlike healthcare and aviation, sysadmins are often tasked with systems that touch many lives. Here at Server Density we appreciate the complexities of the systems we run. We also recognise the limits of the people who run them — us. That’s why we use checklists for much of what we do.
Checklists are particularly effective in situations where there is:
Complexity
There is only so much that human memory can hold, reproduce and act upon reliably.
Stress and Fatigue
Incidents may happen at awkward times, like early in the morning when mistakes are more likely. Sysadmins are vulnerable to stress and fatigue like everyone else.
You’d expect a seasoned engineer to intuitively know how to deal with a wide range of contingencies. That is a good thing. Experience and tenure, however, could also encourage people to rely on “gut instincts”, to “wing it” and “shoot from the hip.” In complex situations those attitudes could prove hazardous.
A checklist is a good way to mitigate those problems because it helps us define our response in advance and make it available to everyone. We therefore ensure that every member of our team is taking the right steps in the right order, each and every time.
Checklists at Server Density
Here is one of our own checklists. It defines what our on-call first responders do when a critical incident occurs (we also wrote a guide on how we handle incidents, outages and downtime):
As “common sense” and obvious as the steps may be, they carry great importance for the health of our infrastructure. So we spell them out.
As you would expect, we take our uptime metrics seriously. We’ve got some pretty capable folks taking care of our servers. And we use checklists. Very prescriptive ones like the one below. This one details the steps our on-call people follow when faced with a server failover:
We sed/awk/grep all day long. Our checklists assume we do it for the very first time. At 2:00 in the morning we might have trouble finding the lights, let alone the Puppet master for our configuration.
Here is another example. We use this checklist when a server we monitor stops sending data:
DevOps checklists are as unique as the teams that use them. Each team has their own recipe for doing things and as technology stacks evolve, so do the checklists required to run them.
We aim to have a checklist for every scenario. From restoring a backup to production, deploying fixes, switching primary data centres, and database consistency checks, to responding to traffic spikes, security breaches, critical alerts, and a long list of other contingencies.
Google Docs is a key part of our on-call playbook, so that’s where we store our checklists for everyone to access and update as needed.
Checklists are not static
Relying on checklists does not mean we are inflexible about how we do things. For us, creating checklists is an excellent opportunity to take a step back and review the entirety of our stack.
DevOps checklists work best when we schedule time to update and improve on them.
When it comes to server monitoring, we believe checklists are an important step towards reliable systems. They help our team respond to issues in a consistent and timely manner. This translates to increased uptime and a better capacity to serve our customers.
We launched tags several months ago to allow you to set permissions for different users, but they are the foundation for many more features we’ll be releasing over the coming months. The first of these is tags as a recipient. This allows you to have alerts delivered to all members of a tag, rather than having to set up each of your users on every alert configuration individually.
For example, if you have an “on call” team, you can add all the users to that tag, then set the tag as the recipient for an alert. Each user can have different notification options and any changes you make will apply to all alerts the tag is a recipient for. This is particularly useful if your team changes e.g. new members or staff leaving – you only have to make the change once on the tag and it’ll apply to all alerts.
This is available now on device and service level alerts. It’s not available for group level alerts because our next release will be replacing those with alerts on a tag (so servers and services can have multiple tags, with inheritance across multiple tags).
We often get reports of “false positives” from our availability monitoring when it’s compared with competing products. These usually turn out to be real errors that we’ve detected and others have not! To back up our claims, we have now exposed full details of any errors we see, along with their history, for all of your service checks.
You can browse recent errors, search and filter by location and see errors as they are detected. This will help debug any problems we detect with our availability monitoring.
We’re always working on improvements and fixes and often deploy code 5-10 times a day! So if you find any problems or have ideas for improvements, please get in touch so we can continue to improve.
Each server overview has a top processes widget that gives you a breakdown of the most intensive processes and how many running instances there are:
This is also extended to the snapshot view which you can reach by clicking on a data point on any graph or from the Snapshot tab when viewing a particular server.
A new version of the monitoring agent for Linux, FreeBSD and Mac has been released with a range of bug fixes. This is intended to be the final release of the v1 agent. We’ll soon be releasing sd-agent v2 which will include features such as SNMP, statsd and second by second monitoring.
We released a new configuration interface for managing alerts, the result of several months of work involving design and usability tests. Try it out on your account now and read about the work behind the scenes.
New support site
Our support website has been redesigned, all the articles have been updated and you can now log in to submit/view old tickets. We provide live chat, email and phone support to all customers Monday to Friday, 10am to 6pm UK time.
This is our regular monthly post to keep you up to date with the latest releases to our server monitoring product, Server Density.
Latest value widget
A new widget is available on the dashboard which will show you the latest (current) value for any metric. It will also display the average value over the time period the dashboard is configured for, e.g. the 24 hour or 1 hour average, with a sparkline graph in the background.
New official plugins for entropy, inodes, ProFTP, Zombies and Zookeeper
We’re in the process of retiring our old plugin directory and rewriting many old community plugins into officially supported and updated versions. These are available on GitHub and we’re accepting pull requests for improvements and changes, as well as brand new plugins.
The goal is to make plugins easier to install (just drop the file into your agent plugin directory) and to ensure they are kept up to date and fully supported by us.
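As an illustrative sketch only (mirror the official plugins on GitHub for the exact interface your agent version expects), a plugin is a small Python class whose run method returns a dict of metric names and values:

```python
class ExamplePlugin(object):
    """Illustrative only - follow the structure of the official plugins."""

    def __init__(self, agent_config, checks_logger, raw_config):
        self.agent_config = agent_config
        self.checks_logger = checks_logger
        self.raw_config = raw_config

    def run(self):
        # Return a dict of metric name -> value; the agent posts it back
        # with the rest of the payload on its next check-in.
        return {"example_metric": 42}
```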
New API documentation + Dashboard API
We’ve updated and expanded our API documentation with a new template and example calls for Python, Ruby and Curl.
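For instance, a Python call typically looks like the snippet below; the endpoint path and authentication parameter here are placeholders, so copy the exact URLs and parameters from the documentation:

```python
import requests

API_TOKEN = "your-api-token"

# Placeholder endpoint - see the API docs for the real paths and parameters.
response = requests.get(
    "https://api.example.com/inventory/devices",
    params={"token": API_TOKEN},
)
response.raise_for_status()
for device in response.json():
    print(device["name"])
```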
Over the last few months we have been working on moving our alerts processing backend from Celery + MongoDB to Storm + Kafka, which sets the foundations for a range of new alerting functionality we’ll be releasing from March. Tagging, released in December, is a key part of this. Before then, we’ll be releasing more plugins and full process lists within the UI.