Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. How to Monitor Apache



    Editor’s note: An earlier version of this article was published on Oct 2, 2014.

    Apache HTTP Server has been around since 1995 and it’s deployed on the majority of web servers out there (although it has been losing ground to NGINX).

    As a core constituent of the classic LAMP stack and a critical component of any web architecture, it is a good idea to monitor Apache thoroughly.

    Keep reading to find out how we monitor Apache here at Server Density.

    Enabling Apache monitoring with mod_status

    Most of the tools for monitoring Apache require the use of the mod_status module. This is included by default but it needs to be enabled. You will also need to specify an endpoint in your Apache config:

    <Location /server-status>
    
      SetHandler server-status
      Order Deny,Allow
      Deny from all
      Allow from 127.0.0.1
    
    </Location>
    

    This will make the status page available at http://localhost/server-status on your server (check out our guide). Be sure to enable the ExtendedStatus directive to get full access to all the stats.
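
    If you want to script against the status page yourself, the ?auto variant of the endpoint returns a simple key/value format that is easy to parse. Here is a minimal Python sketch (not our agent code) that pulls out a few headline metrics; the URL is whatever you configured above:

    from urllib.request import urlopen

    def fetch_apache_status(url="http://localhost/server-status?auto"):
        # The ?auto endpoint returns one "Key: value" pair per line.
        raw = urlopen(url, timeout=5).read().decode("utf-8")
        stats = {}
        for line in raw.splitlines():
            if ":" in line:
                key, _, value = line.partition(":")
                stats[key.strip()] = value.strip()
        return stats

    if __name__ == "__main__":
        stats = fetch_apache_status()
        # ReqPerSec is only reported when ExtendedStatus is On.
        print("Requests/sec:", stats.get("ReqPerSec"))
        print("Busy workers:", stats.get("BusyWorkers"))
        print("Idle workers:", stats.get("IdleWorkers"))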

    Monitoring Apache from the command line

    Once you have enabled the status page and verified it works, you can use the command line tools to monitor the traffic on your server in real time. This is useful for debugging issues and examining traffic as it happens.

    The apache-top tool is a popular method of achieving this. It is often available as a system package, e.g. apt-get install apachetop, but can also be downloaded from source, as it is just a simple Python script.

    Apache monitoring and alerting – Apache stats

    apache-top is particularly good at i) real time debugging and ii) determining what’s happening on your server right now. When it comes to collecting statistics, however, apache-top will probably leave you wanting.

    This is where monitoring products such as Server Density come in handy. Our monitoring agent supports parsing the Apache server status output and can give you statistics on requests per second and idle/busy workers.

    Apache has several process models. The most common one is worker processes running idle waiting for service requests. As more requests come in, more workers are launched to handle them—up to a pre-configured limit. Once past that limit all requests are queued and visitors experience service delays. So it’s important to monitor not only raw requests per second but idle workers too.

    A good way to configure Apache alerts is by first determining what the baseline traffic of your application is and then setting alerts around it. For example, you can generate an alert if the stats are significantly higher (indicating a sudden traffic spike) or if the values drop significantly (indicating an issue that blocks traffic somewhere).

    You could also benchmark your server to figure out at what traffic level things start to slow down. This can then act as the upper limit for triggering alerts.
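
    To make the baseline idea concrete, here is a hedged sketch of the kind of checks you might run against the stats dictionary returned by the earlier sketch. The thresholds are purely illustrative; substitute the baseline you have measured for your own traffic:

    BASELINE_RPS = 50.0      # your measured "normal" requests per second
    SPIKE_FACTOR = 3.0       # alert if traffic roughly triples
    DROP_FACTOR = 0.2        # alert if traffic falls to 20% of normal
    MIN_IDLE_WORKERS = 2     # alert before the worker pool is exhausted

    def check_apache(stats):
        alerts = []
        rps = float(stats.get("ReqPerSec", 0))
        idle = int(stats.get("IdleWorkers", 0))
        if rps > BASELINE_RPS * SPIKE_FACTOR:
            alerts.append("Traffic spike: %.1f req/s" % rps)
        if rps < BASELINE_RPS * DROP_FACTOR:
            alerts.append("Traffic drop: %.1f req/s" % rps)
        if idle < MIN_IDLE_WORKERS:
            alerts.append("Only %d idle workers left" % idle)
        return alerts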

    Apache monitoring and alerting – server stats

    Monitoring Apache stats like requests per second and worker status is useful in keeping an eye on Apache performance, and indicates how overloaded your web server is. Ideally you will be running Apache on a dedicated instance so you don’t need to worry about contention with other apps.

    Web servers are CPU hungry. As traffic grows Apache workers take up more CPU time and are distributed across the available CPUs and cores.

    CPU % usage is not necessarily a useful metric to alert on because the values tend to be on a per CPU or per core basis whereas you probably have multiple instances of each. It’s more useful to monitor the average CPU utilisation across all CPUs or cores.

    Using a tool such as Server Density, you can visualise all this plus configure alerts that notify you when the CPU is overloaded – our guide to understanding these metrics and configuring CPU alerts should help.

    On Linux the CPU average discussed above is abstracted out to another system metric called load average. This is a decimal number rather than a percentage and allows you to view load from the perspective of the operating system i.e. how long processes have to wait for access to the CPU. The recommended threshold for load average therefore depends on how many CPUs and cores you have – our guide to load average will help you understand this further.
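
    As a quick illustration of that point, here is a small sketch that normalises the load average by the number of cores, so the same alert threshold can be reused across machines of different sizes (the 0.7 threshold is just an example):

    import os

    def normalised_load():
        # os.getloadavg() returns the 1, 5 and 15 minute load averages (Unix only).
        one_min, five_min, fifteen_min = os.getloadavg()
        return five_min / os.cpu_count()

    if __name__ == "__main__":
        load = normalised_load()
        if load > 0.7:
            print("High load: %.2f per core" % load)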

    Monitoring the remote status of Apache

    All those metrics monitor the internal status of Apache and the servers it runs on, but it is important to monitor the end user experience too.

    You can achieve that by using external status and response time tools. You need to know how well your Apache instance serves traffic from different locations around the world (wherever your customers are). Based on that, you can then determine at what stage you should add more hardware capacity.

    This is very easy to achieve with services like Server Density because of our in-built website monitoring. You can check the status of your public URLs and other endpoints from custom locations and get alerts when performance drops or when there is an outage.

    This is particularly useful when you need graphs to correlate Apache metrics with remote response times, especially if you are benchmarking your servers and want to know when a certain load average starts to affect end-user performance.

  2. Diversity is Good Business. Here is Why



    Let’s get the obvious out of the way. The tech industry has a serious and chronic diversity problem. The very industry that’s supposed to spearhead new ideas, innovation and progress is woefully behind the times where it matters most: the heterogeneity of its people.

    Tech workers are predominantly male and white, while non-white workers earn significantly less than their white counterparts. To make matters worse, an overwhelming majority of tech firms do not have gender diverse senior management at the helm. And while there has been some welcome transparency in the last few years (annual diversity reports and so on) it was not followed by any meaningful change in momentum. Minorities continue to be underrepresented and women continue to leave the tech industry at greater rates than their male peers.

    What this indicates is that we cannot deal with diversity in the same way we tackle most problems in tech. In other words . . .

    This is not a metrics problem

    We can’t approach diversity as a hiring quota challenge, hard as that challenge may be. The diversity issue goes deeper than that. It’s a culture problem that starts from schooling and education before it expresses itself everywhere else, including boardrooms, office corridors and water cooler corners.

    Within companies, diversity starts at the top.

    Leadership is where culture is born and shaped. As a corollary, any investments in hiring can easily go to waste if the company is not driven by culturally diverse values. What good is hiring more people if the workplace cannot integrate and retain their talents?

    And while we’re at it: what’s so good about diversity? Why do we want it? Is it because of an upcoming equal opportunity report? Are we paying lip service to diversity because that’s what everyone else is doing?

    Behind most of those questions lies an inherent aversion to diversity. As if tech companies have to mitigate diversity, tacitly dismissing it as another cost of doing business. This is not only short-sighted (diversity takes time and effort) but it is also counterproductive since diversity is associated with creativity, innovation, and real economic benefits.

    Diversity is Good Business

    Ideas generated by people from different backgrounds are informed by different experiences, worldviews, and values. It’s great when ideas get the chance to cross-pollinate like this. As James Altucher says, you combine two ideas to come up with a better idea. A more diverse workplace is therefore a more fertile place for ideas.

    Idea evolution works much faster than human evolution.

    James Altucher

    Now, here is the thing: ideas in diverse environments do not come easy. Why? Because diverse ideas tend to be different. Different (opposing) ideas have to be debated. They have to be weighed, discussed and decided upon. This lack of initial consensus, this creative friction, does not come free. The rigour and discipline involved in negotiating and distilling insights and action plans from a broad and varied pool of ideas comes with an upfront cost. But it bears fruit down the line. This requisite complexity translates into a more thought-out and “creatively hardened” product that has a better chance of surviving against other ideas in the marketplace.

    In short, if you want to create new and better products—products that appeal to a broader audience—you should focus on creating a diverse company culture, starting from the top.

    Our diversity journey

    We live in an increasingly pluralistic society. The majority of our customers are outside the UK; they come from many different backgrounds. By having a more diverse team, we have a better chance of building something that appeals to our diverse customers.

    Server Density launched in 2009, and for much of our first few years it was just a few of us building stuff. Diversity did not become a priority until our team was several engineers strong. Most of them work remotely from various parts of Europe and the UK. Having multilingual folks from different geographies and cultures working in the same team is an incredible creative catalyst for everyone. Our product couldn’t be what it is today if we didn’t have all those different perspectives.

    In line with the overall industry, however, the percentage of female engineers in our team is lower than we would like. We took some time to study this challenge and observe what other companies have done. We wanted to address this now, while our company and culture were in their formative years, realising that any change would be exponentially harder to make a few years down the line.

    So here is what we did.

    Avoid gender-coded job ads

    It turns out that power words (driven, logic, outspoken) are more masculine and attract male candidates, while warmer ones (together, interpersonal, yield) encourage more women to apply. We now use online analysis tools to scan all our job ads and suggest changes before we publish them.
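
    As a toy illustration of what those analysis tools do, the sketch below counts words from small masculine- and feminine-coded lists in a job ad. The word lists here are tiny illustrative fragments, not the full research-based lists the real tools use:

    MASCULINE_CODED = {"driven", "logic", "outspoken", "competitive", "dominant"}
    FEMININE_CODED = {"together", "interpersonal", "yield", "support", "collaborative"}

    def scan_job_ad(text):
        # Normalise each word and report which coded words the ad contains.
        words = {w.strip(".,!?").lower() for w in text.split()}
        return {
            "masculine": sorted(words & MASCULINE_CODED),
            "feminine": sorted(words & FEMININE_CODED),
        }

    print(scan_job_ad("We want a driven engineer who works well together with others."))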

    Another problem, as illustrated by a Harvard Business Review article, is that women tend to avoid applying for roles they are not 100% qualified for, contrary to men who go ahead and apply anyway. To cater for that behaviour we try to remove as many self-selection criteria as possible. We want to be the ones deciding if the candidate is qualified enough, not them. Even if it means more work and delays in filling open positions.

    Avoid unconscious bias

    As part of the hiring process, we ask all our candidates to take a writing test followed by a coding exercise. When we review those, the name of the candidate is now hidden, in order to avoid unconscious bias in assessing those tests.

    Encourage a diverse culture

    The next, and harder, step involves fostering a culture that encourages diverse ideas. We thought long and hard about this. How do you make sure everyone gets a chance to steer the direction of our company and have a voice when it comes to what features we invest in?

    While we are still navigating those questions, we’ve already started making targeted adjustments in how we collaborate. We started running planning games, for example. Planning games is a regular forum where we plan our engineering efforts. Everyone has an equal voice in this meeting and we review and vote on all ideas based on merit. We stand up and defend whatever it is we think. We support and encourage folks to participate.

    We also reviewed our employee handbook, including all company policies, and made significant changes to ensure they are as inclusive as they can be. Many of our policies (equal opportunities, hiring/selection, complaints procedure, code of conduct and maternity/paternity leave) used to be informal. What we found was that just having them written down, and being able to point to them during our recruitment efforts, has a tangible impact. It shows you’ve at least thought about it.

    So we codified our policies in a systematic manner, using pull requests so the proposed format could be discussed by everyone. As an example, if someone feels unable to escalate an issue to their manager, we now have alternative routes in place, including members of the board if needed, and in full confidence.

    Going forward

    As with most worthwhile things, the hardest step is the first one. Going from zero to one employee in underrepresented demographics is invariably undermined by the assumption that if you don’t have diversity, it’s like this for a reason.

    In response to that, we rely on referrals quite heavily as a way to proactively reach out to competent candidates from diverse backgrounds. Obviously that is a short-term measure, and ideally we should gain traction with all demographics sooner rather than later. Having a diverse culture allows you to tap into a broader talent pool, internally and externally. As CTO of Busuu, Rob Elkin, put it, “We just want to make sure that the process for showing that someone should be part of the team is as open and fair as possible.”

    We have also started to sponsor and participate in various industry events that encourage diversity (e.g. COED Code). On top of that, we are looking to broaden where we place our engineering job ads. So far we’ve been publishing them on stackoverflow but we want to reach further and wider.

    In closing

    The Canadian cabinet consists of 30 ethnically and religiously diverse ministers, evenly split between women and men who are mostly aged under 50. While we don’t plan to relocate to Canada just yet, it certainly serves as a great example of leadership that is inclusive and representative of as many people as possible.

    At Server Density we don’t tackle diversity with a single-minded metrics driven approach. This is not a numbers problem as much as it is a culture problem. It’s not so much about putting a tick in a box as it is about i) understanding the challenge ii) internalising the benefits of diversity and iii) making strategic and nuanced changes in the way we lead our people.

    A truly diverse culture is not a compromise. It couldn’t be. It’s a long-term investment into the fundamentals of our team and our future prospects as a company.

  3. How to Write a Postmortem



    When sufficiently elaborate systems begin to scale it’s only a matter of time before some sort of failure happens.

    Failure sucks at the time, but there are significant lessons to be had. Taking the time to extract every last bit of insight from failure is an invaluable exercise. We’d be robbing ourselves of that gift if we skipped postmortems.

    So, despite the grim sounding name, we appreciate postmortems here at Server Density (and we’re in good company, it seems).

    Keep reading to find out why.

    Postmortems restore focus

    [Image: the Eisenhower Decision Matrix (also known as the Merrill Covey Matrix)]

    When faced with service interruptions, we drop everything and perform operational backflips 24×7 until the service is restored for all customers.

    This type of activity classifies as “important” and “urgent” (see quadrant 1 of the “Eisenhower Decision Matrix”).

    When the outage is over, however, we need to consciously shift our focus back to what’s “important” and “not urgent” (see quadrant 2). If we don’t then we risk spending time on distractions and busywork (quadrants 3 and 4).

    The discipline of writing things down requires us to take a pause, collect our thoughts and draft an impartial, sober, and fearless account of what happened, how we dealt with it, what we learned and what steps we’re taking to fix it.

    Postmortems restore confidence

    Right from the beginning, we decided we wanted to treat our customers the same way we wanted to be treated. Generally speaking, enterprise companies (GitHub, Google Cloud, Amazon, et cetera) have more engaged and invested technical audiences who want to know the details of what’s going on. Amazon, for example, offers some great postmortems. We wanted to offer something similar.

    Communicating detailed postmortems helps restore our credibility with our users. It demonstrates that someone is investing time on their product. That they care enough to sit down and think things through.

    When it comes to service interruption, over-communication is a good thing. As is transparency, i.e. acknowledging problems on time and throwing the public light of accountability on all remaining issues until they’re resolved. Going public provides all the incentives we need to fix problems.

    How we write postmortems

    Our postmortems start their lives as long posts on our internal Jira Incident Response page.

    [Screenshot: an internal postmortem post on our Jira Incident Response page]

    Internal outages might not affect our customers but they do take a toll on our engineering team (for example, server failovers waking someone up). We treat those with the same priority and focus. As advocates of Human Ops, we’re all about having the right systems in place so that operational issues don’t spill over into our personal time and impact our wellbeing.

    In case of an actual service outage, we replicate the same postmortem to our dedicated status page (we filter out obvious security specifics). Here is a case that started from Jira (see above) and graduated to our status page:

    [Screenshot: the same postmortem published on our public status page]

    Postmortem timing

    While the crisis is still unfolding we publish short status updates at regular intervals. We stick to the facts, including scope of impact and possible workarounds. We update the status page even if it’s just to say “we’re still looking into it.”

    It usually takes a week from issue resolution to the point where we’re ready to author a full postmortem. That timeframe affords us the opportunity to do a number of things:

    1. Rule out the possibility of follow-up incidents. Ensure the problem is fixed.
    2. Speak to all internal teams and external providers, compare notes with everyone and agree on what went wrong. Mind you, getting in touch with all the right people is not always easy. The outage might’ve occurred over the weekend or during local holidays or the engineer might be on their off-call day.
    3. Decide on a timeline for implementing strategic changes to our process, infrastructure, provider selection, product, et cetera.

    Postmortem content

    Postmortems are no different to other types of written communication. To be effective, their content needs a story and a timeline:

    1. What was the root cause? What turn of events led to the server failover? What roadworks cut what fiber? What DNS failures happened, and where? Keep in mind that a root cause may have set things in motion months before any outage took place.
    2. What steps did we take to identify and isolate the issue? How long did it take for us to triangulate it, and is there anything we could do to shorten that time?
    3. Who / what services bore the brunt of the outage?
    4. How did we fix it?
    5. What did we learn? How will those learnings advise our process, product, and strategy?

    Who writes a postmortem

    Our status updates are published by whoever is leading the incident response or happens to be on call. It’s usually either the ops or the support team.

    Once the issue is resolved, the same people will be expected to draft a postmortem on Jira for everyone to comment and discuss. Once that review is complete, as the CEO, I will then publish that postmortem onto our dedicated public page.

    Summary

    Successful outage resolutions go hand in hand with comprehensive postmortems. If you don’t take the time to document things properly, you rob your team of the opportunity to learn. This opens up the possibility of repeating the same mistakes. You also miss out on an opportunity to grow as a company.

    What about you? How do you deal with failure? Do you write a postmortem, and who is accountable for it?

  4. Secure your Accounts – Team Security Best Practices


    In previous articles we looked at some key technical principles and security best practices for your infrastructure and application development.

    A much larger attack surface, however, is your team.

    People are susceptible to fraud, deception and human error, and that makes us the weakest link when it comes to safe systems. That is why it’s important to have multiple layers of security in place. If one of them falls, the rest are still there to provide protection.

    Team Security best practices

    At Server Density, we maintain an ops security checklist which every new team member is required to complete, and then review on a monthly basis. This ensures we don’t miss out on the easy, low-hanging fruit offered by our security tools.

    While it’s impossible to be 100% secure, there are a number of key team security practices you can adopt to dramatically improve your operational security.

    1. Two-Factor Authentication (2FA)

    Even if you do nothing else, multi-factor authentication is the single most important security tool you need to adopt. Even if you have a poor (or compromised) password or even if you use the same password for multiple accounts, implementing two-factor authentication could compensate for those shortcomings.

    Email is the first tool you should protect with 2FA. And for good reason. All password resets go to an email account, which means email truly is the gateway to your identity.

    Here is how it works. When you log in from new locations, new systems or even from your existing computer (after a time threshold, usually 30 days or so) you will need to verify yourself through an additional token authentication. This may be an app on your phone or a physical device you carry with you.
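
    For the curious, here is a minimal sketch of the mechanism behind those time-based tokens, using the pyotp library (pip install pyotp). It illustrates the idea only; it is not a description of any particular provider’s implementation:

    import pyotp

    # The shared secret is provisioned once, typically via a QR code scanned
    # by the authenticator app on your phone.
    secret = pyotp.random_base32()
    totp = pyotp.TOTP(secret)

    code = totp.now()                    # what the app on your phone displays
    print("Current code:", code)
    print("Valid?", totp.verify(code))   # what the server checks at login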

    2. Strong, Account-Specific Passwords


    Brute force attacks against well protected services such as your Google account are unlikely. That’s due to the rate limiting protections they most likely employ. A weak password could, however, be easy to guess without the need for brute force or any other hacking methods.

    If you use the same password for every online account, then when one of those services is compromised and its unencrypted user database leaked, your password is compromised everywhere. There are many examples of hacked accounts due to online password dumps sourced from other services.

    The best way to protect against this is to have an auto generated password for each individual account. And because it’s impossible to remember hundreds of strong passwords, we suggest you employ a password manager such as 1Password or LastPass.

    Assisted by browser extensions and shortcuts, password managers can often speed up your workflow even more. All you need to remember is a single password to unlock the password manager itself. What’s more, most password managers now have good integrations with the popular mobile OSs, which means your mobile workflow is just as fast.

    Note that, while most password managers are cloud based, it’s prudent to keep your own backup of your password database.
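
    If you prefer something scriptable for generating those account-specific passwords, a sketch like the one below (using Python’s standard secrets module) does the job; the length and character set are up to you:

    import secrets
    import string

    def generate_password(length=24):
        # Draw each character from a cryptographically secure source.
        alphabet = string.ascii_letters + string.digits + string.punctuation
        return "".join(secrets.choice(alphabet) for _ in range(length))

    print(generate_password())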

    3. Secure Connections – SSL

    Accessing web sites and email over encrypted connections sounds like overkill, right? I mean, who is going to snoop my email? Who cares enough to do that?

    This misconception provides a false sense of safety to many people. Here is the thing. You don’t need to be a celebrity or a VIP for your online identity to be valuable to hackers.

    Most Man In The Middle (MiTM) attacks intercept your connection in order to inject malware. The sole purpose of this malware is to recruit your computer into a botnet (or harvest data using keyloggers). Perhaps less common in Western countries, this is a widely exploited attack vector in China.

    4. System Updates

    There is a good reason your phone and computer keeps pestering you for updates. Staying current with the latest OS and app releases helps protect against known vulnerabilities. Google’s Chrome browser has, of course, spoiled us with its silent, automatic updates.

    Make sure you keep all your software, including the core OS, up to date. If you’re not ready for all the new features of new software versions, then at least install all patches and point releases. You can therefore avoid being targeted by malware that takes advantage of known security holes.

    5. Travel Securely with a VPN

    When connecting to any wi-fi network outside your control (airport, cafe, library), you open yourself up to a vast range of possible attacks. The classic Firesheep extension is a great example, as are the more recent drive-by downloads via hotel wireless networks.

    The only way to be certain your connection is secure is to connect via a virtual private network (VPN). As the name suggests, a VPN extends your private network across a public network, such as the Internet.

    It’s worth mentioning that VPN software is one of those products where you get what you pay for. Do not use a free service. This simply moves the vulnerability from the local network to the VPN provider, who is likely making money some other way (selling your data, injecting ads, serving malware, et cetera). When it comes to VPN packages, the saying “if you’re not paying for it, you are the product” definitely applies.

    When setting up your VPN make sure all traffic is blocked until the VPN connection is established. That ensures you don’t leak any data during those few seconds between connecting to the wi-fi and connecting to the VPN.

    Implement Those Now

    Security best practices are all about creating multiple layers of protection, each making it a little bit harder for someone to attack you.

    Setting up those 5 tools takes less than an hour, and what you get is solid protection against all but the most sophisticated of attacks.

    By the way, most hacks are opportunistic (unless you’re being specifically targeted), which means implementing those security practices will deter most hackers from even trying.

    [Image Credit: The great folks at xkcd.com]

  5. What’s in your Backpack? Modular vs. Monolithic Development


    While building version 2.0 of our Server Monitoring agent, we reached a point where we had to make a choice.

    We could either ship the new agent together with every supported plugin, in one single file. Or we could deploy just the core logic of the agent and let users install any further integrations, as they need them.

    This turned out to be a pivotal decision for us. And it was shaped by much more than technical considerations.

    Let’s start with some numbers.

    How Much Does Your File Weigh?

    Simple is better than complex.

    The Zen of Python

    The latest version of our agent allows Server Density to integrate with many applications and plugins. We made substantial improvements in the core logic and laid the groundwork for regular plugin iterations, new releases and updates.

    All that extra oomph comes with a relatively small price in terms of file size. Version 2.0 has a 10MB on-disk footprint.

    If we were to take the next step and push every compatible plugin into a single package, our agent would become ten times “heavier”. And it would only keep growing every time we support something new.

    Moving is Living

    Question: But agent footprint is not a real showstopper, is it? Why worry about file sizes when I can get everything I need in one go?

    Sure.

    There is something to be said about the convenience of the monolithic approach. You get everything you need in one serving.

    And yet, it is the nature of this “component multiplicity” that makes iterations of monolithic applications slower.

    For example, when a particular item (say, the Python interpreter or Postgres library) is updated by the vendor, our users would have to wait for us to update our agent before they get those patches. Troubleshooting and responding to new threats would therefore—by definition—take longer. This delay creates potential attack vectors and vulnerabilities.

    Even if we were on-the-button with every possible plugin update (an increasingly impossible feat as we continue to broaden our plugin portfolio), the majority of our users would then be exposed to more updates than they actually need.

    Either way, it’s a lousy user experience.

    To support all those new integrations—without introducing security risks or needless headaches for our users—was not easy. It took a significant amount of development time to come up with an elegant, modular solution that is simple—and yet functional—for our customers.

    The result is a file that includes the bare minimum: agent code plus some specific Python modules.

    Flexibility and Ease of Use

    To take advantage of all the new supported integrations, users may choose to install additional plugins as needed.

    Question: Doesn’t that present challenges in larger / diverse server environments?

    Probably not.

    Sysadmins continue to embrace Puppet manifests, Chef configuration deployment and Ansible automation—tools designed to keep track of server roles and requirements. It’s easier than ever to stay on top of what plugin goes to what server. Automation and configuration utilities can remove much of that headache. Since we tie into standard OS package managers (deb or RPM packages), we simply work with the existing tools everyone is already used to.

    By packaging the plugins separately we get to focus on what we control: the logic inside our agent. Users only ever download what they need, and enjoy greater control of what’s sitting on their servers. The end-result is a flexible monitoring solution that adapts to our users (rather than the other way around).

    The 1.x to 2.0 agent upgrade is not automatic. Existing installations will need to opt-in. We’ve made it easy to upgrade with a simple bash script. Fresh installs will default to version 2.0. The 1.x agent will still be available (but deprecated). All version 1.x custom plugins will continue to work with the new agent too.

    Summary

    Truth is ever to be found in simplicity, and not in the multiplicity and confusion of things.

    Isaac Newton

    The modular vs. monolithic debate has been going on for decades. We don’t have easy answers and it’s not our intention to dismiss the monolithic approach. There are plenty of examples of closed monolithic systems that work really well for well-defined target users.

    Knowing our own target users (professional sysadmins), we know we can serve them better by following a modular approach. We think it pays to keep things small and simple for them, even if it takes significantly more development effort.

    As we continue improving our back-end, our server monitoring agent will support more and more integrations. Employing a modular approach means prompt updates with fewer security risks. That’s what our customers expect, and that’s what drives our decisions.

    What about you? What approach do you follow?


  6. Server Alerts: a Step by Step Look Behind the Scenes


    Update: We hosted a live Hangout on Air with some members of Server Density engineering and operations teams, in which we discussed the infrastructure described in this blog post. We’ve made the video available, which can be found embedded at the bottom of this blog post.

    Alert processing is a key part of the server monitoring journey.

    From triggering webhooks to sending emails and SMS, this is where all the magic happens. In this article we will explore what takes place in those crucial few milliseconds, from agent data landing on our servers to the final alert appearing on your device.

    But before we dive into the inner workings of alerting, we’d be remiss if we didn’t touch on the underlying technology that makes it all possible.

    We’re all about Python

    At its core, Server Density is a Python company.

    Having an experienced Python team is only a part of it. For every performance problem we’ve faced in our infrastructure over the years, we’ve always managed to find a Python way to solve it.

    There is so much to like about Python.

    We love its syntax and the way it forces us to write readable code. By following the PEP-8 spec we ensure the code is easy to read and maintain. We also appreciate Python’s unit testing capabilities, as they offer invaluable gains to our coding efforts. And while we don’t expect 100% test coverage, we strive to get as close as we can; Python’s testing tools make that straightforward.

    Another key feature is simplicity. From prototyping testing scripts to proof of concept APIs, Python provides numerous small wins and speeds up our workflow. Testing new ideas and trying new approaches is much quicker with Python, compared to other languages.

    Last but not least, Python comes with “batteries included”. The vast number of available modules (they do just about everything you can imagine) makes Python a truly compelling platform for us.

    Our server alerts stack

    Our stack is not 100% Python, but all our backend development is Python based. The main technologies we use (Nginx, Tornado, Kafka, Apache Storm and Zookeeper, among others) all feature in the walkthrough below.

    Now, let’s take a behind-the-scenes look at the alerts processing workflow.

    1. Entering the Server Density Network

    The agent only ever sends data over HTTPS which means no special protocols or firewall rules are used. It also means the data is encrypted in transit.

    It all starts when the JSON payload (a bundle of key metrics the user has chosen to monitor) enters the Cloudflare network. It is then proxied to Server Density and travels via accelerated transit to our Softlayer POP. Using an anycast routed global IP, the payload then hits our Nginx load balancers. Those load balancers are the only point of entry to the entire Server Density network.

    2. Asynchronous goodness

    Once routed by the load balancers, the payload enters a Tornado cluster (4 bare-metal servers, each running one Tornado instance per core on its 8 cores) for processing. As part of this cluster, we use the kafka-python library to implement the Kafka producer. This Tornado app is responsible for the following (a minimal sketch of the pattern follows the list):

    • Data validation.
    • Statistics collection.
    • Basic data transformation.
    • Queuing payloads to kafka to prepare them for step 3 below.
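
    As promised, here is that sketch (illustrative only, not our production code): a Tornado handler that validates an incoming JSON payload and queues it onto Kafka with kafka-python. The endpoint, topic name and port are placeholders:

    import json

    import tornado.ioloop
    import tornado.web
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )

    class PayloadHandler(tornado.web.RequestHandler):
        def post(self):
            try:
                payload = json.loads(self.request.body)   # basic data validation
            except ValueError:
                raise tornado.web.HTTPError(400, "invalid JSON payload")
            producer.send("agent-payloads", payload)      # queue for step 3
            self.write({"status": "queued"})

    if __name__ == "__main__":
        app = tornado.web.Application([(r"/postback", PayloadHandler)])
        app.listen(8080)
        tornado.ioloop.IOLoop.current().start()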

    3. Payload processing

    Our payload processing starts with a cluster of servers running Apache Storm. This cluster is running one single topology (a graph of spouts and bolts that are connected with stream groupings), which is where all the key stuff happens.

    While Apache Storm is a Java based solution, all our code is Python. To bridge the two, we use Apache Storm’s multi-lang feature: special Java based spouts and bolts execute our Python scripts as long running processes, which communicate over stdin and stdout following the multi-lang protocol defined by Apache Storm.
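
    The shape of one of those Python bolts looks roughly like this, using the storm.py helper module that ships with Apache Storm’s multi-lang support (the bolt name and the transformation are illustrative; our real bolts are more involved):

    import storm

    class NormaliseMetricsBolt(storm.BasicBolt):
        def process(self, tup):
            payload = tup.values[0]
            # Illustrative transformation step before passing the payload on.
            payload["hostname"] = payload.get("hostname", "").lower()
            storm.emit([payload])

    NormaliseMetricsBolt().run()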

    The cluster communication is done using Zookeeper (the coordination transport), so the output of one process may automatically end up at a process on another node.

    At Server Density we have split up the processing effort into isolated steps, each implemented as an Apache Storm bolt. This way we are able to parallelise work as much as possible. It also lets us keep our current internal SLA of 150ms for a full payload processing cycle.

    4. Kafka consumer

    Here we use the standard KafkaSpout component from Apache Storm. It’s the only part of the topology that is not using a Python based implementation. What it does is connect to our Kafka cluster and inject the next payload into our Apache Storm topology, ready to be processed.

    5. Enriching our payloads

    The payload also needs some data from our database. This information is used to figure out some crucial things, like what alerts to trigger. Specialized bolts gather this information from our databases, attach it to the payload and emit it, so it can be used later in other bolts.

    At this point we also verify that the payload is for an active account and an active device. If it’s a new device, we check the quota of the account to decide whether we need to discard it (because we cannot handle new devices on that account), or carry on processing (and increase the account’s quota usage).

    We also verify that the provided agentKey is valid for the account it was intended for. If not, we discard the payload.

    6. Storing everything in metrics

    Each payload needs to be split up into smaller pieces and normalized in order to be stored in our metrics cluster. We group the metrics and generate a JSON snapshot every minute, and we retain those snapshots for five days. We also store metrics in an aggregated data format once every hour. That’s the permanent format we keep in our time series database.
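
    As a rough illustration of those two formats, the sketch below turns a list of per-minute samples into the kind of hourly aggregate (min/max/avg) that can be kept long term. The real metrics cluster is considerably more involved:

    from statistics import mean

    def aggregate_hour(minute_values):
        # minute_values: numeric samples collected over one hour for a single metric.
        return {
            "min": min(minute_values),
            "max": max(minute_values),
            "avg": mean(minute_values),
            "count": len(minute_values),
        }

    print(aggregate_hour([0.4, 0.9, 1.3, 0.7]))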

    7. Alert processing

    In this step we match the values of the payload against the alert thresholds defined for any given device. If there is a wait time set, the alert starts the counter and waits for subsequent payloads to check for its expiration.

    When the counter expires (or if there was no wait value to begin with), we go ahead and emit all the necessary data to the notification bolt. That way, alerts can be dispatched to users based on the preferences for that particular alert.
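
    A simplified sketch of that wait-time logic is shown below. Real alert configurations live in our database; here they are plain dictionaries, and the CPU example at the end is purely illustrative:

    import time

    pending = {}   # alert id -> timestamp of the first threshold breach

    def should_notify(alert, value, now=None):
        """Return True when the alert should be passed to the notification step."""
        now = now or time.time()
        breached = value > alert["threshold"]
        if not breached:
            pending.pop(alert["id"], None)      # condition cleared, reset the counter
            return False
        if alert.get("wait", 0) == 0:
            return True                         # no wait time: notify immediately
        first_seen = pending.setdefault(alert["id"], now)
        return (now - first_seen) >= alert["wait"]

    # Example: alert on CPU > 90% sustained for 300 seconds.
    cpu_alert = {"id": "cpu-high", "threshold": 90, "wait": 300}
    print(should_notify(cpu_alert, 95))   # False on the first breach (counter starts)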

    8. Notifications

    Once we’ve decided that a particular payload (or absence of it) has triggered an alert, one of our bolts will calculate which notifications need to be triggered. Then we’ll send one HTTP request per notification to our notifications server, another Tornado cluster (we will expand on its inner workings in a future post; stay tuned).

    Summary

    Everything happens in an instant. Agents installed on 80,000 servers around the world send billions of metrics to our servers. We rely on Python (and other technologies) to keep the trains running, and so far we haven’t been disappointed.

    We hope this post has provided some clarity on the various moving parts behind our alerting logic. We’d love to hear from you. How do you use Python in your mission critical apps?

    Tech chat: processing billions of events a day with Kafka, Zookeeper and Storm


  7. Server Monitoring Alerts: War Stories from the Ops Trenches



    Hush.

    What?

    . . . That’s the sound of nothing . . .

    No alerts, no incidents, nothing. Your infrastructure works like clockwork without a hitch and therefore without alerts. No news is good news. Right?

    Um, yes. But what happens when the illusion is inevitably shattered? What types of scenarios do we face here at Server Density, and how do we respond to them? What systems and tools do we use?

    This post is a collection of war stories from our very own ops environment.

    A Note on our Infrastructure

    Most of it is hosted on a hybrid cloud/dedicated environment at Softlayer. A small portion, however, is hosted on Google Compute Engine. That includes our Puppet Master (1 server), our Build servers (2 servers), and the Staging environment (12 servers).

    There are a number of reasons why we chose Google Compute Engine for this small part of our infrastructure. The main one was that we wanted to keep those services completely separate from production. If we were to host them on Softlayer we would need a different account.

    Here is a low-res snapshot of our infrastructure.

    [Diagram: a high-level overview of the Server Density infrastructure]

    5 Types of Server Monitoring Alerts

    Let’s start by stating the obvious. We monitor and get alerts for a whole bunch of services: MongoDB, RabbitMQ, Zookeeper and Kafka, Web Servers, Load Balancers, you name it. Here are some examples.

    1. Non events

    As much as we try to minimise the amount of noise (see below), there will always be times when our alerts are inconsequential. At some point 50% of the alerts we got were simply ignored. Obviously, we couldn’t abide by such a high level of interruption. So we recently ran a project that systematically eliminated the majority of those. We’ve now reached a point where such alerts are the rare exception rather than the rule.

    Further up the value chain are the alerts we resolve quickly, almost mechanically. These are the types of incidents where we add zero value and shouldn’t have to deal with in an ideal world. For example, the quintessential . . .

    2. “Random disk failures in the middle of the night”

    Sound familiar? We get woken up by a no data alert, open an incident in Jira, try to SSH, no response, launch the provider console, see a bunch of io errors, open a ticket with the provider for them to change disks, go back to bed. Process takes less than 30 minutes.

    Speaking of providers, here is another scenario we’ve seen a couple of times.

    3. “They started our server without telling us”

    Our provider went through scheduled maintenance. As part of this they had to reboot the entire datacenter. We were prepared for the downtime. What we didn’t expect was for them to subsequently restart servers that had been explicitly shut down to begin with. We were on the lookout for strange things happening, so we were quick to shut those servers down again as soon as we got their alerts.

    Occasionally, the alerts we receive are mere clues. We have to piece things together before we can unearth broader system issues. Here is an example.

    4. Queue Lengths

    We monitor our own queue lengths. As part of asynchronous processing, a worker will take an item from a queue and process it. Long queues could indicate either that the workers are too slow or that the producers are going too fast. The real underlying issue, however, could have nothing to do with workers, producers or their queues.

    What if there is too much network traffic, for example? A network problem won’t break the system and we may never know about it until we deduce it from indirect metrics like, say, queue lengths.

    5. MongoDB seconds_behind_master alert

    If a replica slows down, failover is at risk. If there is packet loss or a low-capacity link then there isn’t sufficient bandwidth between primary and secondary, which means certain operations can’t be replicated to the secondary. The risk is that, in the case of a failover, that delta is lost.
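
    If you want to keep an eye on this yourself, a seconds-behind style figure can be derived from replSetGetStatus. Here is a hedged pymongo sketch (the connection details and the 60 second threshold are illustrative):

    from pymongo import MongoClient

    def replication_lag_seconds(uri="mongodb://localhost:27017"):
        status = MongoClient(uri).admin.command("replSetGetStatus")
        members = status["members"]
        primary = next(m for m in members if m["stateStr"] == "PRIMARY")
        lags = {}
        for m in members:
            if m["stateStr"] == "SECONDARY":
                # Compare the last applied operation time on each secondary
                # against the primary's.
                delta = primary["optimeDate"] - m["optimeDate"]
                lags[m["name"]] = delta.total_seconds()
        return lags

    if __name__ == "__main__":
        for host, lag in replication_lag_seconds().items():
            if lag > 60:
                print("%s is %.0fs behind the primary" % (host, lag))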

    In October 2014 we experienced a weird outage that exemplifies this type of failure. Our primary MongoDB cluster is hosted on our Washington DC datacenter (WDC) and the secondary is in San Jose (SJC).

    Interesting things started to happen when some roadworks in the neighboring state of Virginia caused a fibre cut. This cut broke the connection between the two data centers. It also severed the link between WDC and the voting server in Dallas (DAL). The voting server carries no data. It is a mere arbiter that votes for promotions and demotions of the other two servers.

    Not long after the outage, based on a majority vote from both DAL and SJC, the latter was promoted to Primary. Here is where things get hairy. Following the SJC promotion, the link between the voting server (DAL) and WDC was somehow restored, but the link between WDC and SJC remained down. This weird succession of events left both WDC and SJC in primary server mode for a short amount of time, which meant we had to roll back some operations.

    How often does that happen?

    War Games


    Responding to alerts involves fixing production issues. Needless to say, tinkering with production environments requires some knowledge. One of the greatest time sinks in team environments is knowledge silos. To combat that risk, we document absolutely everything, and we use checklists. On top of that, every three months we organise War Games.

    As the name suggests, War Games is a simulation of incidents that tests our readiness (knowledge) to resolve production incidents. Pedro, our Operations Manager, hand-picks a varied range of scenarios and keeps them private until the day of the main event.

    He then sets up a private HipChat room with each engineer (everyone who participates in our on-call rotation). And then, once everything is ready, he sounds a bulb horn to signal the arrival of alerts.

    The same alert appears in every private HipChat room. Each participant then types down their troubleshooting commands and Pedro simulates the system responses.

    Other than the obvious benefits of increasing team readiness, War Games have often helped us discover workarounds for several limitations we thought we had. There is always a better and faster way of fixing things, and War Games is a great way to surface those. Knowledge sharing at its best. A side-outcome is improved documentation too.

    Summary

    Here at Server Density we spend a significant amount of our time improving our infrastructure. This includes rethinking and optimising the nature of alerts we receive and—perhaps most crucially—how we respond to them.

    We’ve faced a lot of ops scenarios over the last six years, and—together with scar tissue—we’re also accumulating knowledge. To stay productive, we strive to keep this knowledge “fresh” and accessible to everyone in the team. We leverage documentation, checklists, and we also host War Games: the best way to surface hidden nuggets of knowledge for everyone in the team to use.

    What about you? What types of incidents do you respond to, and how’ve you improved your ops readiness?

  8. Life on Call: Productivity and Wellbeing around the Clock


    Picture this:

    They hand you a portable gadget that makes Gordon Gekko’s cellphone look hip. You must carry it 24/7. Every hour—day and night—you have to switch it on and listen to a series of three-digit code numbers. If you hear your number you need to race to the nearest telephone to find out where you’re needed. How does that sound?

    Launched in 1950 by physicians in New York, the beeper marked a seminal point. Work didn’t quite end when you left the office anymore.

    SaaS on-call

    In the early noughties, you needed north of a million dollars before your startup wrote its first line of code.

    All that money was used to buy things that we now get for free. That’s because much of our infrastructure has since become a commodity we don’t have to worry about.

    Doesn’t that mean fewer moving parts to fix? And less need for on-call work? Probably not. Near-zero infrastructure means near-zero barriers to entry. It means more competition. Less friction also means greater speed. Some call it streamlined, agile, elastic. Others call it frantic. Point is, features are often tested in production these days.

    On-call work has always been about reacting and mitigating. A physician will not invent a cure while on-call. They can only treat symptoms. Same goes with DevOps teams. An engineer will not fix code while on-call. They will do something more blunt, like restart a server. To restart a server they don’t walk to the back-of-house. Instead, they start an SSH session. If it still doesn’t work they raise a ticket. All from the (dis)comfort of their living room sofa.

    Wellbeing

    DevOps and sysadmin teams have not exactly cornered the market for on-call work. Far from it. One in five EU employees are actually working on-call.

    Even if you think your work does not require on-call duties, think again. When was the last time you checked your work email? The occasional notification may sound harmless. It’s not unheard of, however, for emails to arrive past midnight, followed by text messages asking why they were not answered.

    The anxiety of on-call work stems from the perceived lack of control. It doesn’t matter if the phone rings or not. Being on-call and not being called is, in fact, more stressful than a “busy” shift, according to this article. It is this non-stop vigilance, having to keep checking for possible “threats” that is unhealthy.

    We are not here to demonize on-call. As engineers in an industry that requires this type of work, we just think it pays to be well informed of potential pitfalls.

    Goodwill is Currency

    When the alert strikes at stupid o’clock, chances are you’ll be fixing someone else’s problem. It’s broken. It’s not your fault. And yet here you are, in a dark living room, squatting on a hard futon like a Gollum, cleaning somebody else’s mess.

    On-call work is a barometer of goodwill in a team. There is nothing revelatory about this. Teamwork is essential.

    What happens when you (or a family member) is not feeling well? What if you need to take two hours off? Who is going to cover for you if everyone is “unreachable” and “offline”?

    The absence of goodwill makes on-call duty exponentially harder.

    Focus on Quality

    Better quality, by definition, means lower incident rates. Over time, it nurtures confidence in our own systems, i.e. we don’t expect things to break as often. The fear of impending incidents takes a dip, and our on-call shift gets less frightful.

    To encourage code resilience, it makes sense to expose everyone— including devs and designers—to production issues. In light of that, here at Server Density we run regular war games with everyone who participates in our on-call rotation. So when it comes to solving production alerts, our ops and dev teams have a similar knowledge base.

    We also write and use simple checklists that spell things out for us. Every single step. As if we’ve never done this before. At 2:00 in the morning we might have trouble finding the lights, let alone the Puppet master for our configuration.

    Our “Life On Call” Setup

    Each engineer at Server Density picks their own gear. In general we use a phone to get the alerts and a tablet or laptop to resolve them.

    For alerting we use PagerDuty. For multi-factor authentication we run Duo and for collaboration we have HipChat. We also have a Twitter client to notify us when PagerDuty is not available (doesn’t happen often).

    Upon receiving an alert we switch to a larger display in order to fix and resolve it. 80% of our incidents can be dealt with using a tablet. All we need is an SSH client and full browser. The tablet form factor is easier on the back and can be taken to more places than a laptop.

    Planning

    An overnight alert is like an auditory punch in the face. At least you’ve signed up for this, right? What about your partner? What have they done to deserve this?

    To avoid straining relationships it pays to be proactive. Where do you plan to be when on-call? Will you have reception? If an alert strikes, will you have 4G—preferably Wi-Fi—in order to resolve it? What about family obligations? Will you be that parent at the school event, who sits in a corner hiding behind a laptop for two hours?

    Summary

    At best, working on-call is nothing to write home about. At worst, well, it kind of sucks.

    Since it’s part of what we do though, it pays to be well informed and prepared. Focusing on code-resilience, nurturing teamwork, and setting the right expectations with colleagues and family, are some ways we try and take the edge off it.

    What about you? How do you approach on-call? What methodologies do you have in place?

  9. Server Naming Conventions and Best Practices


    Early last year we wrote about the server naming conventions and best practices we use here at Server Density. The post was inspired in part by an issue in one of our failover datacenters in San Jose.

    Our workaround involved using Puppet to modify the internal DNS resolvers of the affected servers. Directing an automated change to the affected servers (our San Jose datacenter servers) turned out to be painless. Why? Because our naming convention includes a functional abbreviation for location:

    hcluster3-web1.sjc.sl.serverdensity.net

    Impressed with how easy deploying the fix was, we decided to talk about it in a post that has since proved quite popular. Frankly, we didn’t expect server naming conventions to be such a gripping topic. So we set out to understand why.

    As it turns out, there are two main reasons why server names strike a chord with people.

    1. Naming Things is Hard

    “There are only two hard problems in Computer Science: cache invalidation and naming things.”

    Phil Karlton

    Naming servers can get very tough, very quickly, partly because deploying them has become incredibly easy. You can have a new server up and running in as little as 55 seconds.


    It’s not unheard of for sysadmins to be responsible for dozens, hundreds, perhaps even thousands of servers these days. The cognitive burden involved with naming and managing rapidly escalating swarms of devices is beyond what humans are accustomed to.

    Most people only ever get to name a few things in their life.


    And yet, that’s what sysadmins find themselves doing for much of their day.

    2. Servers Are not What They Used to Be

    Not long ago, servers occupied physical space. We could see and touch them.

    That’s no longer the case. The elastic nature of deploying infrastructure affords us an order of magnitude more servers that we can’t even see.

    Attempting to stretch old-school naming conventions (planet names and Greek Gods) to the limitless scope of this brave new world is proving to be difficult, if not impossible.

    Our habits and practices hail from a time when caring for each individual box was part of our job. When that doesn’t work, or suffice, we experience cognitive dissonance and confusion.

    “We’ve had flame wars on internal team mailing lists arguing how it should be done to no result,” wrote one sysadmin on Reddit.

    There is no such thing as a golden rule—much less a standard—on how to name servers.

    A sysadmin said they name servers “randomly, based on whatever whim we are on that day”. Other methods were even more state of the art: “I just roll my face on the keyboard. That way there’s sure to be no duplicates.”

    We spoke to Matt Simmons from Standalone Sysadmin to get his expert opinion on this transition.

    “A computer infrastructure largely exists in one of two worlds,” he says. “Either you have so few machines that you individually deal with them, and they’re treated as pets, or you have so many that you can’t individually deal with them, and you treat them like cattle.”

    Servers as Pets – our Old Scheme

    Names give meaning. They allow us to understand and communicate. When we are battling with the same few boxes day in day out, it makes sense to give them personable, endearing names.

    From Aztec Gods and painkillers, to Sopranos and Babar the Elephant, there is no shortage of charming candidates.

    When dealing with a limited amount of servers, “pet” naming conventions work really well. They are short, memorable, and cute. They’re also completely decoupled from the server’s role, which makes it harder for hackers to guess their name (security through obscurity).

    Back when we had a smaller number of servers we based our naming scheme on characters from His Dark Materials by Philip Pullman. A master database server was Lyra and the slave was Pan.

    Much of the formal guidance we found online caters for similar scenarios, i.e. finite numbers of servers. An old RFC offers some astute guidance:

    Words like “moron” or “twit” are good names if no one else is going to see them. But if you ever give someone a demo on your machine, you may find that they are distracted by seeing a nasty word on your screen. (Maybe their spouse called them that this morning.)

    Servers as Cattle – our New Scheme

    “The engineer that came up with our naming scheme has left the company. Nobody knows what is hosted on Prometheus anymore.” Sound familiar?

    There are only so many heroes in The A-Team and Arrested Development. Dog breeds will only get you so far. There comes a point when naming servers has to be more about function than form.

    We moved to our current naming structure a few years ago. This allows us to quickly identify key information about our servers. It also helps us filter by role, provider or specific locations. Here is an example:

    hcluster3-web1.sjc.sl.serverdensity.net

    hcluster3 : this describes what the server is used for. In this case, it’s cluster 3, which hosts our alerting and notification service (our monitoring app is built using a service orientated architecture). Other examples could be mtx2 (our time series metrics storage cluster) or sdcom (servers which power our website).

    web1 : this is a web server (Apache or nginx) and is number 1 in the cluster. We have multiple load balanced web servers.

    sjc : this is the datacenter location code, San Jose in this case. We also have locations like wdc (Washington DC) or tyo (Tokyo).

    sl : this is the facility vendor name, Softlayer in this case. We also have vendors like rax (Rackspace) and aws (Amazon Web Services).

    The advantage of this naming convention is that it scales as we grow. We can append and increment the numbers as needed.
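
    Because the structure is predictable, it is also easy to pull apart programmatically. Here is a minimal sketch of how such a name could be split into its components (the parse_hostname function is purely illustrative and not part of our tooling):

    # Minimal sketch: split a structured hostname into its parts, assuming the
    # <cluster>-<role><n>.<location>.<provider>.<domain> format described above.
    # parse_hostname is a hypothetical helper, shown only for illustration.
    import re

    def parse_hostname(fqdn):
        host, location, provider = fqdn.split(".")[:3]
        cluster, role = host.split("-", 1)
        number = re.search(r"(\d+)$", role)
        return {
            "cluster": cluster,                                   # e.g. hcluster3
            "role": re.sub(r"\d+$", "", role),                    # e.g. web
            "number": int(number.group(1)) if number else None,   # e.g. 1
            "location": location,                                 # e.g. sjc
            "provider": provider,                                 # e.g. sl
        }

    print(parse_hostname("hcluster3-web1.sjc.sl.serverdensity.net"))
    # {'cluster': 'hcluster3', 'role': 'web', 'number': 1, 'location': 'sjc', 'provider': 'sl'}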

    As per above, it’s also easy to vary configuration based on those criteria (role, provider, location). In our Puppet /etc/resolv.conf template we can do things like:

    <% if (domain =~ /sl/) -%>
    <% if (domain =~ /sjc.sl/) -%>
    # google DNS - temp until SL fixed
    nameserver 8.8.8.8
    nameserver 8.8.4.4
    <% else %>
    # Internal Softlayer DNS
    nameserver 10.0.80.11
    nameserver 10.0.80.12
    <% end -%>
    ...
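
    For a host whose domain matches sjc.sl (hcluster3-web1.sjc.sl.serverdensity.net, say), the relevant part of the rendered /etc/resolv.conf ends up looking like this:

    # google DNS - temp until SL fixed
    nameserver 8.8.8.8
    nameserver 8.8.4.4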
    

    One disadvantage with long server names is that they can be unwieldy.

    When compared to “pet” names, “cattle” server names are hard to remember and even harder to type in CLIs. They also need to be updated when servers are moved to different geographies or their roles change.

    Security-wise, they’re often seen as less robust than their “pet” name equivalents. That’s because a descriptive name gives attackers a head start in deducing which servers they want to access.

    Summary

    The transition to cloud computing has caused a dramatic increase in the number of servers that sysadmins are tasked with administering (and naming).

    A good naming convention should make it easy to deploy, identify and filter through your server pool. If you’re only planning to have a handful of servers, then coming up with real names (servers as pets) might suffice.

    For anything remotely scalable, and where quickly identifying your servers is key, consider something more practical and functional (servers as cattle).

  10. Security Principles and Practices: How to Approach Security

    Leave a Comment

    October is Security Month here at Server Density. To mark the occasion we’ve partnered with our friends at Detectify to create a short series of security dispatches for you.

    In our previous three articles we looked at some essential security checks for your web applications, APIs and servers. But once the obvious vulnerabilities are considered, what happens next? How can we stay proactive and, most importantly, how do we become security conscious?

    What follows is a set of underlying security principles and practices you should look into.

    Minimise your Attack Surface

    An attack surface is the sum of the different points (attack vectors) from where an unauthorized user can inject or steal data from a given environment. Eliminating possible attack vectors is the first place to start when securing your systems.

    This means closing down every possible interface you’re not using. Let’s take web apps for example. Ports 80 and 443 should be the only ones open to the outside world. SSH port 22 (preferably changed to something else) should be accessible to a restricted subset of permitted IPs and only developers / administrators should have access. The obvious idea is to limit the scope for outside attackers to creep in.
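
    If you want a quick first pass at what is actually reachable, a few lines of code can check whether a handful of common ports answer on a given host. The sketch below is only illustrative (the host and port list are placeholders, and it is no substitute for a proper scanner like nmap or an external assessment):

    # Rough sketch: report which of a few common ports accept TCP connections.
    # HOST and PORTS are placeholders - substitute your own values.
    import socket

    HOST = "example.com"                       # placeholder host
    PORTS = [22, 80, 443, 3306, 6379, 27017]   # SSH, HTTP, HTTPS, MySQL, Redis, MongoDB

    for port in PORTS:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            sock.settimeout(2)
            is_open = sock.connect_ex((HOST, port)) == 0   # 0 means the connection succeeded
            print(f"{HOST}:{port} {'open' if is_open else 'closed/filtered'}")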

    Here is an example scenario: you run a website with the following two restrictions: i) only developers have admin access, and ii) SSH access is only available through a VPN connection. For a break-in to happen, an intruder would need to compromise a developer’s credentials, and they would also need access to your VPN and SSH keys. The attack would have to be highly coordinated.

    What’s more, any potential intrusion might not yield that much (internal systems may employ “defense in depth” and “least privilege” practices). It’s unlikely an attacker would spend the time and resources to jump through all those hoops (for uncertain gain), purely because there are easier targets out there.

    Most attacks are opportunistic, which is why layers of security are important. Breaching one layer just gets an attacker down to the next one rather than compromising the whole system. The rule of thumb is that attackers go after the easiest targets first. Your systems should, therefore, be as locked down as possible. This includes servers, workstations, phones, portables, et cetera. As the attack surface diminishes, so does the likelihood of hacking attempts.

    If you don’t know what to look out for, third party services can help you determine how breachable your systems are. For example:

    • Detectify can evaluate your web applications
    • Nessus can scope your network-layer security
    • Penetration testers (pentesters) can assess your end-to-end security profile

    You then need to put the effort in and plug the issues that come up.

    Internal Practices and Company Culture

    The strongest of perimeters can’t protect against internal human error. Take “errors of commission,” for example. An employee quits their job, goes to a competitor and leaks intel. How do you anticipate and prevent that?

    Then there is a long list of “errors of omission”. People have businesses to run, busy lives to lead, important things to do. Staying secure is not always top-of-mind and we let things slide.  For example, are employees reminded to encrypt their laptops and portables? When was the last time you monitored your server activity? What systems do you have in place to negate the need to “remember”? Who handles security in your team? Who is accountable?

    Humans are the weakest link when it comes to safe systems. Your internal systems (and practices) need to account for that. Security needs to be a fundamental part of how you work and collaborate on projects.

    “Given enough eyeballs, all bugs are shallow”

    Linus’s Law

    Your internal practices should facilitate as many “eyes on the code” as possible. This can be done with peer reviews and code buddy schemes. To complement your team efforts, there are some compelling platforms for bug bounty and bug reporting you can tap into. [NB: Crowd skillsets are not—strictly speaking—an internal constituent of company culture. Admitting we don’t know it all and asking for help, however, is.]

    What Motivates Hackers?

    Some of them are out to prove a point. Others are criminal gangs looking for financial gains such as extortion and credit card theft. Then there is industrial espionage, botnets and a whole host of ugly stuff. The threat landscape is highly diverse. Ultimately all it takes is a single misstep for an attacker to get the keys to the kingdom.

    It therefore pays to think like a hacker. Why would someone want to hack your server? What data lives there? What is the easiest way in? What could the attacker do once inside?

    “The Enemy Knows the System”

    According to Kerckhoffs’s principle, every secret creates a potential failure point. If you’re relying on “security through obscurity” to stay safe, then your systems are only as safe as your secrets (see the human factor above).

    A secure authentication policy, for example, does not depend on secrecy. Even if a password were compromised (and even a 20 character randomised password can be leaked or phished), an attacker would still need a separate token (MFA) to gain access.
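
    To illustrate that second factor, here is a minimal sketch of time-based one-time passwords (TOTP), one common form of MFA, using the third-party pyotp library (the choice of library is ours; the secret here is generated on the spot purely for demonstration):

    # Minimal TOTP sketch using the pyotp library (pip install pyotp).
    # In practice the secret is generated once per user, shared with their
    # authenticator app, and stored server-side - never hard-coded.
    import pyotp

    secret = pyotp.random_base32()      # per-user secret, provisioned once
    totp = pyotp.TOTP(secret)

    current_code = totp.now()           # what the user's authenticator app shows
    print("Valid login:", totp.verify(current_code))   # True
    print("Guessed code:", totp.verify("000000"))      # almost certainly False

    Even with the password in hand, an attacker still needs the current code from the user’s device, which changes every 30 seconds.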

    Further Reading

    If there is one underlying theme in our security dispatches so far, it is this: security is an incredibly fast moving field, with plenty of complexity and trade-offs involved.

    Getting up to speed and staying on top of the latest security trends and threats is a key requirement in maintaining secure systems and infrastructure.

    Reddit’s /r/netsec is a great starting point. Hacker News tends to highlight the most evil vulnerabilities. There’s a bunch of very skilled security researchers on Twitter. Some indicative profiles are @SophosLabs, @TheHackersNews and @mikko.

    Some blogs we like are:
