Author Archives: Pedro Pessoa

  1. Improving the Service Resilience of your App


    Last week, a customer sent us a note asking how we make sure Server Density remains available around the clock. We love getting those questions. In fact, we take every opportunity to discuss service resilience in great detail.

    As you’d expect, there are two areas that govern how reliably your cloud app runs. The application itself is one. How long does your code operate without failures? How is it architected, and does it scale gracefully?

    Equally important, though, is the infrastructure your app is sitting on. What happens if a VM is taken down? How do you handle datacenter failovers? What systems do you have in place?

    If your app is to deliver on its bold uptime metrics, you need a good handle on both application and infrastructure resilience. Let’s take a look at how we do those at Server Density.

    Service Resilience: the Importance of Application Quality

    At its very core, the application itself needs to be solid, i.e. any failures should be handled internally without causing outages (a normal failure, as opposed to a catastrophic one).

    Better software quality, by definition, means lower incident rates. To encourage application resilience, it makes sense to expose everyone—including devs and designers—to production issues. In light of that, here at Server Density we run regular war games with everyone who participates in our on-call rotation.

    We also recommend running regular Chaos Monkeys. Every single bug we found as a result of chaos events was in our own code. Most had to do with how our app interacts with databases in failover mode, and with libraries we’ve not written. Setting Chaos Monkeys loose on our infrastructure—and dealing with the aftermath—helps us strengthen our app.
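    If you want to try something similar, the exercise can start very small. Below is a minimal, hypothetical sketch rather than our tooling: pick a random host from a test cluster and reboot it over SSH, then watch how your monitoring and failover handling cope. The host list and the dry-run default are placeholders.

```python
# Minimal chaos-style sketch (hypothetical, not Server Density's tooling):
# pick a random host from a *test* cluster and reboot it over SSH.
import random
import subprocess

# Placeholder inventory; point this at a staging cluster you can afford to break.
TEST_HOSTS = ["queue1.test.example.com", "queue2.test.example.com"]

DRY_RUN = True  # flip to False only once you trust your failover handling


def reboot_random_host() -> str:
    victim = random.choice(TEST_HOSTS)
    if DRY_RUN:
        print(f"[dry run] would reboot {victim}")
    else:
        # Assumes key-based SSH access and passwordless sudo on the target.
        subprocess.run(["ssh", victim, "sudo", "reboot"], check=True)
    return victim


if __name__ == "__main__":
    reboot_random_host()
```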

    The Importance of Systems

    Infrastructure resilience is all about ensuring that individual component failures do not affect our overall infrastructure.

    The most obvious way to minimise service interruptions is by having “one more of everything” (for example, if a cluster of 4 servers handles the current load, we would run 5). But you also need systems. When faced with an outage, there should be little doubt as to what needs to happen. Doubts cause delays and errors. The focus should be on executing an established set of steps (see checklists). Any downtime should therefore amount to no more than the time it takes to fail over.

    So, let’s start from the most benign types of failures and gradually up the ante. Let’s take a look at how our infrastructure copes as the stakes get higher.

    Regular Server Maintenance – Downtime: None

    Most of our servers go offline at least once every 30 days. We do full upgrades virtually every month. At the very least, these involve kernel (OS) updates, which require a reboot. By having regular, scheduled server downtime, we get to flex our failover muscle and get better at routing traffic.

    Power and Networking Failures – Downtime: None

    Our provider, SoftLayer, has redundant power supplies, i.e. not just the electricity grid but on-site generators too. Should there be a power outage, the first thing that kicks in is the UPS, whose batteries take over immediately. That buys enough time for the provider to start their generators. To cope with longer power outages, most providers keep fuel stocks on site.

    Datacenters should also provide redundant networking paths: not only the physical cabling, but the networking equipment too. Redundancy throughout. We’ve worked with SoftLayer to understand their architecture so we can place our servers and virtual instances across different failure paths.

    Zone Failures – Downtime: None

    Washington is our primary region. Within that region, our workloads are split across two different zones. Each zone is a separate physical facility, and in our case they’re approximately 14 miles apart. This protects against unforeseen external events, like an excavator cutting a fibre cable and taking out all networking for an entire facility. We experienced this type of outage 3 years ago. In response, we introduced an extra layer of in-region redundancy.

    If our primary datacenter fails for whatever reason, we now have a second, identical and “hot” datacenter within Washington. That datacenter is fully up to date (current) and ready to go with zero delay. Virtually every provider now offers multiple zones per region; it’s become a standard.

    Service resilience map

    Regional Failures – Downtime: approx. 2 hours

    In October 2012, Washington DC declared a state of emergency due to Hurricane Sandy. To preempt any service interruption, we decided to fail over to our secondary datacenter in San Jose (SJC).

    It’s important to note that we don’t sit and wait for black swans to happen before testing our readiness. We schedule and test regional failovers on a regular basis (last test was in 2015) so that we’re not taken by surprise in case of a “force majeure”.

    We run our database replication “live” (replication happens within a few seconds) but keep all the other instances as snapshot templates to avoid running idle servers, wasting money and natural resources. However, this means there is a lag between pressing the button and everything starting up. Most failovers are triggered by our own choice, e.g. playing it safe with weather warnings, so this doesn’t matter. But if both our zones were to fail, then full recovery would take some time (5-10 minutes to boot up, 10-15 minutes for automatic procedures to complete, 60 minutes for a human to run through the failover checklist).

    Sustaining geo-level redundancy involves enormous amounts of duplicate systems. At some point, we asked ourselves: is there any way we could make better use of this capacity? After reviewing our options, we decided to move our secondary from San Jose to Toronto, which is geographically nearer. That will help us reduce the latency between the two datacenters from 70ms to 20ms. Minimising the latency means we can make better (and more dynamic) use of all that extra capacity in situations where a complete failover is not necessary. We can run both locations in active mode, achieving the same results as the zone setup but with enough geographic distance to avoid localised events such as weather.

    Provider Failure – Downtime: 1 day

    Are you keeping count? That’s 3 redundant datacenters so far (2 zones in Washington, 1 zone in Toronto). You’d think any more redundancy borders on, dunno, too much? How could two separate regions fail at the same time?

    Well, they can. And they have. In November 2014, Azure suffered a global outage that lasted two hours. (Unlike Azure’s interconnected architecture, SoftLayer and AWS facilities are completely isolated, so this specific type of global outage should be impossible).

    There are several ways we could mitigate this risk. We could have another provider with “hot” infrastructure in place that could pick up our entire workload almost instantly. The folks at Auth0 can fail over from Azure to AWS in 60 seconds.

    The other option would be to align our fate with our provider’s and do nothing. The risk here is obvious. Should our provider face a long term service interruption, we would run out of options.

    We decided to opt for something in the middle. Instead of having a “hot” provider on standby, we have put in place a disaster recovery process using MongoDB’s infrastructure. This involves having a live backup using MongoDB’s Cloud Backup service. We have built our own restore and verify service which runs twice a day to ensure that our backups actually work, and stores a copy of the backup on Google’s storage (versioned so we retain copies going back several weeks).
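    The service itself is internal, but its shape is simple enough to sketch. The following is a hedged, minimal illustration rather than our actual code: restore the latest dump into a throwaway mongod, run a couple of sanity checks with pymongo, then push the archive to a versioned Google Cloud Storage bucket. The paths, port and bucket name are placeholders.

```python
# Hypothetical sketch of a "restore and verify" job (not our actual service).
# Assumes a dump directory from the backup service, mongorestore on the PATH,
# a throwaway mongod already running on VERIFY_PORT, and a versioned GCS bucket.
import subprocess

from pymongo import MongoClient
from google.cloud import storage

DUMP_DIR = "/var/backups/mongodb/latest"        # placeholder path
ARCHIVE = "/var/backups/mongodb/latest.tar.gz"  # placeholder path
BUCKET = "example-backup-archive"               # placeholder bucket name
VERIFY_PORT = 27117                             # throwaway mongod port


def restore_and_verify() -> None:
    # 1. Restore the dump into the throwaway mongod.
    subprocess.run(
        ["mongorestore", "--port", str(VERIFY_PORT), "--drop", DUMP_DIR],
        check=True,
    )

    # 2. Basic sanity checks: every application database holds documents.
    client = MongoClient("localhost", VERIFY_PORT)
    for db_name in client.list_database_names():
        if db_name in ("admin", "local", "config"):
            continue
        for coll_name in client[db_name].list_collection_names():
            count = client[db_name][coll_name].count_documents({})
            print(f"{db_name}.{coll_name}: {count} documents")

    # 3. Copy the archive to versioned object storage so older backups
    #    are retained for several weeks.
    storage.Client().bucket(BUCKET).blob("mongodb/latest.tar.gz") \
        .upload_from_filename(ARCHIVE)


if __name__ == "__main__":
    restore_and_verify()
```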

    A full rebuild would obviously not be instantaneous, but the time it takes is significantly reduced by using Puppet to manage all our systems. We wouldn’t have to do as much by hand because we can easily replicate our existing setup.

    Write Good Postmortems

    Ultimately, there is no such thing as 100% availability. When sufficiently elaborate systems begin to scale, it’s only a matter of time before some sort of failure happens. There is no way around that.

    Writing good postmortems when systems are back online helps restore customer confidence. It demonstrates that someone is investing time in their product. That they care enough to sit down and think things through.

    Summary

    Downtime is expensive in more ways than one. Service interruption can lead to lost revenue, impact your productivity, and tarnish your reputation.

    Ultimately, your availability metrics are an indication of quality. How solid is your application infrastructure? How solid is your failover routine? Furthermore, how solid are your communication, customer care, and postmortems?

    Attaining 100% availability might be an impossible feat. How well you prepare, plan and execute around it, is not.

  2. Cluster Optimisation: Hunting Down CPU Differences


    Notice any unusual activity in your cluster?

    The first thing to do is look for any subtle differences between the participating servers. The obvious place to start is—you guessed it—software.

    Given the wellspring (anarchy) of apps sitting on most servers, the task of manually tracking down any deltas in software versions can be onerous. Thankfully, the growing adoption of modern config management tools like Puppet and Chef has made this exercise much easier.

    Once software inconsistencies are ruled out, the next step is to look at hardware. It’s a classic case of playing detective, i.e. searching for clues and spotting anything out of the ordinary in your infrastructure.

    Here is how we do this at Server Density.

    Cluster Optimisation: The Process

    We do weekly reviews of several performance indicators across our entire infrastructure.

    This proactive exercise helps us spot subtle performance declines over time. We can then investigate any issues, schedule time for codebase optimisations and plan for upgrades.

    Since we use Server Density to monitor Server Density, those reviews are easy. It only takes a couple of minutes to perform this audit, using preset time intervals on our performance dashboards.

    The Odd Performance Values

    It was during one of those audits—exactly this time last year—that we observed a particularly weird load profile. Here is the graph:

    Initial cluster load

    This is a 4-server queue processing cluster which runs on SoftLayer with dedicated hardware (SuperMicro, Xeon 1270 Quadcores, 8GB RAM). We’d just finished upgrading those seemingly identical servers.

    The entire software stack is built from the same source using Puppet. Our deploy process ensures all cluster nodes run exactly the same versions. So why was one of the servers exhibiting a lower load for the exact same work? We couldn’t explain the difference.

    With the software element taken care of (config management), we turned our attention to hardware and got in touch with SoftLayer support.

    “There are no discernible differences between the servers,” was their response.

    The Plot Thickens

    Feeling uneasy about running servers that should behave the same but don’t, we decided to persevere with our investigation. Soon we discovered another, more worrying, issue: packet loss on the 3 servers with the higher load.

    Initial cluster packet loss

    Armed with those screenshots, we went straight back to SoftLayer support.

    They were quite diligent and “looked at the switch/s for these servers, network speed, connections Established & Waiting, apache/python/tornado process etc…”

    Even so, they came back empty-handed. Except… for a subtle difference on the cluster hardware:

    “all of the processors are Xeon 1270 Quadcores, -web4 is running V3 and is the newest; -web2 and -web3 is running V2; -web1 is running V1”.

    Smoking Gun

    When ordering new servers, we get to pick the CPU type, but not the CPU version. As it turns out, the datacenter team provides whatever CPU version they happen to have “in stock”.

    We now knew what to look for.

    After some further inspection, we spotted several potentially interesting differences in CPU versions throughout our infrastructure. We decided to eliminate all of them and see what would happen.
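    The check itself is trivial once you know it matters. Here is a rough, hypothetical sketch of the kind of audit we mean, not our internal tooling: read the CPU model string from each server over SSH and group hosts by model, so any version mismatch stands out. The host list is a placeholder.

```python
# Hypothetical CPU version audit: group servers by their CPU model string.
import subprocess
from collections import defaultdict

# Placeholder inventory; in practice this would come from config management.
HOSTS = ["web1", "web2", "web3", "web4"]


def cpu_model(host: str) -> str:
    # "model name" in /proc/cpuinfo includes the version suffix,
    # e.g. "Intel(R) Xeon(R) CPU E3-1270 v3 @ 3.50GHz".
    out = subprocess.run(
        ["ssh", host, "grep -m1 'model name' /proc/cpuinfo"],
        capture_output=True, text=True, check=True,
    ).stdout
    return out.split(":", 1)[1].strip()


def audit() -> None:
    by_model = defaultdict(list)
    for host in HOSTS:
        by_model[cpu_model(host)].append(host)
    for model, hosts in by_model.items():
        print(f"{model}: {', '.join(hosts)}")


if __name__ == "__main__":
    audit()
```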

    SoftLayer is good at accommodating such special requests and we had no difficulty in getting this one through.

    The following graph shows the replacement of -web1 and then -web2 and -web3. Can you spot the improvement?

    cluster load

    Here is a similar plot for cluster packet loss:

    cluster packet loss

    It could be that the CPU version was incompatible with the hardware drivers, or that a whole host of other issues were obscured beneath that CPU version delta. Switching all the servers to a consistent CPU version solved the problem. All packet loss disappeared and performance equalised.

    Summary – What We Learned

    Consistency within clusters is a good thing to have. Specifically:

    1. Even subtle details—configurations, version settings and other indicators that easily go unnoticed—can have a measurable impact on our infrastructure.
    2. Using modern config management tools allows us to eliminate any software discrepancies, and do it quickly.
    3. Scheduling regular proactive reviews of our infrastructure is a fantastic opportunity to spot any lurking issues, plan for codebase optimisations and decide upon hardware upgrades.
  3. Deploying nginx with Puppet


    This post is based on our talk at Puppet Camp Ghent. See our separate page for all our talk videos and slides about how Puppet is used to manage the Server Density monitoring infrastructure.

    We’ve been using Puppet to manage our infrastructure of around 100 servers for over a year now. This originally started with manually building manifests to match the setup in our old hosting environment at Terremark so we could migrate to SoftLayer. Over the last few months we’ve been refactoring it to make use of Forge modules and combining roles on some servers to reduce the total number. Using Forge modules allows us to reuse community code rather than reimplementing everything ourselves.

    Puppet Forge

    While preparing our infrastructure for Server Density v2, out in the next few months, one of the areas that required special consideration was our load balancer. We currently use Pound, but the new product required additional functionality that we could only get from Nginx.

    Nginx is an event-driven web server with built-in proxying and load balancing features. It has native support for the FCGI and uWSGI protocols, allowing us to run Python WSGI apps in fast application servers such as uWSGI.

    Unlike Pound and Varnish, new builds of Nginx support both WebSockets and Google’s SPDY standard, plus there are supported third-party modules for full TCP/IP socket handling, all of which allows us to pick and mix between our asynchronous WebSocket Python apps and our upstream Node.js proxy.

    This obviously had to be both fully deployable and fully configurable through Puppet.

    Searching for a Puppet solution

    The choice was between writing our own manifest, adding yet another one to the collection, or avoiding reinventing the wheel by reaching out to the community and seeing whether our problem had already been solved (or at least finding some sort of kick start we could build on).

    Since we believe more in reuse than in rolling our own, a visit to the Forge was inevitable.

    Our setup

    • We run Puppet Enterprise using our own manifests. These are pulled from our GitHub repo by the puppet master.
    • Puppet Console and Live Management get used quite intensively to trigger transient changes such as failover switches (video from PuppetConf 2012).

    Integrating it all

    Having selected the Puppet Labs nginx module, we needed to add it to our Puppet Enterprise (PE) setup. We needed to A) get the actual code in and B) be able to have it run on existing nodes.

    There are 2 ways to do A):

    1. puppet module install puppetlabs/nginx
    2. git submodule add https://github.com/puppetlabs/puppetlabs-nginx.git in the existing puppet master modules folder

    This decision is made easier by looking at the module and realising that we will be making enhancements to it. Since we’ll fork puppetlabs-nginx for those changes, the best way to get that fork into the puppet master is by having a git submodule.

    Now for B). We quickly found out that the PE Console does not yet support parameterised classes such as this one. So we’re left with the option of doing a merge between our site.pp (which is empty) and the Console, it being an ENC and all. But that would kill our ability to use the Console to trigger transient changes, as mentioned before. We want to continue to have our node parameters managed in the Console and not in site.pp. For completeness, node inheritance could also be used, but it would probably get quite messy.

    The solution we settled on is having a class hierarchy:

    Our own serverdensity-nginx class is used by the PE Console, and it in turn includes the nginx class, allowing us to use both the Console node parameters and the nginx module functionality.

    Trigger transient changes using the Console

    One of the best aspects of this solution is that it allows us to overcome nginx’s lack of a control interface, keeping the operations functionality we were used to with Pound.

    Nginx requires its configuration files to be changed and then loads the new configuration on reload. To remove one node from the load balancer rotation, for example, one would need to edit the corresponding configuration file and trigger an nginx reload.

    With Puppet, this is achieved by changing the node (or group) parameter using the Console and then triggering a Puppet run on the node. Puppet will then reload nginx. And best of all, in-flight connections are not terminated by a reload.


    Alternatives, or what could we have done differently?

    Later we learned that the Puppet Labs nginx module is derived from James Fryman’s module. It might have been better to pick that one up, since the Puppet Labs module hasn’t been updated since 30th June 2011. Apparently Puppet Labs wants to ensure that all modules in the Puppet Labs namespace on the Forge are best-in-class and actively maintained. They mentioned that they hope to kick off this project in early 2013.

    We have made some modifications to the module:

    • Several SSL improvements. Thank you Qualys SSL Labs!
    • Additional custom logging
    • Some bug corrections and beautification

    And you can find our fork on Github.

  4. High availability with Pound load balancer


    Following our migration to SoftLayer and their release of global IPs, we started to implement our multi-datacenter plan by replacing dedicated (static IP) load balancers with Pound and a global IP. Pound in itself is very easy to deploy and manage. There’s a package for Ubuntu 12.04, the distribution we are using for the load balancer servers, allowing us to have it running in no time out of the box, particularly if what you’re load balancing doesn’t have any special requirements.

    Pound load balancer

    Unfortunately, this was not the case for our Server Density monitoring web app, which requires a somewhat longer cookie header. The Pound Ubuntu package is compiled with the default MAXBUF = 4096, but we needed about twice that to allow our header through. This is a bug we didn’t discover until testing Pound, because our hardware load balancers didn’t have this limit, but it highlights something to fix in the next version of Server Density. We don’t particularly like recompiling distribution packages, mostly because we diverge from general usage and these changes will eventually attract suspicion if some problem arises with that particular package.

    Presented with no other option that wouldn’t break existing customer connections (the cookie is sent before we can truncate it), we decided to start a PPA for our changed Pound package. This carries two advantages we appreciate: it’s shared with the world, and we can make use of Launchpad’s build capabilities.

    Pound Load Balancer Configuration

    Besides the previous application-specific change, our Pound configuration is quite simple and is managed from Puppet Enterprise, hence the ruby template syntax ahead. From the defaults, we changed:

    • Use of the dynamic rescaling code. Pound will periodically try to modify the back-end priorities in order to equalise the response times from the various back-ends. Although our back-end servers are all exactly the same, they are deployed as virtualised instances on a “public” cloud at SoftLayer, so they can independently suffer from performance impact on the host.
    • A “redirect service”. Anything not *serverdensity.com* is immediately redirected to http://www.serverdensity.com and doesn’t even hit the back-end servers.
    • Each back-end relies on a co-located mongos router process to reach our Mongo data cluster. We use the HAport configuration option to make sure the back-end is taken out of rotation when there is a problem with the database but the webserver is still responding on port 80.
    • Finally, because we have a cluster of load balancers, we needed to be able to trace which load balancer handled the request. For this we add an extra header.

    Redundancy and automated failover

    The SoftLayer Global IP (GIP) is the key to giving us failover capability with minimal lost connections to our service. By deploying two load balancers per datacenter, targeted by a single GIP, we can effectively route traffic to any load balancer.

    We deploy the load balancers in an active-standby configuration. While the active load balancer is targeted by the GIP and receives all traffic, the standby load balancer monitors the active one’s health. If the active load balancer stops responding (ICMP, HTTP or HTTPS), the GIP is automatically re-routed to the standby load balancer using the SoftLayer API. This situation is then alerted through PagerDuty, also using their API, to allow the on-call engineer to respond. There’s no automatic recovery attempt, to avoid flapping and to allow investigation of the event.
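    To make that concrete, here is a hedged sketch of what the standby’s watchdog loop could look like. It is not our production script: the hostname and thresholds are placeholders, the consecutive-failure count is our own addition to the sketch, and the GIP re-route and paging steps are stubbed out because the real versions call the SoftLayer and PagerDuty APIs.

```python
# Hypothetical watchdog run on the standby load balancer: check the active
# one over ICMP, HTTP and HTTPS, and on repeated failure call placeholder
# functions that stand in for the SoftLayer (GIP re-route) and PagerDuty
# (alerting) API calls.
import subprocess
import time

import requests

ACTIVE_LB = "lb1.example.com"   # placeholder hostname
CHECK_INTERVAL = 10             # seconds between check rounds
FAILURES_BEFORE_FAILOVER = 3    # require consecutive failures to avoid blips


def active_is_healthy() -> bool:
    try:
        subprocess.run(["ping", "-c", "1", "-W", "2", ACTIVE_LB],
                       check=True, capture_output=True)
        requests.get(f"http://{ACTIVE_LB}/", timeout=5).raise_for_status()
        requests.get(f"https://{ACTIVE_LB}/", timeout=5).raise_for_status()
        return True
    except (subprocess.CalledProcessError, requests.RequestException):
        return False


def reroute_global_ip() -> None:
    # Placeholder: the real version points the Global IP at the standby
    # load balancer via the SoftLayer API.
    print("re-routing GIP to the standby load balancer")


def page_on_call() -> None:
    # Placeholder: the real version opens a PagerDuty incident via their API.
    print("paging the on-call engineer")


def watchdog() -> None:
    failures = 0
    while True:
        failures = 0 if active_is_healthy() else failures + 1
        if failures >= FAILURES_BEFORE_FAILOVER:
            reroute_global_ip()
            page_on_call()
            break  # no automatic recovery: a human investigates first
        time.sleep(CHECK_INTERVAL)


if __name__ == "__main__":
    watchdog()
```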

    Next

    For the upcoming version of Server Density, we’ll be deploying nginx because we’ll be using Tornado. Another blog post will be in order by then.

    We’ll also be presenting the integration of Pound and this automation with Puppet Enterprise at PuppetConf 2012 on September 27th and 28th.