Author Archives: Pedro Pessoa

  1. Subtle differences – tracking down CPU version performance


When unusual activity occurs in a cluster of servers, the first job is to eliminate variables so you can rule out small configuration differences. Often these are small software version differences, but with modern config management tools like Puppet or Chef, that kind of drift is becoming less and less likely.

So when that doesn’t reveal anything unusual, the next step is to look at the hardware. What follows is a classic example of why spotting the details, and understanding what is out of the ordinary in your infrastructure, matters.

    The process

We have weekly reviews of several performance indicators across our infrastructure. This doesn’t replace automated monitoring and alerting on those indicators, but it allows us to spot small performance decreases over time so that we can investigate issues within the infrastructure, schedule time for performance improvements to our codebase and plan for upgrades.

Since we use Server Density to monitor Server Density, this becomes very easy – it usually takes only a couple of minutes of glancing at some preset time intervals on our performance dashboards.

    This is the last 3 months of data on one of those dashboards:

    3 months performance data

    The odd performance values

    Some time ago, soon after an upgrade on one of our clusters, it started to show this load profile:

    Initial cluster load

This is a four-server queue processing cluster running on SoftLayer with dedicated hardware (SuperMicro, Xeon 1270 quad-cores, 8GB RAM). The whole software stack is built from the same source using Puppet, and our deploy process ensures all of the cluster nodes run exactly the same versions.

Why was one of the servers showing lower load for the exact same work? We couldn’t justify any difference, so we asked SoftLayer support:

    There are no discernible differences between the servers

    was the first answer we got.

    The plot thickens

Not happy with servers that should behave the same but didn’t, we looked further into the matter and found yet another, this time more worrying, issue – packet loss on the three servers that showed the higher load:

    Initial cluster packet loss

So we went back to SoftLayer support. They were quite diligent and “looked at the switch/s for these servers, network speed, connections Established & Waiting, apache/python/tornado process etc…” but in the end came back empty except for a subtle difference in the cluster hardware: “all of the processors are Xeon 1270 Quadcores, -web4 is running V3 and is the newest; -web2 and -web3 is running V2; -web1 is running V1“.

When we order new servers we pick the CPU type, but the ordering process doesn’t offer granularity down to the CPU version – the data center team delivers whatever they have ready.
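Although we first learned of the difference from support, the silicon revision can also be checked from the OS. A minimal sketch, assuming the standard Linux /proc/cpuinfo layout (hostnames are placeholders):

```shell
# Print CPU model and stepping; parts sold under similar marketing names
# (e.g. Xeon E3-1270 V1/V2/V3) report different model/stepping values here.
cpu_summary() {
  awk -F': ' '/^model name/ {print "model:", $2}
              /^stepping/   {print "stepping:", $2; exit}' "${1:-/proc/cpuinfo}"
}

# Compare across the cluster, e.g.:
#   for h in web1 web2 web3 web4; do ssh "$h" "$(declare -f cpu_summary); cpu_summary"; done
```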

    The fix

After some research, we discovered that there were some potentially significant differences between the CPU versions, so we decided to eliminate the hardware difference and see what would happen.

SoftLayer usually accommodates special requests and we had no difficulty in getting this through.

The next graph shows the replacement of -web1 and then -web2 and -web3. Can you see when it was done?

    cluster load

    Then a similar plot for the cluster packet loss:

    cluster packet loss

Switching all the servers to a consistent CPU and CPU version solved the problem. The packet loss disappeared and the performance equalised. This is a great example of a very subtle difference having a measurable impact on the operation of a server. Using config management allowed us to quickly eliminate a software cause, at least one that we could control. It’s possible that the CPU version had some issue with the hardware drivers, but it illustrates how important consistency within a cluster is.

  2. Deploying nginx with Puppet


    This post is based on our talk at Puppet Camp Ghent. See our separate page for all our talk videos and slides about how Puppet is used to manage the Server Density monitoring infrastructure.

    We’ve been using Puppet to manage our infrastructure of around 100 servers for over a year now. This originally started with manually building manifests to match the setup in our old hosting environment at Terremark so we could migrate to Softlayer. Over the last few months we’ve been refactoring it to make use of Forge modules and combining roles on some servers to reduce the total number. Using Forge modules allows us to reuse community code rather than reimplementing everything ourselves.

    Puppet Forge


    While preparing our infrastructure for Server Density v2, out in the next few months, one of the areas that required special consideration was our load balancer. We currently use Pound, but the new product required additional functionality that we could only get from Nginx.

Nginx is an event-driven web server with built-in proxying and load balancing features. It has native support for the FCGI and uWSGI protocols, allowing us to run Python WSGI apps in fast application servers such as uWSGI.
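As a sketch of what that proxying looks like (names and the socket path are illustrative, not our actual configuration):

```nginx
# Illustrative only: hand requests to a Python WSGI app running under
# uWSGI, speaking the native uwsgi protocol over a Unix socket.
upstream python_app {
    server unix:/tmp/app.sock;
}

server {
    listen 80;

    location / {
        include uwsgi_params;   # standard parameter set shipped with nginx
        uwsgi_pass python_app;
    }
}
```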

Unlike Pound and Varnish, new builds of Nginx support both WebSockets and Google’s SPDY standard, plus supported third party modules for full TCP/IP socket handling, all of which lets us pick and mix between our asynchronous WebSocket Python apps and our upstream Node.js proxy.

    This obviously had to be both fully deployable and fully configurable through Puppet.

    Searching for a Puppet solution

The choice was between writing our own manifest – adding yet another one to the collection – or avoiding reinventing the wheel by reaching out to the community to see whether our problem had already been solved (or at least find some sort of kick start we could stand on).

Since we believe more in reuse than in rolling our own, a visit to the Forge was inevitable.

    Our setup

    • We run Puppet Enterprise using our own manifests. These are pulled from our GitHub repo by the puppet master.
    • Puppet Console and Live Management get used quite intensively to trigger transient changes such as failover switches (video from PuppetConf 2012).

    Integrating it all

Having selected the Puppet Labs nginx module, we needed to add it to our Puppet Enterprise (PE) setup. We needed to A) get the actual code in, and B) have it run on existing nodes.

    There are 2 ways to do A):

    1. puppet module install puppetlabs/nginx
    2. git submodule add on the existing puppet master modules folder

The decision was made easier by looking at the module and realising that we would be making enhancements to it. Since we’ll fork puppetlabs-nginx for those changes, the best way to get that fork onto the puppet master is by having a git submodule.

Now for B). We quickly found out that the PE Console does not yet support parameterised classes such as this one. That left us with the option of merging our site.pp (which is empty) with the Console, it being an ENC and all. But that would kill our ability to use the Console to trigger transient changes as mentioned before – we want to continue to have our node parameters managed in the Console and not in site.pp. For completeness, node inheritance could also be used, but that would probably get quite messy.

    The solution we settled on is having a class hierarchy:

    Our own serverdensity-nginx is used by PE Console and it then includes the nginx class, allowing us to use both the Console node parameters and the nginx module functionality.
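In Puppet terms the wrapper looks roughly like this sketch (class, resource and parameter names are illustrative; note that Puppet class names can’t contain hyphens, so the class is named with an underscore):

```puppet
# Illustrative wrapper: the Console assigns this class to nodes and sets
# node/group parameters, which arrive as top-scope variables; the wrapper
# then declares the parameterised nginx class and its resources.
class serverdensity_nginx {
  class { 'nginx': }

  nginx::resource::upstream { 'app_backend':
    # $::lb_members is a hypothetical node parameter set in the Console
    members => split($::lb_members, ','),
  }
}
```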

    Trigger transient changes using the Console

One of the best aspects of this solution is that it allows us to overcome nginx’s lack of a control interface, keeping the operational functionality we were used to with Pound.

Nginx picks up configuration changes on reload. To, for example, remove one node from the load balancer rotation, you would edit the corresponding configuration file and trigger an nginx reload.

With Puppet, this is achieved by changing the node, or group, parameter using the Console and then triggering a Puppet run on the node. Puppet will then reload nginx. Best of all, in-flight connections are not terminated by the reload.
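A sketch of how the pieces fit together, as an ERB template (names are illustrative): the member list comes from a Console-managed parameter, so removing a host from the parameter and triggering a run regenerates this file and reloads nginx.

```erb
# Illustrative upstream template; @members is a hypothetical
# node/group parameter managed in the PE Console.
upstream app_backend {
<% @members.each do |member| -%>
  server <%= member %>;
<% end -%>
}
```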


    Alternatives, or what could we have done differently?

Later we learned that the Puppet Labs nginx module is derived from James Fryman’s module. It might have been better to pick that one up, since the Puppet Labs module hasn’t been updated since 30th June 2011. Apparently Puppet Labs wants to ensure that all modules in the Puppet Labs namespace on the Forge are best-in-class and actively maintained; they mentioned that they hope to kick off this project in early 2013.

    We have made some modifications to the module:

    • Several SSL improvements. Thank you Qualys SSL Labs!
    • Additional custom logging
    • Some bug corrections and beautification

    And you can find our fork on Github.

  3. High availability with Pound load balancer


Following our migration to SoftLayer and their release of global IPs, we started to implement our multi-datacenter plan by replacing dedicated (static IP) load balancers with Pound and a global IP. Pound itself is very easy to deploy and manage. There’s a package for Ubuntu 12.04, the distribution we use for the load balancer servers, allowing us to have it running out of the box in no time, particularly if what you’re load balancing doesn’t have any special requirements.

    Pound load balancer

Unfortunately, this was not the case for our Server Density monitoring web app, which requires a somewhat longer cookie header. The Pound Ubuntu package is compiled with the default MAXBUF = 4096, but we needed about twice that to allow our header through. This is a bug we didn’t discover until testing Pound, because our hardware load balancers didn’t have this limit, but it highlights something to fix in the next version of Server Density. We don’t particularly like recompiling distribution packages, mostly because we diverge from general usage, and our changes may come under suspicion if some problem later arises with that particular package.

Presented with no other option that wouldn’t break existing customer connections (the cookie is sent before we could truncate it), we decided to start a PPA for our changed Pound package. This carries two advantages we appreciate: it’s shared with the world, and we can make use of Launchpad’s build capabilities.

    Pound Load Balancer Configuration

    Besides the previous application specific change, our Pound configuration is quite simple and managed from Puppet Enterprise – hence the ruby template syntax ahead. From the defaults, we changed:

    • Use of the dynamic rescaling code. Pound periodically tries to modify the back-end priorities in order to equalise the response times from the various back-ends. Although our back-end servers are all exactly the same, they are deployed as virtualised instances on a “public” cloud at SoftLayer, so each can independently suffer a performance impact from its host.
    • A “redirect service”. Anything not ** is immediately redirected to and doesn’t even hit the back-end servers.
    • Each back-end relies on a co-located mongos router process to reach our MongoDB cluster. We use the HAport configuration option to make sure the back-end is taken out of rotation when there is a problem with the database while the webserver is still responding on port 80.
    • Finally, because we have a cluster of load balancers, we needed to be able to trace which load balancer handled the request. For this we add an extra header.
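Putting those changes together, the relevant parts of the configuration look roughly like this sketch (addresses, ports, header name and the redirect target are placeholders, not our real values):

```
# Illustrative Pound fragment; all values are placeholders.
ListenHTTP
  Address 0.0.0.0
  Port    80

  # Trace which load balancer handled the request
  AddHeader "X-Balancer: lb1"

  Service
    HeadRequire "Host: .*example.com.*"
    # Dynamic rescaling: periodically re-weight back-ends by response time
    DynScale 1

    BackEnd
      Address 10.0.0.11
      Port    80
      # Drop the back-end from rotation if its mongos router is down,
      # even while the webserver still answers on port 80
      HAport  27017
    End
  End

  # Catch-all redirect service: unmatched requests never reach the back-ends
  Service
    Redirect "https://example.com"
  End
End
```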

    Redundancy and automated failover

    The SoftLayer Global IP (GIP) is the key to give us the failover capability with minimum lost connections to our service. By deploying two load balancers per data center, being targeted by a single GIP, we can effectively route traffic to any load balancer.

We deploy the load balancers in an active-standby configuration. While the active is targeted by the GIP, receiving all traffic, the standby load balancer monitors the active’s health. If the active load balancer stops responding (ICMP, HTTP or HTTPS), the GIP is automatically re-routed to the standby load balancer using the SoftLayer API. An alert is then raised through PagerDuty, also using their API, so the on-call engineer can respond. There’s no automatic recovery attempt, to avoid flapping and to allow investigation of the event.
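In outline, the standby’s logic looks something like this sketch. `reroute_global_ip` and `page_oncall` are hypothetical wrappers standing in for the real SoftLayer and PagerDuty API calls, and the address is a placeholder:

```shell
# Minimal failover sketch run on the standby balancer.
ACTIVE=203.0.113.10   # placeholder address of the active load balancer

active_healthy() {
  # ICMP then HTTP check; the real setup also checks HTTPS
  ping -c 3 -W 2 "$ACTIVE" > /dev/null 2>&1 &&
  curl -fsS -m 5 "http://$ACTIVE/" > /dev/null 2>&1
}

failover_if_down() {
  if ! active_healthy; then
    reroute_global_ip   # point the Global IP at this standby (SoftLayer API)
    page_oncall         # alert the on-call engineer (PagerDuty API)
    # No automatic fail-back: recovery is manual to avoid flapping.
  fi
}
```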


    For the upcoming version of Server Density, we’ll be deploying nginx because we’ll be using Tornado. Another blog post will be in order by then.

    We’ll also be presenting the integration of Pound and this automation with Puppet Enterprise at PuppetConf 2012 on September 27th and 28th.