Author Archives: Pedro Pessoa

  1. Beyond servers: How we monitor energy consumption

    Running an office is very similar to running a SaaS infrastructure. We need an uninterrupted power supply so we can get things done, but we also want the smallest possible carbon footprint so we can be kind to the environment.

    To achieve that goal, we monitor energy usage in the same way we monitor our servers, using Server Density.

    The following graph illustrates the electricity consumption at one of our coworking spaces in Lisbon. The flat line on the 9th and 10th of April is our weekend usage, while the ensuing spikes occur during work hours. Should our usage rise above what we define as “normal”, Server Density makes sure we get notified.

    Monitor Energy - Week

    A few weeks ago, for example, I received an email alerting me that electricity usage had crossed the 600W Saturday threshold. I live near the office so I walked in, switched the heater off, and placed a friendly reminder on our Slack channel.

    Another thing we can do is zoom out and observe our seasonal consumption patterns. As you can see in the following graph, our baseline consumption was higher in mid-November, and similarly from late December to mid-March. Those patterns correspond to the addition (and removal) of test servers on our office network.

    Monitor Energy - Seasonal

    Since we started monitoring our electricity with Server Density we’ve achieved a 10% reduction in our energy bills. Energy-related alerts prompt us to ask the right questions, power down unneeded equipment, and optimise the devices we use. In short, energy monitoring makes us more accountable and, ahem, less forgetful.

    How we monitor energy consumption

    Server Density can monitor, graph, and alert on anything that it can gather data from (anything with an API really).

    Take the sensor we used, for example. It measures electricity usage by counting the brief flashes (impulses) of the LED on the electricity meter (see how). It’s pretty low tech, it’s cheap, and it has an API, which means we can pull data from it.
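
    As a rough illustration of how counting LED impulses translates into power figures (the 1000 impulses/kWh meter constant below is a typical value, not necessarily what our meter uses):

      def average_power_watts(pulse_count, interval_seconds, impulses_per_kwh=1000):
          """Convert LED impulse counts into average power draw.

          Each impulse represents 1/impulses_per_kwh of a kWh, so the energy seen
          in the interval is pulse_count / impulses_per_kwh (in kWh). Dividing by
          the interval length in hours gives average power in kW; x1000 gives watts.
          """
          energy_kwh = pulse_count / float(impulses_per_kwh)
          hours = interval_seconds / 3600.0
          return energy_kwh / hours * 1000


      # e.g. 50 impulses in 5 minutes on a 1000 imp/kWh meter is roughly 600 W
      print(average_power_watts(50, 300))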

    Monitor Energy - EOT

    All we had to do was attach the sensor to the electricity meter. We then used the EnergyOT Android app to connect the device to our WiFi network. We created a new Server Density account, added a device and installed the agent.

    We then wrote the following plugin:
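
    A minimal sketch of such a plugin, following the Server Density v1 agent plugin convention (a Python class whose run() method returns a dict of metric names and values); the EnergyOT endpoint URL, authentication header and JSON field name below are assumptions rather than the device’s documented API:

      import json
      import urllib2


      class EnergyOT(object):
          """Server Density agent plugin that reports power draw (in watts)
          pulled from an EnergyOT-style HTTP API."""

          # Hypothetical endpoint, key and field name -- adjust for the real device API
          ENDPOINT = "https://app.energyot.example.com/api/v1/readings/latest"
          API_KEY = "YOUR_API_KEY"

          def __init__(self, agent_config, checks_logger, raw_config):
              self.agent_config = agent_config
              self.checks_logger = checks_logger
              self.raw_config = raw_config

          def run(self):
              request = urllib2.Request(self.ENDPOINT)
              request.add_header("Authorization", "Bearer %s" % self.API_KEY)

              try:
                  response = urllib2.urlopen(request, timeout=10)
                  reading = json.loads(response.read())
              except Exception as e:
                  self.checks_logger.error("EnergyOT plugin error: %s" % e)
                  return {}

              # Each key in the returned dict becomes a metric Server Density can
              # graph and alert on (e.g. the 600W Saturday threshold).
              return {"power_watts": float(reading.get("power", 0))}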

    While this code is specific to the EnergyOT endpoint, you can tailor it for any device and endpoint, or you can write your own plugin (here is how). The only prerequisite is the existence of an HTTP API for Server Density to pull data from.

    Monitoring water levels

    In the next few months we will be relocating our Lisbon office away from the hustle and bustle of the city, and into the countryside.

    In much of rural Portugal there is no immediate access to the water grid. It’s common practice to use water pumps to fill water tanks for household use. Pumping that water obviously requires electricity to “lift” it from beneath the ground.

    Most power companies use a lower tariff for “out of hours” consumption, so it makes sense to fill the tank in the evening. At the same time, we don’t want to run out of water during the day. The water pump should therefore be able to add water if levels drop below a minimum threshold.

    To make this happen, we needed a way to monitor the water level. As it turns out, the easiest way to do this is by attaching a sonar inside the tank. The sonar measures the distance to the water surface and reports it to us as a fill percentage. Should that measurement drop below a predefined threshold, we trigger the pump, even if it’s during the day.

    Monitor Energy - Sensor

    This solution uses the same webhooks we used for watering the plants. Instead of alerts, Server Density sends a webhook to the device that controls the pump.
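
    To sketch what the receiving end might look like: a tiny HTTP service on the pump controller that flips a relay over GPIO when the webhook arrives. The URL path, the payload fields and the GPIO pin are all assumptions, not a description of our actual controller:

      import RPi.GPIO as GPIO
      from flask import Flask, request

      PUMP_RELAY_PIN = 17  # assumption: pump relay wired to GPIO17

      app = Flask(__name__)
      GPIO.setmode(GPIO.BCM)
      GPIO.setup(PUMP_RELAY_PIN, GPIO.OUT, initial=GPIO.LOW)


      @app.route("/webhook/pump", methods=["POST"])
      def pump_webhook():
          # The webhook body is JSON; the "fixed" flag below is an assumed field
          # meaning the alert has resolved (water back above the threshold).
          payload = request.get_json(force=True, silent=True) or {}
          alert_resolved = payload.get("fixed", False)

          # Alert firing (level below threshold) -> start the pump; resolved -> stop it
          GPIO.output(PUMP_RELAY_PIN, GPIO.LOW if alert_resolved else GPIO.HIGH)
          return "ok", 200


      if __name__ == "__main__":
          app.run(host="0.0.0.0", port=8080)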

    As part of our tests, we filled a bucket with water and emptied it while taking readings from the sonar. We used a MaxSonar sensor interfaced with a Raspberry Pi, which collects the readings and sends them to Server Density.
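
    For reference, the collection side can be as simple as the sketch below. MaxSonar sensors stream readings over serial at 9600 baud as lines like R1234 (range in mm or inches, depending on the model); the serial port, tank dimensions and percentage calculation here are assumptions:

      import serial  # pyserial

      TANK_DEPTH_MM = 1500.0    # assumption: distance from sensor to tank bottom
      SENSOR_GAP_MM = 100.0     # assumption: gap between sensor and maximum water level


      def read_fill_percentage(port="/dev/ttyAMA0"):
          """Read one range measurement from the sonar and convert it to a
          tank fill percentage (the value we then push to Server Density)."""
          conn = serial.Serial(port, 9600, timeout=2)
          try:
              line = conn.readline().decode("ascii", "ignore").strip()
          finally:
              conn.close()

          # Readings look like "R1234": strip the prefix and parse the number
          distance_mm = float(line.lstrip("R"))

          water_height = TANK_DEPTH_MM - distance_mm
          usable_depth = TANK_DEPTH_MM - SENSOR_GAP_MM
          return max(0.0, min(100.0, water_height / usable_depth * 100))


      if __name__ == "__main__":
          print("Tank fill: %.1f%%" % read_fill_percentage())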

    Here are the readings we collected.

    Sonar Readings

    I don’t have a water tank. Why should I care?

    No water tank? No problem. Pick anything else. If it has an API, then go ahead and monitor it.

    Leveraging the Server Density platform negates the need for in-house monitoring solutions. Instead of investing in non-core functions, manufacturers like EnergyOT could incorporate Server Density as a state-of-the-art monitoring function that works out of the box.

    If you are an enthusiast and hobbyist, Server Density makes it easier to monitor your devices, and yes, stay on top of your energy consumption too.

    What about you? Are you monitoring anything other than websites and servers?

  2. Why we use NGINX

    Wait, another article about NGINX?

    You bet. Every single Server Density request goes through NGINX. It’s such an important part of our infrastructure that a single post couldn’t possibly do it justice. Want to know why we use NGINX? Keep reading.

    Why is NGINX so important?

    Because it’s part of the application routing fabric. Routing, of course, is a highly critical function because it enables load balancing. Load balancing is a key enabler of highly available systems. Running those systems requires having “one more of everything”: one more server, one more datacenter, one more zone, region, provider, et cetera. You just can’t have redundant systems without a load balancer to route requests between redundant units.

    But why choose NGINX and not something else, say Pound?

    We like Pound. It is easy to deploy and manage. There is nothing wrong with it. In fact it’s a great option, as long as your load balancing setup doesn’t have any special requirements.

    Special requirements?

    Well, in our case we wanted to handle WebSocket requests. We needed support for the faster SPDY and HTTP/2 protocols. We also wanted to use the Tornado web framework. So, after spending some time with Pound, we eventually settled on NGINX open source.

    NGINX is an event-driven web server with built-in proxying and load balancing features. It has native support for the FastCGI and uWSGI protocols. This allows us to run Python WSGI apps in fast application servers such as uWSGI.

    NGINX also supports third party modules for full TCP/IP socket handling. This allows us to pick and mix between our asynchronous WebSocket Python apps and our upstream Node.js proxy.
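
    To make that concrete, here is an illustrative NGINX fragment (not our production config) showing both patterns side by side: a Python WSGI app served over the native uwsgi protocol, and a WebSocket location proxied to an upstream pool. Names, paths and addresses are placeholders:

      upstream websocket_backend {
          server 10.0.0.11:8001;
          server 10.0.0.12:8001;
      }

      server {
          listen 443 ssl http2;                     # "spdy" instead of "http2" on older builds
          server_name example.serverdensity.io;     # placeholder
          ssl_certificate     /etc/nginx/ssl/example.crt;  # placeholder
          ssl_certificate_key /etc/nginx/ssl/example.key;  # placeholder

          # Python WSGI app served by uWSGI over the native uwsgi protocol
          location / {
              include uwsgi_params;
              uwsgi_pass unix:/run/uwsgi/app.sock;
          }

          # WebSocket traffic proxied to the asynchronous backends
          location /ws/ {
              proxy_pass http://websocket_backend;
              proxy_http_version 1.1;
              proxy_set_header Upgrade $http_upgrade;
              proxy_set_header Connection "upgrade";
              proxy_read_timeout 300s;
          }
      }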

    What’s more, NGINX is fully deployable and configurable through Puppet.

    So, how do you deploy NGINX? Do you use Puppet?

    Indeed, we do. In fact we’ve been using Puppet to manage our infrastructure for several years now. We started by building manifests to match the setup of our old hosting environment at Terremark, so we could migrate to Softlayer.

    In setting up Puppet, we needed to choose between writing our own manifests and reaching out to the community. As advocates of reusing existing components (versus building your own), a visit to the Forge was inevitable.

    Why-we-use-NGINX-Forge

    We run Puppet Enterprise using our own manifests (the Puppet master pulls them from our GitHub repo). We also make extensive use of Puppet Console and Live Management to trigger transient changes such as failover switches.

    Before we continue, a brief note on Puppet Live Management.

    Puppet Live Management: It’s Complicated… and Deprecated

    Puppet Live Management allows admins to change configurations on the fly, without changing any code. It also allows them to propagate those changes to a subset of servers. Live Management is a cool feature.

    Alas, last year Puppet Labs deprecated this cool feature. Live Management doesn’t show up in the enterprise console any more. Thankfully, the feature is still there but we needed to use a configuration flag to unearth and activate it again (here is how).

    Why-we-use-NGINX-Console

    Our NGINX Module

    Initially, we used the Puppet Labs nginx module. Unnerved by the lack of maintenance updates, though, we decided to fork James Fryman’s module (Puppet Labs’ NGINX module was a fork of James Fryman’s module anyway).

    We started out by using a fork and adding extra bits of functionality as we went along. In time, as James Fryman’s module continued to evolve, we started using the (unforked) module directly.

    Why we use NGINX with multiple load balancers

    We used to run one big load balancer for everything, i.e. one load-balancing server handling all services. In time, we realised this was not optimal. If requests for one particular service piled up, the congestion would often spill over and affect other services too.

    So we decided to go with many smaller load balancer servers, preferably one for each service.

    All our load balancers have the same underlying class in common. Depending on the specific service they route, each load balancer class will have its own corresponding service configuration.
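
    As an illustrative sketch of that layout (class names, upstream members and hostnames are made up, and the resource types shown follow the jfryman/puppet-nginx module, whose parameters vary between versions):

      # Shared base class: installs NGINX with our common settings
      class loadbalancer::base {
        class { 'nginx': }
      }

      # One small class per service, layering its own upstream and vhost on the base
      class loadbalancer::websockets {
        include loadbalancer::base

        nginx::resource::upstream { 'websocket_pool':
          members => ['10.0.0.11:8001', '10.0.0.12:8001'],
        }

        nginx::resource::vhost { 'ws.example.com':
          proxy => 'http://websocket_pool',
        }
      }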

    Trigger transient changes using the console

    One of the best aspects of the Live Management approach is that it helps overcome the lack of control inherent in the NGINX open source interface (NGINX Plus offers more options).

    Why-we-use-NGINX-Enterprise-Console

    In order for NGINX configuration file changes to take effect, NGINX needs to reload. For example, to remove one node from the load balancer rotation, we would need to edit the corresponding configuration file and then trigger an NGINX reload.

    Using the Puppet Console, we can change any node or group parameters and then trigger the node to run. Puppet will then reload NGINX. What’s cool is that in-flight connections are not terminated by the reload.

    Summary

    NGINX caters to a microservice architecture composed of small, independent processes. It can handle WebSockets and it supports SPDY and HTTP/2. It’s also compatible with the Tornado web framework. Oh, and it has native support for the FastCGI and uWSGI protocols.

    Over the years, we’ve tried to fine-tune how we use Puppet to deploy NGINX. As part of that, we use Puppet Console and Live Management features quite extensively.

    Is NGINX part of your infrastructure? Please share your own use case and suggestions in the comments below.

  3. Improving the Service Resilience of your App

    Last week, a customer sent a note asking how we make sure Server Density remains available around the clock. We love getting those questions. In fact, we take every opportunity to discuss service resilience in great detail.

    As you’d expect, there are two areas that govern how reliably your cloud app runs. The application itself is one. How long does your code operate without failures? How is it architected, and does it scale gracefully?

    Equally important, though, is the infrastructure your app is sitting on. What happens if a VM is taken down? How do you handle datacenter failovers? What systems do you have in place?

    If your app is to deliver on its bold uptime metrics, you need a good handle on both application and infrastructure resilience. Let’s take a look at how we do those at Server Density.

    Service Resilience: the Importance of Application Quality

    At its very core, the application itself needs to be solid, i.e. any failures should be handled internally without causing outages (normal vs. catastrophic failure).

    Better software quality, by definition, means lower incident rates. To encourage application resilience, it makes sense to expose everyone—including devs and designers—to production issues. In light of that, here at Server Density we run regular war games with everyone who participates in our on-call rotation.

    We also recommend running regular Chaos Monkeys. Every single bug we found as a result of chaos events was in our own code. Most had to do with how our app interacts with databases in failover mode, and with libraries we’ve not written. Setting Chaos Monkeys loose on our infrastructure—and dealing with the aftermath—helps us strengthen our app.

    The Importance of Systems

    Infrastructure resilience is all about ensuring that individual component failures do not affect our overall infrastructure.

    The most obvious way to minimise service interruptions is by having “one more of everything” (for example, in a cluster of 4 servers handling current load, we would need 5). But you also need systems. When faced with an outage, there should be little doubt as to what needs to happen. Doubts cause delays and errors. Focus should be on executing an established set of steps (see checklists). Any downtime should therefore amount to no more than the time it takes to fail over.

    So, let’s start from the most benign types of failures and gradually up the ante. Let’s take a look at how our infrastructure copes as the stakes get higher.

    Regular Server Maintenance – Downtime: None

    Most of our servers go offline at least once every 30 days. We do full upgrades virtually every month. At the very least, these involve kernel (OS) updates, which require a reboot. By having regular, scheduled server downtime, we get to flex our failover muscle and get better at routing traffic.

    Power and Networking Failures – Downtime: None

    Our provider, SoftLayer, has redundant power supplies: not just the electricity grid but also on-site generators. Should there be a power outage, the first thing to kick in is the UPS with its batteries. That buys enough time for the provider to start their generators. To cope with longer power outages, most providers keep fuel stocks onsite.

    Datacenters should also provide redundant networking paths: not only the physical cabling, but the networking equipment too. Redundancy throughout. We’ve worked with Softlayer to understand their architecture so we can place our servers and virtual instances across different failure paths.

    Zone Failures – Downtime: None

    Washington is our primary region. Within that region our workloads are split across two different zones. Each zone is a separate physical facility, and in our case they’re approximately 14 miles apart. This protects against external unforeseen events, like excavators cutting a fiber cable and taking out all networking for the entire facility. We experienced this type of outage 3 years ago. In response, we introduced an extra layer of redundancy within the region.

    If our primary datacenter fails for whatever reason, we now have a second, identical and “hot” datacenter within Washington. That datacenter is fully up to date (current) and ready to go with zero delay. Virtually every provider now offers multiple zones per region; it’s become a standard.

    service-resilience-map

    Regional Failures – Downtime: approx. 2 hours

    In October 2012, Washington DC declared a state of emergency due to Hurricane Sandy. To preempt any service interruption, we decided to fail over to our secondary datacenter in San Jose (SJC).

    It’s important to note that we don’t sit and wait for black swans to happen before testing our readiness. We schedule and test regional failovers on a regular basis (last test was in 2015) so that we’re not taken by surprise in case of a “force majeure”.

    We run our database replication “live” (replication happens within a few seconds) but keep all the other instances as snapshot templates to avoid running idle servers, wasting money and natural resources. However, this means there is a lag between pressing the button and everything being up. Most failovers are triggered by our own choice, e.g. playing it safe with weather warnings, so this doesn’t matter. But if both our zones were to fail, full recovery would take some time (5-10 minutes to boot up, 10-15 minutes for automatic procedures to complete, 60 minutes for a human to run through the failover checklist).

    Sustaining geo-level redundancy involves enormous amounts of duplicate systems. At some point, we asked ourselves: is there any way we could make better use of this capacity? After reviewing our options, we decided to move our secondary from San Jose to Toronto, which is geographically nearer. That will help us reduce the latency between the two datacenters from 70ms to 20ms. Minimising the latency means we can make better (and more dynamic) use of all that extra capacity in situations where a complete failover is not necessary. We can run both locations in active mode, achieving the same results as the zone setup but with enough geographic distance to avoid localised events such as weather.

    Provider Failure – Downtime: 1 day

    Are you keeping count? That’s 3 redundant datacenters so far (2 zones in Washington, 1 zone in Toronto). You’d think any more redundancy borders on, dunno, too much? How could datacenters in separate regions all fail at the same time?

    Well, they can. And they have. In November 2014, Azure suffered a global outage that lasted two hours. (Unlike Azure’s interconnected architecture, SoftLayer and AWS facilities are completely isolated, so this specific type of global outage should be impossible).

    There are several ways we could mitigate this risk. We could have another provider with “hot” infrastructure in place, that could pick up our entire workload in a near instant manner. The folks at Auth0 can failover from Azure to AWS in 60 seconds.

    The other option would be to align our fate with our provider’s and do nothing. The risk here is obvious. Should our provider face a long term service interruption, we would run out of options.

    We decided to opt for something in the middle. Instead of having a “hot” provider on standby, we have put in place a disaster recovery process using MongoDB’s infrastructure. This involves having a live backup using MongoDB’s Cloud Backup service. We have built our own restore and verify service which runs twice a day to ensure that our backups actually work, and stores a copy of the backup on Google’s storage (versioned so we retain copies going back several weeks).
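
    Our actual service is more involved, but a minimal sketch of the restore-and-verify idea could look like this. Fetching the backup from MongoDB’s Cloud Backup service is left out, and the paths, ports, database/collection names and bucket are assumptions:

      import datetime
      import subprocess

      from pymongo import MongoClient

      BACKUP_DUMP_DIR = "/backups/latest"        # where the downloaded backup dump lives
      SCRATCH_MONGO = "localhost:27018"          # throwaway mongod used only for verification
      GCS_BUCKET = "gs://example-backup-bucket"  # placeholder bucket


      def restore_and_verify():
          # 1. Restore the dump into a scratch mongod instance
          subprocess.check_call(["mongorestore", "--host", SCRATCH_MONGO,
                                 "--drop", BACKUP_DUMP_DIR])

          # 2. Sanity-check that the restored data actually contains something usable
          client = MongoClient("mongodb://%s" % SCRATCH_MONGO)
          client.admin.command("ping")
          if client["serverdensity"]["devices"].find_one() is None:
              raise RuntimeError("Restored backup looks empty; alert a human")

          # 3. Ship a dated copy of the verified dump to Google's storage
          stamp = datetime.datetime.utcnow().strftime("%Y%m%d%H%M")
          subprocess.check_call(["gsutil", "-m", "cp", "-r", BACKUP_DUMP_DIR,
                                 "%s/%s/" % (GCS_BUCKET, stamp)])


      if __name__ == "__main__":
          restore_and_verify()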

    A full rebuild would obviously not be instantaneous, but the time it takes is significantly reduced by using Puppet to manage all our systems. We wouldn’t have to do as much by hand because we can easily replicate our existing setup.

    Write Good Postmortems

    Ultimately, there is no such thing as 100% availability. When sufficiently elaborate systems begin to scale, it’s only a matter of time before some sort of failure happens. There is no way around that.

    Writing good postmortems when systems are back online helps restore customer confidence. It demonstrates that someone is investing time in their product. That they care enough to sit down and think things through.

    Summary

    Downtime is expensive in more ways than one. Service interruptions can lead to lost revenue, hurt your productivity, and tarnish your reputation.

    Ultimately, your availability metrics are an indication of quality. How solid is your application infrastructure? How solid is your failover routine? Furthermore, how solid are your communication, customer care, and postmortems?

    Attaining 100% availability might be an impossible feat. How well you prepare, plan and execute around it, is not.

  4. Cluster Optimisation: Hunting Down CPU Differences

    Notice any unusual activity in your cluster?

    The first thing to do is look for any subtle differences between the participating servers. The obvious place to start is—you guessed it—software.

    Given the wellspring (anarchy) of apps sitting on most servers, the task of manually tracking down any deltas in software versions can be onerous. Thankfully, the growing adoption of modern config management tools like Puppet and Chef has made this exercise much easier.

    Once software inconsistencies are ruled out, the next step is to look at hardware. It’s a classic case of playing detective, i.e. searching for clues and spotting anything out of the ordinary in your infrastructure.

    Here is how we do this at Server Density.

    Cluster Optimisation: The Process

    We do weekly reviews of several performance indicators across our entire infrastructure.

    This proactive exercise helps us spot subtle performance declines over time. We can then investigate any issues, schedule time for codebase optimisations and plan for upgrades.

    Since we use Server Density to monitor Server Density, those reviews are easy. It only takes a couple of minutes to perform this audit, using preset time intervals on our performance dashboards.

    The Odd Performance Values

    It was during one of those audits—exactly this time last year—that we observed a particularly weird load profile. Here is the graph:

    Initial cluster load

    This is a 4-server queue-processing cluster which runs on Softlayer with dedicated hardware (SuperMicro, quad-core Xeon 1270s, 8GB RAM). We’d just finished upgrading those seemingly identical servers.

    The entire software stack is built from the same source using Puppet. Our deploy process ensures all cluster nodes run exactly the same versions. So why was one of the servers exhibiting a lower load for the exact same work? We couldn’t justify the difference.

    With the software element taken care of (config management), we turned our attention to hardware and got in touch with Softlayer support.

    “There are no discernible differences between the servers,” was their response.

    The Plot Thickens

    Feeling uneasy about running servers that should behave the same but don’t, we decided to persevere with our investigation. Soon we discovered another, more worrying, issue: packet loss on the 3 servers with the higher load.

    Initial cluster packet loss

    Armed with those screenshots, we went straight back to Softlayer support.

    They were quite diligent and “looked at the switch/s for these servers, network speed, connections Established & Waiting, apache/python/tornado process etc…”

    Even so, they came back empty-handed. Except… for a subtle difference in the cluster hardware:

    “all of the processors are Xeon 1270 Quadcores, -web4 is running V3 and is the newest; -web2 and -web3 is running V2; -web1 is running V1”.

    Smoking Gun

    When ordering new servers, we get to pick the CPU type, but not the CPU version. As it turns out, the datacenter team provides whatever CPU version they happen to have “in stock”.

    We now knew what to look for.

    After some further inspections, we spotted several potentially interesting differences in CPU versions throughout our infrastructure. We decided to eliminate those differences and see what happened.
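
    One quick way to surface that kind of difference (not necessarily how we did it) is to compare the CPU model string each node reports; run something like this everywhere and diff the output:

      import platform


      def cpu_model():
          """Return the CPU model string reported by the kernel (Linux only)."""
          with open("/proc/cpuinfo") as f:
              for line in f:
                  if line.startswith("model name"):
                      # e.g. "Intel(R) Xeon(R) CPU E3-1270 v3 @ 3.50GHz" -- the v1/v2/v3
                      # suffix is exactly the kind of difference we were chasing
                      return line.split(":", 1)[1].strip()
          return "unknown"


      if __name__ == "__main__":
          print("%s: %s" % (platform.node(), cpu_model()))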

    Softlayer is good at accommodating such special requests and we had no difficulty in getting this one through.

    The following graph shows the replacement of -web1 and then -web2 and -web3. Can you spot the improvement?

    cluster load

    Here is a similar plot for cluster packet loss:

    cluster packet loss

    It could be that the CPU version was incompatible with the hardware drivers, or any of a host of other issues obscured beneath that CPU version delta. Switching all the servers to a consistent CPU version solved the problem. All packet loss disappeared and performance equalised.

    Summary – What We Learned

    Consistency within clusters is a good thing to have. Specifically:

    1. Even subtle details—configurations, version settings and other indicators that easily go unnoticed—can have a measurable impact on our infrastructure.
    2. Using modern config management tools allows us to eliminate any software discrepancies, and do it quickly.
    3. Scheduling regular proactive reviews of our infrastructure is a fantastic opportunity to spot any lurking issues, plan for codebase optimisations and decide upon hardware upgrades.

  5. High availability with Pound load balancer

    Following our migration to SoftLayer and their release of global IPs, we started to implement our multi-datacenter plan by replacing dedicated (static IP) load balancers with Pound and a global IP. Pound in itself is very easy to deploy and manage. There’s a package for Ubuntu 12.04, the distribution we are using for the load balancer servers, so we had it running out of the box in no time, at least when what you’re load balancing doesn’t have any special requirements.

    Pound load balancer

    Unfortunately, this was not the case for our Server Density monitoring web app, which requires a somewhat longer cookie header. The Pound Ubuntu package is compiled with the default MAXBUF = 4096, but we needed roughly twice that to allow our header through. This is a bug we didn’t discover until testing Pound, because our hardware load balancers didn’t have this limit, but it highlights something to fix in the next version of Server Density. We don’t particularly like recompiling distribution packages, mostly because it diverges from general usage and those changes will eventually attract suspicion if a problem arises with that particular package.

    Presented with no other option that wouldn’t break existing customer connections (the cookie is sent before we can truncate it), we decided to start a PPA for our modified Pound package. This carries two advantages we appreciate: it’s shared with the world, and we can make use of Launchpad’s build capabilities.

    Pound Load Balancer Configuration

    Besides the previous application-specific change, our Pound configuration is quite simple and managed from Puppet Enterprise, hence the Ruby ERB template syntax ahead. From the defaults, we changed:

    • Use of the dynamic rescaling code. Pound will periodically try to modify the back-end priorities in order to equalise the response times from the various back-ends. Although our backend servers are all exactly the same, they are deployed as virtualised instances on a “public” cloud at Softlayer, so each can independently suffer a performance impact from its host.
      DynScale 1
    • A “redirect service”. Anything not *serverdensity.com* is immediately redirected to http://www.serverdensity.com and doesn’t even hit the back-end servers.
      Service "Direct Access"
       HeadDeny "Host: .*serverdensity.com.*"
      Redirect "http://www.serverdensity.com"
      End
    • Each back-end relies on a co-located mongos router process to reach our Mongo data cluster. We use the HAport configuration option to make sure the back-end is taken out of rotation when there is a problem with the database, even if the webserver is still responding on port 80.
      HAport <%= hAportMongo %>
    • Finally, because we have a cluster of load balancers, we needed to be able to trace which load balancer handled the request. For this we add an extra header.
     AddHeader "X-Load-Balancer: <%= hostname %>"

    Redundancy and automated failover

    The SoftLayer Global IP (GIP) is the key to giving us failover capability with minimal lost connections to our service. By deploying two load balancers per data center behind a single GIP, we can effectively route traffic to either load balancer.

    We deploy the load balancers in an active-standby configuration. While the active is targeted by the GIP and receives all traffic, the standby load balancer monitors the active’s health. If the active load balancer stops responding (ICMP, HTTP or HTTPS), the GIP is automatically re-routed to the standby load balancer using the SoftLayer API. An alert is then raised through PagerDuty, also using their API, so the on-call engineer can respond. There’s no automatic recovery attempt, to avoid flapping and to allow investigation of the event.
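
    To give a flavour of the standby’s job, here is a stripped-down sketch of that monitoring loop. The health check is reduced to an HTTPS probe, the GIP re-route is stubbed out (in SoftLayer’s documentation it is the route method on the SoftLayer_Network_Subnet_IpAddress_Global service, and the exact client call depends on your account and IP IDs), and the PagerDuty integration key is a placeholder:

      import time

      import requests

      ACTIVE_URL = "https://lb-active.example.com/healthcheck"  # placeholder health endpoint
      PAGERDUTY_KEY = "YOUR_SERVICE_KEY"                        # placeholder integration key
      CHECK_INTERVAL = 10            # seconds between probes
      FAILURES_BEFORE_FAILOVER = 3   # consecutive failures before we act


      def reroute_global_ip():
          # Call the SoftLayer API here to point the Global IP at this (standby)
          # load balancer; the exact client call is omitted and depends on your
          # account, the GIP record ID and the standby's IP address.
          pass


      def page_on_call(description):
          # PagerDuty generic events API (v1): trigger an incident for the on-call engineer
          requests.post(
              "https://events.pagerduty.com/generic/2010-04-15/create_event.json",
              json={"service_key": PAGERDUTY_KEY,
                    "event_type": "trigger",
                    "description": description},
              timeout=10,
          )


      def main():
          failures = 0
          while True:
              try:
                  ok = requests.get(ACTIVE_URL, timeout=5).status_code == 200
              except requests.RequestException:
                  ok = False

              failures = 0 if ok else failures + 1
              if failures >= FAILURES_BEFORE_FAILOVER:
                  reroute_global_ip()
                  page_on_call("Active load balancer down; GIP re-routed to standby")
                  break  # no automatic recovery: a human investigates from here

              time.sleep(CHECK_INTERVAL)


      if __name__ == "__main__":
          main()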

    Next

    For the upcoming version of Server Density, we’ll be deploying nginx because we’ll be using Tornado. Another blog post will be in order by then.

    We’ll also be presenting the integration of Pound and this automation with Puppet Enterprise at PuppetConf 2012 on September 27th and 28th.
