Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. Saving $500k per month buying your own hardware: cloud vs co-location

    29 Comments

    Editor’s note: This is an updated version of an article originally published on GigaOm on 07/12/2013.

    A few weeks ago we compared cloud instances against dedicated servers. We also explored various scenarios where it can be significantly cheaper to use dedicated servers instead of cloud services.

    But that’s not the end of it. Since you are still paying on a monthly basis, projecting the costs out over one to three years means you end up paying much more than it would have cost to purchase the hardware outright. This is where buying and co-locating your own hardware becomes a more attractive option.

    Putting the numbers down: cloud vs co-location

    Let’s consider the case of a high throughput database hosted on suitable machines: cloud instances, dedicated servers, and a purchased/co-located server. For dedicated cloud instances, Amazon has a separate fee structure, while on Rackspace you effectively have to get their largest instance type.

    So, calculating those costs out for our database instance on an annual basis would look like this:

    Amazon EC2 c3.4xlarge dedicated heavy utilization reserved
    Pricing for 1-year term
    $4,785 upfront cost
    $0.546 effective hourly cost
    $2 per hour, per region additional cost
    $4,785 + ($0.546 + $2.00) * 24 * 365 = $27,087.96

    Rackspace OnMetal I/O
    Pricing for 1-year term
    $2.46575 hourly cost
    $0.06849 additional hourly cost for managed infrastructure
    Total Hourly Cost: $2.53424
    $2.53424 * 24 * 365 = $22,199.94

    Softlayer

    Given the annual cost of these instances, it makes sense to consider dedicated hardware, where you rent the resources and the provider is responsible for upkeep. Here at Server Density we use SoftLayer, now owned by IBM, and have dedicated hardware for our database nodes. IBM is becoming very competitive with Amazon and Rackspace, so let’s add a similarly spec’d dedicated server from SoftLayer, at list prices. To match a similar spec we can choose the Monthly Bare Metal Dual Processor (Xeon E5-2620 – 2.0GHz, 32GB RAM, 500GB storage), which costs $491/month, or $5,892/year.

    Dedicated servers summary

    Rackspace Cloud: $22,199.94
    Amazon EC2: $27,087.96
    SoftLayer Dedicated: $5,892

    Let’s also assume purchase and colocation of a Dell PowerEdge R430 (two 8-core processors, 32GB RAM, 1TB SATA disk drive).

    The R430 one-time list price is $3,774.45 – some 36% off the price of the SoftLayer server at $5,892/year. Of course there might be some more usage expenses such as power and bandwidth, depending on where you choose to colocate your server. Power usage in particular is difficult to calculate because you’d need to stress test the server, figure out the maximum draw and run real workloads to see what your normal usage is.

    Running our own hardware

    We have experimented with running our own hardware in London. To draw some conclusions we used our 1U Dell server, which has specs very similar to the Dell R430 above. Under everyday usage the server draws close to 0.6A; stress testing it with everything maxed out took that to 1.2A.

    Hosting this with the ISP who supplies our office works out at $140/month or $1,680/year. This makes the total annual cost figures look as follows:

    Rackspace Cloud: $22,199.94
    Amazon EC2: $27,087.96
    SoftLayer Dedicated: $5,892
    Co-location: $5,454.45 in year one, then $1,680/year

    With Rackspace, Amazon and SoftLayer you’d have to pay the above price every year. With co-location, on the other hand, after the first year the annual cost drops to $1,680 because you already own the hardware. What’s more, the hardware can also be considered an asset yielding tax benefits.
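
    To make the multi-year effect concrete, here is a quick JavaScript sketch that projects the list prices above over three years. It ignores power, bandwidth, hardware refresh and any price changes, so treat it as an illustration of the arithmetic rather than a budget:

    // Annual costs from the table above (USD)
    var annualCosts = {
        'Rackspace Cloud': 22199.94,
        'Amazon EC2': 27087.96,
        'SoftLayer dedicated': 5892
    };

    // Co-location: one-off hardware purchase plus annual hosting with our ISP
    var colo = { hardware: 3774.45, hostingPerYear: 1680 };

    function projectedCost(name, years) {
        if (name === 'Co-location') {
            return colo.hardware + colo.hostingPerYear * years;
        }
        return annualCosts[name] * years;
    }

    ['Rackspace Cloud', 'Amazon EC2', 'SoftLayer dedicated', 'Co-location'].forEach(function (name) {
        console.log(name + ' over 3 years: $' + projectedCost(name, 3).toFixed(2));
    });
    // Rackspace Cloud over 3 years: $66599.82
    // Amazon EC2 over 3 years: $81263.88
    // SoftLayer dedicated over 3 years: $17676.00
    // Co-location over 3 years: $8814.45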

    Large scale implementation

    While we were still experimenting on a small scale, I spoke to Mark Schliemann, who back then was VP of Technical Operations at Moz.com. They had been running a hybrid environment and had recently moved the majority of it off AWS and into a colo facility with Nimbix. Still, they kept using AWS for processing batch jobs (the perfect use case for elastic cloud resources).

    Moz worked on detailed cost comparisons to factor in the cost of the hardware leases (routers, switches, firewalls, load balancers, SAN/NAS storage & VPN), virtualization platforms, misc software, monitoring software/services, connectivity/bandwidth, vendor support, colo, and even travel costs. Using this to calculate their per-server costs showed they would spend $3,200/month on AWS vs. $668/month on their own hardware. Projected out over one year, that’s $8,096 on their own infrastructure vs. $38,400 at AWS.

    Optimizing utilization is much more difficult on the cloud because of the fixed instance sizes. Moz found they were much more efficient running their own systems virtualized because they could create the exact instance sizes they needed. Cloud providers often increase CPU allocation alongside memory whereas most use cases tend to rely on either one or the other. Running your own environment allows you to optimize this balance, and this was one of the key ways Moz improved their utilization metrics. This has helped them become more efficient with their spending.

    Here is what Mark told me: “Right now we are able to demonstrate that our colo is about 1/5th the cost of Amazon, but with RAM upgrades to our servers to increase capacity we are confident we can drive this down to something closer to 1/7th the cost of Amazon.”

    Co-location has its benefits, once you’re established

    Co-location looks like a winner but there are some important caveats:

    • First and foremost, you need in-house expertise because you need to build and rack your own equipment and design the network. Networking hardware can be expensive, and if things go wrong your team needs to have the capacity and skills to resolve any problems. This could involve support contracts with vendors and/or training your own staff. However, it does not usually require hiring new people because the same team that deals with cloud architecture, redundancy, failover, APIs, programming, etc, can also work on the ops side of things running your own environment.
    • The data centers chosen have to be easily accessible 24/7 because you may need to visit at unusual times. This means having people on-call and available to travel, or paying remote hands at the data center high hourly fees to fix things.
    • You have to purchase the equipment upfront, which means a large capital outlay (although this can be mitigated by leasing).

    So what does this mean for the cloud? On a pure cost basis, buying your own hardware and colocating is significantly cheaper. Many will say that the real cost is hidden in staffing requirements but that’s not the case because you still need a technical team to build your cloud infrastructure.

    At a basic level, compute and storage are commodities. The way the cloud providers differentiate is with supporting services. Amazon has been able to iterate very quickly on innovative features, offering a range of supporting products like DNS, mail, queuing, databases, auto scaling and the like. Rackspace was slower to do this but has already started to offer similar features.

    The flexibility of the cloud needs to be highlighted again too. Once you buy hardware you’re stuck with it for the long term, but the point of the example above was that you had a known workload.

    Considering the hybrid model

    Perhaps a hybrid model makes sense, then? I believe this is a good middle ground, and it is a model I saw Moz making good use of. You can service your known workloads with dedicated servers and then connect to the public cloud when you need extra flexibility. Data centers like Equinix offer Direct Connect services into the big cloud providers for this very reason, and SoftLayer offers its own public cloud to go alongside dedicated instances. Rackspace is placing bets in all camps with public cloud, traditional managed hosting, a hybrid of the two, and support services for OpenStack.

    And when should you consider switching? Nnamdi Orakwue, Dell VP of Cloud until late 2015, said companies often start looking at alternatives when their monthly AWS bill hits $50,000, but is even this too high?

  2. Datacenter efficiency and its effect on Humans

    Leave a Comment

    Did you know?

    About 2 percent of world energy expenditure goes into datacenters. That’s according to Anne Curie, co-founder of Microscaling Systems, who spoke at the most recent HumanOps event here in London.

    That 2 percent puts datacenters on par with the aviation industry which, as Curie points out, very publicly gets plenty of flak for being a serious polluter—even though the aviation industry is far more efficient than the datacenter industry average.

    Curie starts her talk with some good news. To a large extent, all the tech progress achieved over the last 20 years went into improving the lives of developers and ops people alike. The cloud takes away the pain of deploying new machines, while higher-level languages like Ruby and Python make development exponentially quicker and less painful.

    We optimize for speed of deployment and we optimize for developer productivity. We use an awful lot of Moore’s Law gains in order to do that.

    Anne Curie

    Enter datacenter efficiency

    But there is a caveat to all that progress. Suddenly all of that motivation you had for using your servers more efficiently is gone because somebody else is maintaining those servers for you. You don’t have to worry about where they are, you don’t have to lug them, you don’t even have to order them or find space for them.

    Anne Curie offers some fascinating insights on what all this progress means for humans, their systems, and the environment overall.

    Want to find out more? Watch Anne Curie’s talk. And if you want the full transcript (it’s a keeper), go ahead and use the download link right below this post.

    What is HumanOps again?

    HumanOps is a collection of principles that shift our focus away from systems and towards humans. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.

    Alert Costs is one such example. Built right into Server Density, it measures the impact of alerts in actual human hours. Armed with this knowledge, a sysadmin can then look for ways to reduce interruptions, mitigate alert fatigue, and improve everyone’s on-call shift.

    Find out more about Alert Costs, and see you at our next HumanOps event.

  3. Automatic timezone conversion in JavaScript

    4 Comments

    Editor’s note: This is an updated version of an article originally published here on 21/01/2010.

    It’s been a while since JavaScript charts and graphs became the go-to industry norm for data visualization. In fact, we decided to build our own graphing engine for Server Density several years ago, because we needed functionality that was not possible with the Flash charts we used earlier. Plus, it allowed us to customize the experience to better fit our own design.

    Since then we’ve been revamping the entire engine. Our latest charts take advantage of various modern JS features such as toggling line series, pinning extended info and more.


    Switching to a new graphing engine was not a painless journey, of course. JS comes with its own challenges, one of which is automatic timezone conversion.

    Timezones are a pain

    Timezone conversion is one of the issues you should always expect to deal with when building JS applications targeted at clients in varying timezones. Here is what we had to deal with.

    Our new engine supports user timezone preferences. We do all the timezone calculations server-side and pass JSON data to the JavaScript graphs, with the timestamps for each point already converted.

    However, it turns out that the JavaScript Date object does its own client-side timezone conversion based on the user’s system timezone settings. This means that if the default date on the graph is 10:00 GMT and your local system timezone is Paris, then JavaScript will automatically display that as 11:00 local time.

    That behaviour is only correct when the timestamp passed in is in GMT. It becomes a problem when we have already done the timezone conversion server-side: the conversion gets applied twice, first on the server and then again on the client.
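
    A minimal example of the double conversion (assuming a machine whose system timezone is Europe/Paris, which is UTC+1 on this date):

    // 2016-03-21 10:00:00 GMT as a UNIX timestamp (in seconds)
    var timestamp = 1458554400;

    // The Date object works in milliseconds
    var date = new Date(timestamp * 1000);

    // Local accessors apply the system timezone, so on a machine set to
    // Europe/Paris this prints 11: the value has been shifted client-side
    console.log(date.getHours());

    // UTC accessors return the hour exactly as the server sent it: 10
    console.log(date.getUTCHours());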

    We could allow JavaScript to handle timezones and perform all the conversions. However, this would result in broken links, because we use data points to redirect the user to the actual snapshots.

    Snapshots are provided in Unix timestamp format, so even if the JS did the conversion, the snapshot timestamp would still be incorrect. To completely remove the server side conversion and rely solely on JS would require more changes and a lot more JS within the interface.

    UTC-based workaround

    As such, we modified our getDate function to return the values in UTC—at least it is UTC as far as JS is concerned but in reality we’d have already done the conversion on the server. This effectively disables the JavaScript timezone conversion.

    The following code snippet converts the Unix timestamp in JavaScript provided by the server into a date representation that we can use to display in the charts:

    getDate: function(timestamp)
    {
        // Multiply by 1000 because JS works in milliseconds rather than UNIX seconds
        var date = new Date(timestamp * 1000);

        var year = date.getUTCFullYear();
        var month = date.getUTCMonth() + 1; // getUTCMonth() is zero-indexed, so increment to get the calendar month
        var day = date.getUTCDate();
        var hours = date.getUTCHours();
        var minutes = date.getUTCMinutes();
        var seconds = date.getUTCSeconds();

        // Zero-pad single-digit values for display
        month = (month < 10) ? '0' + month : month;
        day = (day < 10) ? '0' + day : day;
        hours = (hours < 10) ? '0' + hours : hours;
        minutes = (minutes < 10) ? '0' + minutes : minutes;
        seconds = (seconds < 10) ? '0' + seconds : seconds;

        return year + '-' + month + '-' + day + ' ' + hours + ':' + minutes + ':' + seconds;
    }

    So this is how we handle timezone with JavaScript for the Server Density graphing engine. What is your experience with timezones in JavaScript?

  4. How GOV.UK Reduced their Incidents and Alerts

    Leave a Comment

    Did you watch last week’s HumanOps video—the one with Spotify? How about the one with Barclays?

    Keep reading, gentle reader: this is not some Friends-episode potboiler joke. We just can’t help getting pumped up about all the amazing HumanOps work happening out there. Independent third-party events are now taking place around the world (San Francisco and Poznan most recently).

    So we decided to host another one closer to home in London.

    The event will take place at the Facebook HQ (get your invite). And for those of you who are not around London in November, fear not. We’ll fill you in right here at the Server Density blog.

    In the meantime, let’s take a look at the recent GOV.UK HumanOps talk. GOV.UK is the UK government’s digital portal. Millions of people access GOV.UK every single day whenever they need to interact with the UK government.

    Bob Walker, Head of Web Operations, spoke about their recent efforts to reduce their incidents and alerts (a core tenet of HumanOps). What follows are the key takeaways from his talk. You can also watch the entire video or download it in PDF format to read in your own time (see right below the article).

    GOV.UK does HumanOps

    After extensive rationalisation, GOV.UK have reached a stage where only six types of incident can trigger an alert (i.e. wake someone up) out of hours. The rest can wait until the next morning.

    GOV.UK mirrors their website across disparate geographical locations and operates a managed CDN at the front. As a result, even if parts of their infrastructure fail, most of their website should remain available.

    Once issues are resolved, GOV.UK carries out incident reviews (their own flavour of postmortems). Bob was keen to reiterate the importance of keeping these blameless.

    Every Wednesday at 11:00AM they test their paging system. The purpose of this exercise is not only to test their monitoring system but also to ensure people have configured their phones to receive alerts!

    Want to find out more? Watch Bob Walker’s talk. And if you want the full transcript, go ahead and use the download link right below this post.

    See you at a HumanOps event!

  5. Spotify Engineering: Making Ops Human

    Leave a Comment

    If you’ve read us for a while, then you’ve probably heard us sing the praises of HumanOps: a set of principles that shifts our focus away from systems and towards humans, in equal measure.

    As it turns out, Server Density is not the only team out there getting excited about HumanOps. We recently wrote about Portia Tung from Barclays and all the exciting things she’s been working on.

    Today we’d like to shift our gaze to Spotify and Francesc Zacarias, one of their lead site availability engineers.

    What follows are the key takeaways from his HumanOps talk. You can watch the entire video (scroll down) or download it in PDF format to read in your own time (see below the article).

    Spotify Engineering goes HumanOps

    According to Francesc, Spotify Engineering is a cross-functional organisation. What this means is that each engineering team includes members from disparate functions. What this also means is that each team fully owns the service it runs.

    Spotify is growing fast. From 100 services running on 1,300 servers in 2011, they now have 1,400 services on 10,000 servers.

    In the past, the Spotify Ops team was responsible for hundreds of services. Given how small their team was (a handful of engineers) and how quickly new services were appearing, their Ops team was turning into a bottleneck for the entire organisation.

    While every member of the Ops team was an expert in their own specific area, there was no sharing between Ops engineers, or across the rest of the engineering organisation.

    You were paged on a service you didn’t know existed because someone deployed and forgot to tell you.

    Francesc Zacarias, Spotify Engineering

    Under the new Spotify structure, developers now own their services. In true devops fashion, building something is no longer separate from running it. Developers control the entire lifecycle including operational tasks like backup, monitoring and, of course, on call rotation.

    This change required a significant cultural shift. Several folks were sceptical about this change while others braced themselves for unmitigated disaster.

    In most instances however it was a case of “trust but verify.” Everyone had to trust their colleagues, otherwise the new structure wouldn’t take off.

    Now both teams move faster.

    Operations are no longer blocking developers as the latter handle all incidents pertaining to their own services. They are more aware of the pitfalls of running code in production because they are the ones handling production incidents (waking up to alerts, et cetera).

    Want to find out more? Check out the Spotify Labs engineering blog. And if you want to take the Spotify talk with you to read at your own pace, just use the download link below.

  6. Cloud vs dedicated pricing – which is cheaper?

    Leave a Comment

    Editor’s note: This is an updated version of an article originally published on GigaOm on 29/11/2013.

    Using cloud infrastructure is the natural starting point for any new project, because unknown requirements are one of its ideal use cases; the other is where you need elasticity to run workloads for short periods at large scale, or to handle traffic spikes. The problem comes months later, when you know your baseline resource requirements.

    As an example, let’s consider a high throughput database like the one we use here at Server Density. Most web applications have a database storing customer information behind the scenes but whatever the project, requirements are very similar – you need a lot of memory and high performance disk I/O.

    Evaluating pure cloud

    Looking at the costs for a single instance illustrates the requirements. In the real world you would need multiple instances for redundancy and replication but for now, let’s just work with a single instance.

    Amazon EC2 c3.4xlarge (30GB RAM, 2 x 160GB SSD storage)

    Pricing:

    $4,350 upfront cost
    
    $0.497 effective hourly cost

    Rackspace I/O1-30 (30GB RAM, 300GB SSD Storage)

    Pricing:

    $0.96/hr + $0.15/hr for managed infrastructure = $1.11/hr

    Databases also tend to exist for a long time and so don’t generally fit into the elastic model. This means you can’t take advantage of the hourly or minute based pricing that makes cloud infrastructure cheap in short bursts.

    So extend those costs on an annual basis:

    Amazon EC2 c3.4xlarge

    $4,350 + ($0.497 * 24 * 365) = $8,703.72

    Rackspace I/O1-30

    $1.11 * 24 * 365 = $9,723.60

    Dedicated Servers/Instances

    Another issue with databases is they tend not to behave nicely if you’re contending for I/O on a busy host, so both Rackspace and Amazon let you pay for dedicated instances. On Amazon this has a separate fee structure and on Rackspace you effectively have to get their largest instance type.

    So, calculating those costs out for our database instance on an annual basis would look like this:

    Amazon EC2 c3.4xlarge dedicated heavy utilization reserved. Pricing for 1-year term:
    $4,785 upfront cost
    $0.546 effective hourly cost
    $2 per hour, per region additional cost
    $4,785 + ($0.546 + $2.00) * 24 * 365 = $27,087.96

    Rackspace OnMetal I/O. Pricing for 1-year term:
    $2.46575 hourly cost
    $0.06849 additional hourly cost for managed infrastructure
    Total Hourly Cost: $2.53424
    $2.53424 * 24 * 365 = $22,199.94

    Consider the dedicated hardware option…

    Given the annual cost of these instances, the next logical step is to consider dedicated hardware where you rent the resources and the provider is responsible for upkeep. Here at Server Density, we use Softlayer, now owned by IBM, and have dedicated hardware for our database nodes. IBM is becoming very competitive with Amazon and Rackspace so let’s add a similarly spec’d dedicated server from SoftLayer, at list prices:

    To match a similar spec we can choose the Monthly Bare Metal Dual Processor (Xeon E5-2620 – 2.0GHz, 32GB RAM, 500GB storage). This costs $491/month or $5,892/year. This is 78.25 percent cheaper than Amazon and 73.46 percent cheaper than Rackspace before you add data transfer costs – SoftLayer includes 500GB of public outbound data transfer per month which would cost extra on both Amazon and Rackspace.
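
    For reference, those percentages come straight from the annual totals above; here is a quick sketch of the sum:

    // Annual totals from the calculations above (USD)
    var amazonAnnual = 27087.96;    // EC2 c3.4xlarge dedicated, 1-year reserved
    var rackspaceAnnual = 22199.94; // OnMetal I/O with managed infrastructure
    var softlayerAnnual = 491 * 12; // bare metal server at $491/month = $5,892

    function percentCheaper(cheaper, baseline) {
        return (100 * (1 - cheaper / baseline)).toFixed(2) + '%';
    }

    console.log(percentCheaper(softlayerAnnual, amazonAnnual));    // 78.25%
    console.log(percentCheaper(softlayerAnnual, rackspaceAnnual)); // 73.46%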

    … or buy your own

    There is another step you can take as you continue to grow — purchasing your own hardware and renting datacenter space i.e. colocation. But that’s the subject of a different post altogether so make sure you subscribe.

  7. Handling timezone conversion with PHP DateTime

    30 Comments

    Editor’s note: This is an updated version of an article originally published on 21/03/2009.

    Back in 2009 we introduced a new location preference feature for Server Density. Users could specify their desired location, and all dates/times were then automatically converted to their timezone (including handling of DST). We did that using the DateTime class introduced with PHP 5.2.

    Your very first timezone-related challenge is dealing with how dates are calculated relative to the server’s default timezone setting. Since PHP 5.1, all the date/time functions create times in the server’s timezone, and as of PHP 5.2 you can set that timezone programmatically using the date_default_timezone_set() function.

    So, if you call the date() function—without specifying a timestamp as the second parameter—and the timezone is set to GMT, then the date will default to the +0000 offset. Equally, if you set the timezone to New York, the offset will be -0500 in winter (-0400 in summer).

    The ins and outs of handling timezone conversion

    If you want the date in GMT, you need to know the offset of the date you’re working with so you can convert it to +0000, if necessary. When would you need to do this? Well, the MySQL TIMESTAMP field type stores the timestamp internally, using GMT (UTC), but always returns it in the server’s timezone. So, for any SELECT statements you will need to convert the timestamp you pass in your SQL to UTC.

    This might sound complicated but you can let the DateTime class do most of the hard work. You first need to get the user to specify their timezone. This will be attached to any DateTime object you create so the right offset can be calculated. The PHP manual provides a list of all the acceptable timezone strings.

    There is also a PHP function, timezone_identifiers_list(), that outputs the list of timezones. Server Density uses this to generate a list of timezones as a drop-down menu for the user to select from.

    DateTimeZone Object

    Once you have the user’s timezone, you can create a DateTimeZone object from it. This will be used for all the offset calculations.

    $userTimezone = new DateTimeZone($userSubmittedTimezoneString);

    To convert a date/time into the user’s timezone, you simply need to create it as a DateTime object:

    $myDateTime = new DateTime('2016-03-21 13:14');

    This will create a DateTime object which has the time specified. The parameter accepts any format supported by strtotime(). If you leave it empty it will default to “now”.

    Note that the time created will be in the default timezone of the server. This is relevant because the calculated offset will be relative to that timezone. For example, if the server is on GMT and you want to convert to Paris time, it will require adding 1 hour. However, if the server is in the Paris timezone then the offset will be zero. You can force the timezone that you want $myDateTime to be in by specifying the second parameter as a DateTimeZone object. If,  for example, you wanted it to be 13:14 on 21st March 2016 in GMT, you’d need to use this code or something similar:

    $gmtTimezone = new DateTimeZone('GMT');
    $myDateTime = new DateTime('2016-03-21 13:14', $gmtTimezone);

    To double check, you can run:

    echo $myDateTime->format('r');

    which would output Mon, 21 Mar 2016 13:14:00 +0000.

    The final step is to work out the offset from your DateTime object to the user’s timezone so you can convert it to that timezone. This is where the $userTimezone DateTimeZone object comes in (because we use the getOffset() method):

    $offset = $userTimezone->getOffset($myDateTime);

    This will return the number of seconds you need to add to $myDateTime to convert it into the user’s timezone. Therefore:

    $userTimezone = new DateTimeZone('America/New_York');
    $gmtTimezone = new DateTimeZone('GMT');
    $myDateTime = new DateTime('2016-03-21 13:14', $gmtTimezone);
    $offset = $userTimezone->getOffset($myDateTime);
    echo $offset;

    This will print -14400, i.e. an offset of minus four hours (because New York is on DST on that date).

    DateTime::add

    As of PHP 5.3, you can also use DateTime::add method to create the new date just by adding the offset. So:

    $userTimezone = new DateTimeZone('America/New_York');
    $gmtTimezone = new DateTimeZone('GMT');
    $myDateTime = new DateTime('2016-03-21 13:14', $gmtTimezone);
    $offset = $userTimezone->getOffset($myDateTime); // -14400 seconds
    $myInterval = DateInterval::createFromDateString((string)$offset . ' seconds');
    $myDateTime->add($myInterval);
    $result = $myDateTime->format('Y-m-d H:i:s');
    echo $result;

    The above would output 2016-03-21 09:14:00, which is the correct conversion from 2016-03-21 13:14 London GMT to New York time.

    So that’s how we handle PHP timezones at Server Density. What’s your approach?

  8. HumanOps Events: Get excited

    Leave a Comment

    The health of your infrastructure is not just about hardware, software, automations and uptime—it also includes the health and wellbeing of your team. Sysadmins are not super humans. They are susceptible to stress and fatigue just like everybody else.

    Now here is the thing.

    A superhero culture exists that places unreasonable expectations on Ops teams. While understandable, this level of expectation is neither helpful nor sustainable.

    In our efforts to highlight the effects of this culture on sysadmins and their productivity, earlier this year we introduced HumanOps, a collection of principles that shift our focus away from systems and towards humans. What’s more, we got everyone together at HumanOps events around the world (watch the talks below).

    We built features so you can build more features

    On May 19th we launched Sparklines for iOS, a great way of translating data and information into something a human can assimilate quickly. With system trends at their fingertips, sysadmins can now quickly decide whether to go home, or whether they can finish dinner before reaching for their laptop.

    Then we shifted our gaze to interruptions. Context switching does not come free for humans, especially for tasks that require focus and problem solving, which is precisely what sysadmins do. When an engineer (or anyone, for that matter) is working on a complex task, the worst thing you can do is expose them to random alerts. It takes an average of 23 minutes to regain intense focus after being interrupted.

    To alleviate that, we introduced Alert Costs, an easy way to measure the impact of incidents and alerts in human hours. This new quantifiable information allows engineering teams to measure and, most importantly, communicate the actual toll of systems on their human operators. And then do something about it.


    But we couldn’t stop there. What about sleep? It’s no surprise that quality of sleep correlates with health, productivity and overall wellbeing; while sleep deprivation is associated with stress, irritability, and cognitive impairment. We cannot mitigate the human cost of on-call work without some sort of objective and relevant sleep metric we can measure. And that’s exactly what we did.

    We launched Opzzz.sh, a tool that correlates sleep efficiency with incidents and then visualises the human cost of on-call work.


    Let’s talk HumanOps

    HumanOps drives a lot of what we do here at Server Density. But HumanOps is not about one feature or even one company. There are several teams out there that are doing amazing work in this space. To bring all those efforts to the forefront, to celebrate, and to push the HumanOps agenda forward, we are organising HumanOps events around the world.

    Over the last few months, for example, we hosted HumanOps meetups in San Francisco and London. In this series of articles, we will go through some of the highlights from the HumanOps efforts of companies like Spotify and Barclays, and organisations like GOV.UK. To give you a taste of what’s to come, here is the introductory video of our recent HumanOps event in London.

    What follows are the key takeaways from the Barclays HumanOps talk.

    Portia Tung – Barclays

    Portia Tung is an Executive Agile Coach at Barclays, and founder of The School of Play.

    In her HumanOps talk she highlighted the importance of play at all stages of human development, in and out of the office. She pointed to research that demonstrates how play makes animals smarter and more adaptable, how it helps them sustain social relationships, and how it supercharges their creativity and innovation.

    Portia Tung touched on things like the recommended daily amount of play (5 to 10 minutes minimum) and things like play deficiency and its effects on employees (hint: not good). Play is a natural and essential human need throughout our life, i.e. not just when we’re young. A productive, collaborative, and happy workplace is a playful workplace.

    Check out Portia’s talk here:

    And if you want the full transcript, then use the download link at the bottom of the article.

    Stay tuned for more

    HumanOps is a collection of principles that shift our focus away from systems and towards humans. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.

    Did you enjoy the talk? Make sure you download and share the beautifully designed transcripts and stay tuned as next week we will be sharing some interesting work Spotify is doing in the HumanOps space.

  9. How to Monitor Apache

    Leave a Comment

    Editor’s note: An earlier version of this article was published on October 2, 2014.

    Apache HTTP Server has been around since 1995 and it is deployed on the majority of web servers out there (although it is losing ground to NGINX).

    As a core constituent of the classic LAMP stack and a critical component of any web architecture, it is a good idea to monitor Apache thoroughly.

    Keep reading to find out how we monitor Apache here at Server Density.

    Enabling Apache monitoring with mod_status

    Most of the tools for monitoring Apache require the use of the mod_status module. This is included by default but it needs to be enabled. You will also need to specify an endpoint in your Apache config:

    <Location /server-status>
    
      SetHandler server-status
      Order Deny,Allow
      Deny from all
      Allow from 127.0.0.1
    
    </Location>
    

    This will make the status page available at http://localhost/server-status on your server (check out our guide). Be sure to enable the ExtendedStatus directive to get full access to all the stats.
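
    For example, you would add the following in the main server configuration (outside the <Location> block above); note that recent Apache 2.4 releases already default this to On once mod_status is loaded:

    ExtendedStatus On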

    Monitoring Apache from the command line

    Once you have enabled the status page and verified it works, you can use the command line tools to monitor the traffic on your server in real time. This is useful for debugging issues and examining traffic as it happens.

    The apache-top tool is a popular method of achieving this. It is often available as a system package (e.g. apt-get install apachetop) but can also be downloaded from source, as it is just a simple Python script.

    Apache monitoring and alerting – Apache stats

    apache-top is particularly good at i) real time debugging and ii) determining what’s happening on your server right now. When it comes to collecting statistics, however, apache-top will probably leave you wanting.

    This is where monitoring products such as Server Density come in handy. Our monitoring agent supports parsing the Apache server status output and can give you statistics on requests per second and idle/busy workers.

    Apache has several process models. The most common one has worker processes sitting idle, waiting to service requests. As more requests come in, more workers are launched to handle them—up to a pre-configured limit. Once past that limit, all requests are queued and visitors experience service delays. So it’s important to monitor not only raw requests per second but idle workers too.

    A good way to configure Apache alerts is by first determining what the baseline traffic of your application is and then setting alerts around it. For example, you can generate an alert if the stats are significantly higher (indicating a sudden traffic spike) or if the values drop significantly (indicating an issue that blocks traffic somewhere).

    You could also benchmark your server to figure out at what traffic level things start to slow down. This can then act as the upper limit for triggering alerts.
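
    As a rough illustration of the baseline idea (a hypothetical sketch: the ±50% bands are arbitrary, and in practice you would configure these thresholds as alerts in your monitoring tool rather than in code):

    // requestsPerSecond would come from the mod_status stats described above
    function checkAgainstBaseline(requestsPerSecond, baseline) {
        if (requestsPerSecond > baseline * 1.5) {
            return 'alert: traffic spike (' + requestsPerSecond + ' req/s vs baseline ' + baseline + ')';
        }
        if (requestsPerSecond < baseline * 0.5) {
            return 'alert: traffic drop, something may be blocking requests';
        }
        return 'ok';
    }

    console.log(checkAgainstBaseline(320, 100)); // spike
    console.log(checkAgainstBaseline(30, 100));  // drop
    console.log(checkAgainstBaseline(95, 100));  // ok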

    Apache monitoring and alerting – server stats

    Monitoring Apache stats like requests per second and worker status is useful in keeping an eye on Apache performance, and indicates how overloaded your web server is. Ideally you will be running Apache on a dedicated instance so you don’t need to worry about contention with other apps.

    Web servers are CPU hungry. As traffic grows Apache workers take up more CPU time and are distributed across the available CPUs and cores.

    CPU % usage is not necessarily a useful metric to alert on, because the values tend to be reported on a per-CPU or per-core basis, whereas you probably have several of each. It’s more useful to monitor the average CPU utilisation across all CPUs or cores.

    Using a tool such as Server Density, you can visualise all this plus configure alerts that notify you when the CPU is overloaded – our guide to understanding these metrics and configuring CPU alerts should help.

    On Linux the CPU average discussed above is abstracted out to another system metric called load average. This is a decimal number rather than a percentage and allows you to view load from the perspective of the operating system i.e. how long processes have to wait for access to the CPU. The recommended threshold for load average therefore depends on how many CPUs and cores you have – our guide to load average will help you understand this further.
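
    As a rough sketch of why core count matters (Node.js, using the standard os module):

    var os = require('os');

    var cores = os.cpus().length;
    var loadAverages = os.loadavg(); // [1, 5, 15] minute load averages

    // A 1-minute load equal to the number of cores means the CPUs are, on
    // average, fully busy; values consistently above that suggest processes
    // are queueing for CPU time.
    console.log('1-minute load per core: ' + (loadAverages[0] / cores).toFixed(2));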

    Monitoring the remote status of Apache

    All those metrics monitor the internal status of Apache and the servers it runs on, but it is important to monitor the end-user experience too.

    You can achieve that by using external status and response time tools. You need to know how well your Apache instance serves traffic from different locations around the world (wherever your customers are). Based on that, you can then determine at what stage you should add more hardware capacity.

    This is very easy to achieve with services like Server Density because of our in-built website monitoring. You can check the status of your public URLs and other endpoints from custom locations and get alerts when performance drops or when there is an outage.

    This is particularly useful when you need graphs to correlate Apache metrics with remote response times, especially if you are benchmarking your servers and want to know when a certain load average starts to affect end-user performance.

  10. Improve your Sleep while On-Call

    Leave a Comment

    We need to talk about sleep.

    Sleep is not the black hole of productivity. It’s not this pesky hurdle we have to mitigate, minimise, or hack. It’s not a sign of weakness, laziness, or stupor. Skimping on sleep will not make us any more successful or rich.

    Money Never Sleeps

    Quite the opposite: Sleep is the pinnacle of productivity.

    During those eight to ten hours our brain continues to spin along, sifting through tasks, reordering thoughts, opening pathways and forging new connections.

    It’s no surprise that quality of sleep correlates with health, productivity and overall wellbeing; while sleep deprivation is associated with stress, irritability, and cognitive impairment.

    As such, sleep is a personal as much as a business affair. A well rested human responds better to personal and business challenges alike.

    The sheer impact sleep has on the quality of our work, team morale and decision making, should give us pause. We should be asking ourselves: How do we minimise stress and fatigue? We should be asking ourselves: How do we safeguard downtime and renewal? We should, but we don’t. We don’t because we have no data, no ammunition to prove what each of us intuitively knows.

    We cannot mitigate the human cost of on-call work without some sort of objective and relevant sleep metric we can measure.

    So that’s what we set out to do.

    Introducing Opzzz

    Finding an objective sleep metric was not hard. There are plenty of decent sleep trackers out there. But we also wanted them to be relevant. In particular, we wanted to quantify the impact on-call work has on our sleep. In other words, we wanted to marry two disparate worlds: the personal insights of sleep quality and the business insights of alerts and incidents.

    That’s what our latest app, Opzzz, is about.

    Opzzz Dashboard

    Fitbit collects sleep information, while PagerDuty and Server Density store information about our incidents. What Opzzz does is connect the dots between sleep efficiency and incidents. By correlating sleep data with on call incidents, we can then illustrate the human cost of on-call work.

    Harry and his developer team built the backend using Python on Google App Engine, while for the front end we used JavaScript and Cycle.js. We collect sleep data using the Fitbit API, and we also have an incident endpoint to the Server Density API (or the PagerDuty API).


    Summary

    HumanOps is a collection of questions, principles and ideas aimed at improving the life of sysadmins. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.

    At Server Density we’ve observed a strong correlation between human and system metrics. Reduced stress leads to fewer errors and escalations. Reduction in incidents and alerts leads to better sleep and reduced stress. Better sleep leads to better time-to-resolution metrics.

    Unfortunately, the effects of on-call work on sleep quality are often ignored. They’re ignored because sleep happens out-of-hours and away from the office. But, most crucially, they are ignored because they’re not measured.

    That’s why we built Opzzz.

    Opzzz correlates sleep efficiency with incidents in a direct and measurable way. As a SaaS company with a global infrastructure, on call is a core constituent of what we do. So we appreciate and feel its effects on sleep quality, on wellbeing and productivity.

    Opzzz is the clearest expression of our vision for HumanOps. And we’re only getting started. So, go ahead, create a free Opzzz account, start graphing incidents with sleep data, and let us know what you think.
