Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. Saving $500k per month buying your own hardware: cloud vs co-location



    Editor’s note: This is an updated version of an article originally published on GigaOm on 07/12/2013.

    A few weeks ago we compared cloud instances against dedicated servers. We also explored various scenarios where it can be significantly cheaper to use dedicated servers instead of cloud services.

    But that’s not the end of it. Since you are still paying on a monthly basis, projecting the costs out over one to three years shows that you end up paying much more than it would have cost to purchase the hardware outright. This is where buying and co-locating your own hardware becomes a more attractive option.

    Putting the numbers down: cloud vs co-location

    Let’s consider the case of a high throughput database hosted on suitable machines across three options: cloud instances, dedicated servers, and a purchased, co-located server. For dedicated instances, Amazon has a separate fee structure, and on Rackspace you effectively have to get their largest instance type.

    So, calculating those costs out for our database instance on an annual basis would look like this:

    Amazon EC2 c3.4xlarge dedicated heavy utilization reserved
    Pricing for 1-year term
    $4,785 upfront cost
    $0.546 effective hourly cost
    $2 per hour, per region additional cost
    $4,785 + ($0.546 + $2.00) * 24 * 365 = $27,087.96

    Rackspace OnMetal I/O
    Pricing for 1-year term
    $2.46575 hourly cost
    $0.06849 additional hourly cost for managed infrastructure
    Total Hourly Cost: $2.53424
    $2.53424 * 24 * 365 = $22,199.94

    SoftLayer

    Given the annual cost of these instances, it makes sense to consider dedicated hardware, where you rent the resources and the provider is responsible for upkeep. Here at Server Density we use SoftLayer, now owned by IBM, and have dedicated hardware for our database nodes. IBM is becoming very competitive with Amazon and Rackspace, so let’s add a similarly spec’d dedicated server from SoftLayer, at list prices. To match a similar spec we can choose the Monthly Bare Metal Dual Processor (Xeon E5-2620 – 2.0Ghz, 32GB RAM, 500GB storage). This works out at $491/month or $5,892/year.

    Dedicated servers summary

    Rackspace Cloud    Amazon EC2    SoftLayer Dedicated
    $22,199.94         $27,087.96    $5,892

    Let’s also assume purchase and colocation of a Dell PowerEdge R430 (two 8-core processors, 32GB RAM, 1TB SATA disk drive).

    The R430 one-time list price is $3,774.45 – some 36% off the price of the SoftLayer server at $5,892/year. Of course there will be additional running expenses such as power and bandwidth, depending on where you choose to colocate your server. Power usage in particular is difficult to calculate because you’d need to stress test the server, figure out the maximum draw and run real workloads to see what your normal usage is.

    Running our own hardware

    We have experimented with running our own hardware in London. To draw some conclusions we used our 1U Dell server, which has specs very similar to the Dell R430 above. With everyday usage our server draws close to 0.6A; stress testing it with everything maxed out drew a total of 1.2A.
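    As a rough illustration of how those readings translate into running costs, here is a short sketch. The 230V mains voltage and the $0.15/kWh electricity price are assumptions made purely for the example, not figures from our hosting arrangement.

    # Back-of-the-envelope annual power cost from measured current draw.
    # The 230V mains voltage and $0.15/kWh price are illustrative assumptions.
    AMPS_TYPICAL = 0.6    # everyday draw we measured
    AMPS_PEAK = 1.2       # draw with everything maxed out
    VOLTS = 230           # assumed UK mains voltage
    PRICE_PER_KWH = 0.15  # assumed electricity price in $

    def annual_energy_cost(amps):
        watts = amps * VOLTS
        kwh_per_year = watts * 24 * 365 / 1000
        return kwh_per_year * PRICE_PER_KWH

    print(round(annual_energy_cost(AMPS_TYPICAL), 2))  # ~181.33 at everyday draw
    print(round(annual_energy_cost(AMPS_PEAK), 2))     # ~362.66 at peak draw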

    Hosting this with the ISP who supplies our office works out at $140/month or $1,680/year. This makes the total annual cost figures look as follows:

    Rackspace Cloud    Amazon EC2    SoftLayer Dedicated    Co-location
    $22,199.94         $27,087.96    $5,892                 $5,454.45 (year 1), then $1,680/year

    With Rackspace, Amazon and SoftLayer you’d have to pay the above price every year. With co-location, on the other hand, after the first year the annual cost drops to $1,680 because you already own the hardware. What’s more, the hardware can also be considered an asset yielding tax benefits.
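    To make the multi-year difference concrete, here is a minimal sketch that projects the list prices above over three years. It assumes the prices stay flat, the one-year reserved terms are simply renewed each year, and it ignores tax treatment, power and bandwidth.

    # Three-year projection of the annual figures above (prices assumed flat).
    YEARS = 3
    HOURS_PER_YEAR = 24 * 365

    rackspace = 2.53424 * HOURS_PER_YEAR * YEARS
    amazon = (4785 + (0.546 + 2.00) * HOURS_PER_YEAR) * YEARS
    softlayer = 5892 * YEARS
    colocation = 3774.45 + 1680 * YEARS  # buy the hardware once, then pay hosting yearly

    print(round(rackspace, 2))  # ~66,599.83
    print(round(amazon, 2))     # ~81,263.88
    print(softlayer)            # 17,676
    print(colocation)           # 8,814.45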

    Large scale implementation

    While we were still experimenting on a small scale, I spoke to Mark Schliemann, who was then VP of Technical Operations at Moz.com. They’d been running a hybrid environment and had recently moved the majority of it off AWS and into a colo facility with Nimbix. They still used AWS for processing batch jobs (the perfect use case for elastic cloud resources).

    Moz worked on detailed cost comparisons to factor in the cost of the hardware leases (routers, switches, firewalls, load balancers, SAN/NAS storage & VPN), virtualization platforms, misc software, monitoring software/services, connectivity/bandwidth, vendor support, colo, and even travel costs. Using this to calculate their per-server costs meant that on AWS they would spend $3,200/month vs. $668/month with their own hardware. Projected out over one year, that came to $8,096 with their own hardware vs. $38,400 at AWS.

    Optimizing utilization is much more difficult on the cloud because of the fixed instance sizes. Moz found they were much more efficient running their own systems virtualized because they could create the exact instance sizes they needed. Cloud providers often increase CPU allocation alongside memory whereas most use cases tend to rely on either one or the other. Running your own environment allows you to optimize this balance, and this was one of the key ways Moz improved their utilization metrics. This has helped them become more efficient with their spending.

    Here is what Mark told me: “Right now we are able to demonstrate that our colo is about 1/5th the cost of Amazon, but with RAM upgrades to our servers to increase capacity we are confident we can drive this down to something closer to 1/7th the cost of Amazon.”

    Co-location has its benefits, once you’re established

    Co-location looks like a winner but there are some important caveats:

    • First and foremost, you need in-house expertise because you need to build and rack your own equipment and design the network. Networking hardware can be expensive, and if things go wrong your team needs to have the capacity and skills to resolve any problems. This could involve support contracts with vendors and/or training your own staff. However, it does not usually require hiring new people because the same team that deals with cloud architecture, redundancy, failover, APIs, programming, etc, can also work on the ops side of things running your own environment.
    • The data centers chosen have to be easily accessible 24/7, because you may need to visit at unusual times. This means having people on-call and available to travel, or paying high hourly fees for remote hands at the data center to fix things.
    • You have to purchase the equipment upfront which means large capital outlay (although this can be mitigated by leasing.)

    So what does this mean for the cloud? On a pure cost basis, buying your own hardware and colocating is significantly cheaper. Many will say that the real cost is hidden in staffing requirements but that’s not the case because you still need a technical team to build your cloud infrastructure.

    At a basic level, compute and storage are commodities. The way the cloud providers differentiate is with supporting services. Amazon has been able to iterate very quickly on innovative features, offering a range of supporting products like DNS, mail, queuing, databases, auto scaling and the like. Rackspace was slower to do this but has already started to offer similar features.

    The flexibility of the cloud needs highlighting again too. Once you buy hardware you’re stuck with it for the long term; the example above assumes a known, steady workload.

    Considering the hybrid model

    Perhaps a hybrid model makes sense, then? This is where I believe the good middle ground lies, and it is a model I saw Moz making good use of. You can service your known workloads with dedicated servers and then connect to the public cloud when you need extra flexibility. Data centers like Equinix offer Direct Connect services into the big cloud providers for this very reason, and SoftLayer offers its own public cloud to sit alongside dedicated instances. Rackspace is placing bets in all camps with public cloud, traditional managed hosting, a hybrid of the two, and support services for OpenStack.

    And when should you consider switching? Nnamdi Orakwue, Dell VP of Cloud until late 2015, said companies often start looking at alternatives when their monthly AWS bill hits $50,000. But is even that threshold too high?

  2. Automatic timezone conversion in JavaScript



    Editor’s note: This is an updated version of an article originally published here on 21/01/2010.

    It’s been a while since JavaScript charts and graphs became the go-to industry norm for data visualization. In fact, we decided to build our own graphing engine for Server Density several years ago, because we needed functionality that was not possible with the Flash charts we used previously. Plus, it allowed us to customize the experience to better fit our own design.

    Since then we’ve been revamping the entire engine. Our latest charts take advantage of various modern JS features such as toggling line series, pinning extended info and more.


    Switching to a new graphing engine was not a painless journey, of course. JS comes with its own challenges, one of which is automatic timezone conversion.

    Timezones are a pain

    Timezone conversion is one of the issues you should always expect to deal with when building JS applications targeted at clients in varying timezones. Here is what we had to deal with.

    Our new engine supports per-user timezone preferences. We do all the timezone calculations server-side and pass JSON data to the JavaScript graphs, with the timestamps for each point already converted.

    However, it turns out that the JavaScript Date object does its own client-side timezone conversion based on the user’s system timezone settings. This means that if a date on the graph is 10:00 GMT and your local system timezone is Paris, then JavaScript will automatically shift the displayed time to 11:00 (Paris time).

    This only works when the timestamp passed in is in GMT, so it presents a problem when we have already done the timezone conversion server-side: the conversion gets applied twice – first on the server, then again on the client.

    We could allow JavaScript to handle timezones and perform all the conversions. However, this would result in broken links, because we use data points to redirect the user to the corresponding snapshots.

    Snapshots are provided in Unix timestamp format, so even if the JS did the conversion, the snapshot timestamp would still be incorrect. To completely remove the server side conversion and rely solely on JS would require more changes and a lot more JS within the interface.

    UTC-based workaround

    As such, we modified our getDate function to return the values in UTC—at least it is UTC as far as JS is concerned but in reality we’d have already done the conversion on the server. This effectively disables the JavaScript timezone conversion.

    The following code snippet converts the Unix timestamp in JavaScript provided by the server into a date representation that we can use to display in the charts:

    getDate: function(timestamp)
    {
        // Multiply by 1000 because JS works in milliseconds instead of UNIX seconds
        var date = new Date(timestamp * 1000);

        var year = date.getUTCFullYear();
        var month = date.getUTCMonth() + 1; // getUTCMonth() is zero-indexed, so add 1 to get the correct month number
        var day = date.getUTCDate();
        var hours = date.getUTCHours();
        var minutes = date.getUTCMinutes();
        var seconds = date.getUTCSeconds();

        // Zero-pad single digit values
        month = (month < 10) ? '0' + month : month;
        day = (day < 10) ? '0' + day : day;
        hours = (hours < 10) ? '0' + hours : hours;
        minutes = (minutes < 10) ? '0' + minutes : minutes;
        seconds = (seconds < 10) ? '0' + seconds : seconds;

        return year + '-' + month + '-' + day + ' ' + hours + ':' + minutes + ':' + seconds;
    }

    So this is how we handle timezone with JavaScript for the Server Density graphing engine. What is your experience with timezones in JavaScript?

  3. Cloud vs dedicated pricing – which is cheaper?



    Editor’s note: This is an updated version of an article originally published on GigaOm on 29/11/2013.

    Using cloud infrastructure is the natural starting point for any new project, because unknown requirements are one of its ideal use cases (the other being elasticity: running workloads for short periods at large scale, or handling traffic spikes). The problem comes months later, when you know your baseline resource requirements.

    As an example, let’s consider a high throughput database like the one we use here at Server Density. Most web applications have a database storing customer information behind the scenes but whatever the project, requirements are very similar – you need a lot of memory and high performance disk I/O.

    Evaluating pure cloud

    Looking at the costs for a single instance illustrates the requirements. In the real world you would need multiple instances for redundancy and replication but for now, let’s just work with a single instance.

    Amazon EC2 c3.4xlarge (30GB RAM, 2 x 160GB SSD storage)

    Pricing:

    $4,350 upfront cost
    
    $0.497 effective hourly cost

    Rackspace I/O1-30 (30GB RAM, 300GB SSD Storage)

    Pricing:

    $0.96/hr + $0.15/hr for managed infrastructure = $1.11/hr

    Databases also tend to exist for a long time and so don’t generally fit into the elastic model. This means you can’t take advantage of the hourly or minute based pricing that makes cloud infrastructure cheap in short bursts.

    So extend those costs on an annual basis:

    Amazon EC2 c3.4xlarge

    $4,350 + ($0.497 * 24 * 365) = $8,703.72

    Rackspace I/O1-30

    $1.11 * 24 * 365 = $9,723.60
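    Both figures follow the same pattern: any upfront fee plus the effective hourly rate for every one of the 8,760 hours in a year. A minimal sketch of that arithmetic:

    # Annual cost = upfront fee + effective hourly rate for every hour of the year.
    HOURS_PER_YEAR = 24 * 365

    def annual_cost(upfront, hourly):
        return upfront + hourly * HOURS_PER_YEAR

    print(round(annual_cost(4350, 0.497), 2))  # 8703.72 (Amazon EC2 c3.4xlarge)
    print(round(annual_cost(0, 1.11), 2))      # 9723.6  (Rackspace I/O1-30)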

    Dedicated Servers/Instances

    Another issue with databases is they tend not to behave nicely if you’re contending for I/O on a busy host, so both Rackspace and Amazon let you pay for dedicated instances. On Amazon this has a separate fee structure and on Rackspace you effectively have to get their largest instance type.

    So, calculating those costs out for our database instance on an annual basis would look like this:

    Amazon EC2 c3.4xlarge dedicated heavy utilization reserved. Pricing for 1-year term:

    $4,785 upfront cost
    
    $0.546 effective hourly cost
    
    $2 per hour, per region additional cost
    
    $4,785 + ($0.546 + $2.00) * 24 * 365 = $27,087.96

    Rackspace OnMetal I/O Pricing for 1-year term:

    $2.46575 hourly cost
    
    $0.06849 additional hourly cost for managed infrastructure
    
    Total Hourly Cost: $2.53424
    
    $2.53424 * 24 * 365 = $22,199.94

    Consider the dedicated hardware option…

    Given the annual cost of these instances, the next logical step is to consider dedicated hardware, where you rent the resources and the provider is responsible for upkeep. Here at Server Density we use SoftLayer, now owned by IBM, and have dedicated hardware for our database nodes. IBM is becoming very competitive with Amazon and Rackspace, so let’s add a similarly spec’d dedicated server from SoftLayer, at list prices:

    To match a similar spec we can choose the Monthly Bare Metal Dual Processor (Xeon E5-2620 – 2.0Ghz, 32GB RAM, 500GB storage). This costs $491/month or $5,892/year. This is 78.25 percent cheaper than Amazon and 73.46 percent cheaper than Rackspace before you add data transfer costs – SoftLayer includes 500GB of public outbound data transfer per month which would cost extra on both Amazon and Rackspace.

    … or buy your own

    There is another step you can take as you continue to grow — purchasing your own hardware and renting datacenter space i.e. colocation. But that’s the subject of a different post altogether so make sure you subscribe.

  4. Handling timezone conversion with PHP DateTime



    Editor’s note: This is an updated version of an article originally published on 21/03/2009.

    Back in 2009 we introduced a new location preference feature for Server Density. Users could now specify their desired location, and all dates/times were then automatically converted to their timezone (including handling of DST). We did that using the DateTime class introduced with PHP 5.2.

    Your very first challenge related to timezones is dealing with how they are calculated relative to the server’s default timezone setting. Since PHP 5.1, all the date/time functions create times in the server’s default timezone. And as of PHP 5.2 you can set that timezone programmatically using the date_default_timezone_set() function.

    So, if you call the date() function—without specifying a timestamp as the second parameter and the timezone is set to GMT—then the date will default to the +0000 timezone. Equally, if you set the timezone to New York in winter time the timezone will be -0500 (-0400 in summer).

    The ins and outs of handling timezone conversion

    If you want the date in GMT, you need to know the offset of the date you’re working with so you can convert it to +0000, if necessary. When would you need to do this? Well, the MySQL TIMESTAMP field type stores the timestamp internally, using GMT (UTC), but always returns it in the server’s timezone. So, for any SELECT statements you will need to convert the timestamp you pass in your SQL to UTC.

    This might sound complicated but you can let the DateTime class do most of the hard work. You first need to get the user to specify their timezone. This will be attached to any DateTime object you create so the right offset can be calculated. The PHP manual provides a list of all the acceptable timezone strings.

    There is also a PHP function that outputs the list of timezones. Server Density uses this to generate a list of timezones as a drop-down menu for the user to select from.

    DateTimeZone Object

    Once you have the user’s timezone, you can create a DateTimeZone object from it. This will be used for all the offset calculations.

    $userTimezone = new DateTimeZone($userSubmittedTimezoneString);

    To convert a date/time into the user’s timezone, you simply need to create it as a DateTime object:

    $myDateTime = new DateTime('2016-03-21 13:14');

    This will create a DateTime object which has the time specified. The parameter accepts any format supported by strtotime(). If you leave it empty it will default to “now”.

    Note that the time created will be in the default timezone of the server. This is relevant because the calculated offset will be relative to that timezone. For example, if the server is on GMT and you want to convert to Paris time, it will require adding 1 hour. However, if the server is in the Paris timezone then the offset will be zero. You can force the timezone that you want $myDateTime to be in by specifying the second parameter as a DateTimeZone object. If, for example, you wanted it to be 13:14 on 21st March 2016 in GMT, you’d need to use this code or something similar:

    $gmtTimezone = new DateTimeZone('GMT');
    $myDateTime = new DateTime('2016-03-21 13:14', $gmtTimezone);

    To double check, you can run:

    echo $myDateTime->format('r');

    which would output Mon, 21 Mar 2016 13:14:00 +0000.

    The final step is to work out the offset from your DateTime object to the user’s timezone so you can convert it to that timezone. This is where the $userTimezone DateTimeZone object comes in (because we use the getOffset() method):

    $offset = $userTimezone->getOffset($myDateTime);

    This will return the number of seconds you need to add to $myDateTime to convert it into the user’s timezone. Therefore:

    $userTimezone = new DateTimeZone('America/New_York');
    $gmtTimezone = new DateTimeZone('GMT');
    $myDateTime = new DateTime('2016-03-21 13:14', $gmtTimezone);
    $offset = $userTimezone->getOffset($myDateTime);
    echo $offset;

    This will print -14400, i.e. an offset of minus 4 hours (because New York is on DST at that date).

    DateTime::add

    As of PHP 5.3, you can also use the DateTime::add() method to create the new date by simply adding the offset. So:

    $userTimezone = new DateTimeZone('America/New_York');
    $gmtTimezone = new DateTimeZone('GMT');
    $myDateTime = new DateTime('2016-03-21 13:14', $gmtTimezone);
    $offset = $userTimezone->getOffset($myDateTime);
    $myInterval = DateInterval::createFromDateString((string)$offset . ' seconds');
    $myDateTime->add($myInterval);
    $result = $myDateTime->format('Y-m-d H:i:s');
    echo $result;

    The above would output 2016-03-21 09:14:00, which is the correct conversion from 2016-03-21 13:14 London (GMT) to New York time.

    So that’s how we handle PHP timezones at Server Density. What’s your approach?

  5. How to Monitor Apache



    Editor’s note: An earlier version of this article was published on Oct 2, 2014.

    Apache HTTP Server has been around since 1995 and it’s deployed on the majority of web servers out there (although it is losing ground to NGINX).

    As a core constituent of the classic LAMP stack and a critical component of any web architecture, it is a good idea to monitor Apache thoroughly.

    Keep reading to find out how we monitor Apache here at Server Density.

    Enabling Apache monitoring with mod_status

    Most of the tools for monitoring Apache require the use of the mod_status module. This is included by default but it needs to be enabled. You will also need to specify an endpoint in your Apache config:

    <Location /server-status>
    
      SetHandler server-status
      Order Deny,Allow
      Deny from all
      Allow from 127.0.0.1
    
    </Location>
    

    This will make the status page available at http://localhost/server-status on your server (check out our guide). Be sure to enable the ExtendedStatus directive to get full access to all the stats.

    Monitoring Apache from the command line

    Once you have enabled the status page and verified it works, you can use the command line tools to monitor the traffic on your server in real time. This is useful for debugging issues and examining traffic as it happens.

    The apache-top tool is a popular way of achieving this. It is often available as a system package, e.g. apt-get install apachetop, but can also be downloaded from source, as it is just a simple Python script.

    Apache monitoring and alerting – Apache stats

    apache-top is particularly good at i) real time debugging and ii) determining what’s happening on your server right now. When it comes to collecting statistics, however, apache-top will probably leave you wanting.

    This is where monitoring products such as Server Density come in handy. Our monitoring agent supports parsing the Apache server status output and can give you statistics on requests per second and idle/busy workers.

    Apache has several process models. The most common one is worker processes running idle waiting for service requests. As more requests come in, more workers are launched to handle them—up to a pre-configured limit. Once past that limit all requests are queued and visitors experience service delays. So it’s important to monitor not only raw requests per second but idle workers too.
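    For illustration, here is a minimal sketch of pulling those numbers from mod_status yourself. This is not our agent’s actual code; it assumes the /server-status endpoint configured earlier is reachable locally and that ExtendedStatus is on (ReqPerSec is only reported then). The ?auto suffix returns a machine-readable version of the status page.

    # Read requests/sec and worker counts from Apache's machine-readable status page.
    # Assumes mod_status is enabled locally with ExtendedStatus On.
    from urllib.request import urlopen

    STATUS_URL = "http://localhost/server-status?auto"

    def apache_stats(url=STATUS_URL):
        stats = {}
        for line in urlopen(url).read().decode("utf-8").splitlines():
            if ":" in line:
                key, value = line.split(":", 1)
                stats[key.strip()] = value.strip()
        return {
            "requests_per_sec": float(stats.get("ReqPerSec", 0)),
            "busy_workers": int(stats.get("BusyWorkers", 0)),
            "idle_workers": int(stats.get("IdleWorkers", 0)),
        }

    print(apache_stats())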

    A good way to configure Apache alerts is by first determining what the baseline traffic of your application is and then setting alerts around it. For example, you can generate an alert if the stats are significantly higher (indicating a sudden traffic spike) or if the values drop significantly (indicating an issue that blocks traffic somewhere).

    You could also benchmark your server to figure out at what traffic level things start to slow down. This can then act as the upper limit for triggering alerts.

    Apache monitoring and alerting – server stats

    Monitoring Apache stats like requests per second and worker status is useful in keeping an eye on Apache performance, and indicates how overloaded your web server is. Ideally you will be running Apache on a dedicated instance so you don’t need to worry about contention with other apps.

    Web servers are CPU hungry. As traffic grows Apache workers take up more CPU time and are distributed across the available CPUs and cores.

    CPU % usage is not necessarily a useful metric to alert on because the values tend to be on a per CPU or per core basis whereas you probably have multiple instances of each. It’s more useful to monitor the average CPU utilisation across all CPUs or cores.

    Using a tool such as Server Density, you can visualise all this plus configure alerts that notify you when the CPU is overloaded – our guide to understanding these metrics and configuring CPU alerts should help.

    On Linux the CPU average discussed above is abstracted out to another system metric called load average. This is a decimal number rather than a percentage and allows you to view load from the perspective of the operating system i.e. how long processes have to wait for access to the CPU. The recommended threshold for load average therefore depends on how many CPUs and cores you have – our guide to load average will help you understand this further.
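    As a rough rule of thumb (not a hard limit), a load average persistently above the number of cores suggests processes are queuing for CPU time. A quick way to sanity-check that on a Linux or macOS box:

    # Compare the 1-minute load average with the number of CPU cores available.
    # A normalised value persistently above 1.0 suggests CPU contention.
    import os

    cores = os.cpu_count() or 1
    load_1m, load_5m, load_15m = os.getloadavg()  # not available on Windows
    print(cores, load_1m, round(load_1m / cores, 2))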

    Monitoring the remote status of Apache

    All those metrics monitor the internal status of Apache and the servers it runs on, but it is important to monitor the end-user experience too.

    You can achieve that by using external status and response time tools. You need to know how well your Apache instance serves traffic from different locations around the world (wherever your customers are). Based on that, you can then determine at what stage you should add more hardware capacity.

    This is very easy to achieve with services like Server Density because of our in-built website monitoring. You can check the status of your public URLs and other endpoints from custom locations and get alerts when performance drops or when there is an outage.

    This is particularly useful when you need graphs to correlate Apache metrics with remote response times, especially if you are benchmarking your servers and want to know when a certain load average starts to affect end-user performance.

  6. Diversity is Good Business. Here is Why



    Let’s get the obvious out of the way. The tech industry has a serious and chronic diversity problem. The very industry that’s supposed to spearhead new ideas, innovation and progress is woefully behind the times where it matters most: the heterogeneity of its people.

    Tech workers are predominantly male and white, while non-white workers earn significantly less than their white counterparts. To make matters worse, an overwhelming majority of tech firms do not have gender-diverse senior management at the helm. And while there has been some welcome transparency in the last few years (annual diversity reports and so on), it has not been followed by any meaningful change in momentum. Minorities continue to be underrepresented and women continue to leave the tech industry at greater rates than their male peers.

    What this indicates is that we cannot deal with diversity in the same way we tackle most problems in tech. In other words . . .

    This is not a metrics problem

    We can’t approach diversity as a hiring quota challenge, hard as that challenge may be. The diversity issue goes deeper than that. It’s a culture problem that starts from schooling and education before it expresses itself everywhere else, including boardrooms, office corridors and water cooler corners.

    Within companies, diversity starts at the top.

    Leadership is where culture is born and shaped. As a corollary, any investments in hiring can easily go to waste if the company is not driven by culturally diverse values. What good is hiring more people if the workplace cannot integrate and retain their talents?

    And while we’re at it: what’s so good about diversity? Why do we want it? Is it because of an upcoming equal opportunity report? Are we paying lip service to diversity because that’s what everyone else is doing?

    Behind most of those questions lies an inherent aversion to diversity. As if tech companies have to mitigate diversity, tacitly dismissing it as another cost of doing business. This is not only short-sighted (diversity takes time and effort) but it is also counterproductive since diversity is associated with creativity, innovation, and real economic benefits.

    Diversity is Good Business

    Ideas generated by people from different backgrounds are informed by different experiences, worldviews, and values. It’s great when ideas get the chance to cross-pollinate like this. As James Altucher says, you combine two ideas to come up with a better idea. A more diverse workplace is therefore a more fertile place for ideas.

    Idea evolution works much faster than human evolution.

    James Altucher

    Now, here is the thing: ideas in diverse environments do not come easy. Why? Because diverse ideas tend to be different. Different (opposing) ideas have to be debated. They have to be weighed, discussed and decided upon. This lack of initial consensus, this creative friction, does not come free. The rigour and discipline involved in negotiating and distilling insights and action plans from a broad and varied pool of ideas comes with an upfront cost. But it bears fruit down the line. This requisite complexity translates into a more thought-out and “creatively hardened” product that has a better chance of surviving against other ideas in the marketplace.

    In short, if you want to create new and better products—products that appeal to a broader audience—you should focus on creating a diverse company culture, starting from the top.

    Our diversity journey

    We live in an increasingly pluralistic society. The majority of our customers are outside the UK; they come from many different backgrounds. By having a more diverse team, we have a better chance of building something that appeals to our diverse customers.

    Server Density launched in 2009, and for much of our first few years it was just a few of us building stuff. Diversity did not become a priority until our team was several engineers strong. Most of them work remotely from various parts of Europe and the UK. Having multilingual folks from different geographies and cultures working in the same team is an incredible creative catalyst for everyone. Our product couldn’t be what it is today if we didn’t have all those different perspectives.

    In line with the overall industry, however, the percentage of female engineers in our team is lower than we would like. We took some time to study this challenge and observe what other companies have done. We wanted to address this now, while our company and culture were in their formative years, realising that any change would be exponentially harder to make a few years down the line.

    So here is what we did.

    Avoid gender-coded job ads

    It turns out that power words (driven, logic, outspoken) are more masculine and attract male candidates, while warmer ones (together, interpersonal, yield) encourage more women to apply. We now use online analysis tools to scan all our job ads and suggest changes before we publish them.

    Another problem, as illustrated by a Harvard Business Review article, is that women tend to avoid applying for roles they are not 100% qualified for, whereas men go ahead and apply anyway. To cater for that behaviour we try to remove as many self-selection criteria as possible. We want to be the ones deciding if the candidate is qualified enough, not them, even if it means more work and delays in filling open positions.

    Avoid unconscious bias

    As part of the hiring process, we ask all our candidates to take a writing test followed by a coding exercise. When we review those, the name of the candidate is now hidden, in order to avoid unconscious bias in assessing those tests.

    Encourage a diverse culture

    The next, and harder, step involves fostering a culture that encourages diverse ideas. We thought long and hard about this. How do you make sure everyone gets a chance to steer the direction of our company and have a voice when it comes to what features we invest in?

    While we are still navigating those questions, we’ve already started making targeted adjustments in how we collaborate. For example, we started running planning games: a regular forum where we plan our engineering efforts. Everyone has an equal voice in this meeting and we review and vote on all ideas based on merit. We stand up and defend what we think, and we support and encourage folks to participate.

    We also reviewed our employee handbook, including all company policies, and made significant changes to ensure they are as inclusive as they can be. Many of our policies (equal opportunities, hiring/selection, complaints procedure, code of conduct and maternity/paternity leave) used to be informal. We found that just having them written down, and being able to point to them during our recruitment efforts, has a tangible impact. It shows you’ve at least thought about it.

    So we codified our policies in a systematic manner, using pull requests so the proposed format could be discussed by everyone. As an example, if someone feels unable to escalate an issue to their manager, we now have alternative routes in place, including members of the board if needed, and in full confidence.

    Going forward

    As with most worthwhile things, the hardest step is the first one. Going from zero to one employees in underrepresented demographics is invariably undermined by the assumption that if you don’t have diversity it’s like this for a reason.

    In response to that, we rely on referrals quite heavily as a way to proactively reach out for competent candidates in diverse backgrounds. Obviously that is a short-term measure, and ideally we should gain traction with all demographics sooner rather than later. Having a diverse culture allows you to tap into a broader talent pool, internally and externally. As CTO of Busuu, Rob Elkin, put it, “We just want to make sure that the process for showing that someone should be part of the team is as open and fair as possible.”

    We have also started to sponsor and participate in various industry events that encourage diversity (e.g. COED Code). On top of that, we are looking to broaden where we place our engineering job ads. So far we’ve been publishing them on stackoverflow but we want to reach further and wider.

    In closing

    The Canadian cabinet consists of 30 ethnically and religiously diverse ministers, evenly split between women and men who are mostly aged under 50. While we don’t plan to relocate to Canada just yet, it certainly serves as a great example of leadership that is inclusive and representative of as many people as possible.

    At Server Density we don’t tackle diversity with a single-minded metrics driven approach. This is not a numbers problem as much as it is a culture problem. It’s not so much about putting a tick in a box as it is about i) understanding the challenge ii) internalising the benefits of diversity and iii) making strategic and nuanced changes in the way we lead our people.

    A truly diverse culture is not a compromise. It couldn’t be. It’s a long-term investment into the fundamentals of our team and our future prospects as a company.

  7. How do you document your ops infrastructure?



    Editor’s note: For a detailed look at how we systematically unearth productivity black-holes in our Ops team, join the webinar at the end of this page. Note, this is a new version of an article originally published on 03/15/2014.

    As your infrastructure and team grow, one of the most important things is how everything is documented. Anyone new joining the team, existing members working on new areas, and even the on-call team need to know how things work.

    The first line of documentation is essentially config management, and for this we use Puppet. This defines things like packages, config files, server roles, etc. However, it only defines the “state.” In addition to this, documentation needs to cover things like emergency response, how to deal with alerts, failover procedures, processes, checklists and vendor information.

    What do we want from our ops documentation?

    I recently started a project at Server Density to revamp all our docs. We’ve had some problems which could have been avoided or resolved faster if our docs were better. As our infrastructure continues to grow, this is important to address properly, and then keep well maintained.

    Confluence

    Historically we used Confluence as a wiki, but we gradually transitioned to GitHub with Markdown-formatted files kept alongside the code. However, both have some problems:

    • Search. GitHub search is designed primarily for code, and requires filters for the organisation and repository. We’d need to split the docs into a separate repo to avoid also searching the code that sits alongside them. In Confluence, search was never accurate and was also quite slow.
    • Editing. The biggest challenge for any documentation is keeping it up to date. Being able to quickly edit the docs is important, and there’s some overhead with a wiki format or having to commit code. It’s minor, but it is an extra step. Formatting is also inflexible.
    • Collaboration. Being able to work on a doc simultaneously, or discuss changes / comment on areas of a doc, is much better in GitHub than in Confluence, but it is still focused around individual commits or pull requests combining specific changes. This works well for a specific body of work but not for ongoing discussions.
    • Speed. GitHub performs well, but Confluence is really slow at everything. We used their hosted version rather than the on-premise install.

    In summary, we want a system that has minimal barriers to creating / editing docs, can be searched quickly and accurately, is easy to collaborate on and ideally it should also be available offline and/or downloadable.


    How do other people document their infrastructure?

    I asked on Twitter to see what other people were doing, having looked online and not found much about what other companies are doing (other than a brief mention of Confluence by Etsy).

    You can click through to see the range of replies. They included things like MediaWiki, GitHub Wiki, OneNote, Hackpad (since acquired by Dropbox), Confluence, and some more complex tools with offline sync. Also noted was how GitHub does this itself, using Markdown files which are synced offline too.

    What did we pick for Ops documentation?

    Having already tried Confluence and Markdown files in GitHub, I decided to try Google Docs. The whole team already had access to it through the web, offline and via mobile; documents can be created and edited very quickly, in-line and collaborated on by multiple team members; it has a built-in drawing tool so we can create system diagrams; it’s very fast to load; and crucially, search is incredibly fast and accurate. It is Google search, after all! You can also download documents in multiple formats to store offline if you prefer.

    Are you doing something different, or do you have a good way to address the documentation problem? Please do comment!

    Also, make sure you join the Running Better Ops Teams webinar (see below). It reveals the ins and outs of how to systematically unearth engineer-time black holes, eliminate knowledge silos, and save time for things that matter: Improving your product and growing your business.


  8. How to Write a Postmortem



    When sufficiently elaborate systems begin to scale, it’s only a matter of time before some sort of failure happens.

    Failure sucks at the time, but there are significant learnings to be had. Taking the time to extract every last bit of insight from failure is an invaluable exercise. We’d be robbing ourselves of that gift if we skipped postmortems.

    So, despite the grim sounding name, we appreciate postmortems here at Server Density (and we’re in good company, it seems).

    Keep reading to find out why.

    Postmortems restore focus

    [Image: the Eisenhower Decision Matrix (Merrill Covey matrix)]

    When faced with service interruptions, we drop everything in our hands and perform operational backflips 24×7 until the service is restored for all customers.

    This type of activity classifies as “important” and “urgent” (see quadrant 1 of the “Eisenhower Decision Matrix“).

    When the outage is over, however, we need to consciously shift our focus back to what’s “important” and “not urgent” (see quadrant 2). If we don’t then we risk spending time on distractions and busywork (quadrants 3 and 4).

    The discipline of writing things down requires us to take a pause, collect our thoughts and draft an impartial, sober, and fearless account of what happened, how we dealt with it, what we learned and what steps we’re taking to fix it.

    Postmortems restore confidence

    Right from the beginning, we decided we wanted to treat our customers the same way we want to be treated. Generally speaking, enterprise companies (GitHub, Google Cloud, Amazon, et cetera) have more engaged and invested technical audiences who want to know the details of what’s going on. Amazon, for example, offers some great postmortems. We wanted to offer something similar.

    Communicating detailed postmortems helps restore our credibility with our users. It demonstrates that someone is investing time on their product. That they care enough to sit down and think things through.

    When it comes to service interruption, over-communication is a good thing. As is transparency, i.e. acknowledging problems on time and throwing the public light of accountability on all remaining issues until they’re resolved. Going public provides all the incentives we need to fix problems.

    How we write postmortems

    Our postmortems start their lives as long posts on our internal Jira Incident Response page.

    [Screenshot: an internal incident post on our Jira Incident Response page]

    Internal outages might not affect our customers but they do take a toll on our engineering team (for example, server failovers waking someone up). We treat those with the same priority and focus. As advocates of HumanOps, we’re all about having the right systems in place so that operational issues don’t spill over into our personal time and impact our wellbeing.

    In case of an actual service outage, we replicate the same postmortem to our dedicated status page (we filter out obvious security specifics). Here is a case that started from Jira (see above) and graduated to our status page:

    [Screenshot: the same postmortem published on our public status page]

    Postmortem timing

    While the crisis is still unfolding we publish short status updates at regular intervals. We stick to the facts, including scope of impact and possible workarounds. We update the status page even if it’s just to say “we’re still looking into it.”

    It usually takes a week from issue resolution to the point when we’re ready to author a full postmortem. That timeframe affords us the opportunity to do a number of things:

    1. Rule out the possibility of follow-up incidents. Ensure the problem is fixed.
    2. Speak to all internal teams and external providers, compare notes with everyone and agree on what went wrong. Mind you, getting in touch with all the right people is not always easy. The outage might’ve occurred over the weekend or during local holidays or the engineer might be on their off-call day.
    3. Decide on a timeline for implementing strategic changes to our process, infrastructure, provider selection, product, et cetera.

    Postmortem content

    Postmortems are no different to other types of written communication. To be effective, their content needs a story and a timeline:

    1. What was the root cause? What turn of events led to the server failover? What roadworks cut what fiber? What DNS failures happened, and where? Keep in mind that a root cause may’ve set things in motion months before any outage took place.
    2. What steps did we take to identify and isolate the issue? How long did it take for us to triangulate it, and is there anything we could do to shorten that time?
    3. Who / what services bore the brunt of the outage?
    4. How did we fix it?
    5. What did we learn? How will those learnings advise our process, product, and strategy?

    Who writes a postmortem

    Our status updates are published by whoever is leading the incident response or happens to be on call. It’s usually either the ops or the support team.

    Once the issue is resolved, the same people will be expected to draft a postmortem on Jira for everyone to comment and discuss. Once that review is complete, as the CEO, I will then publish that postmortem onto our dedicated public page.

    Summary

    Successful outage resolutions go hand in hand with comprehensive postmortems. If you don’t take the time to document things properly, you rob your team of the opportunity to learn, which opens up the possibility of repeating the same mistakes. You also miss out on an opportunity to grow as a company.

    What about you? How do you deal with failure? Do you write a postmortem, and who is accountable for it?

  9. Secure your Accounts – Team Security Best Practices


    In previous articles we looked at some key technical principles and security best practices for your infrastructure and application development.

    A much larger attack surface, however, is your team.

    People are susceptible to fraud, deception and human error, and that makes us the weakest link when it comes to safe systems. That is why it’s important to have multiple layers of security in place. If one of them falls, the rest are still there to provide protection.

    Team Security best practices

    At Server Density we maintain an ops security checklist, which every new team member is required to complete and then review on a monthly basis. This ensures we don’t miss the easy, low-hanging fruit offered by our security tools.

    While it’s impossible to be 100% secure, there are a number of key team security practices you can adopt to dramatically improve your operational security.

    1. Two-Factor Authentication (2FA)

    If you do nothing else, adopt multi-factor authentication: it is the single most important security tool available. Even if you have a poor (or compromised) password, or you use the same password for multiple accounts, two-factor authentication can compensate for those shortcomings.

    Email is the first tool you should protect with 2FA. And for good reason. All password resets go to an email account, which means email truly is the gateway to your identity.

    Here is how it works. When you log in from new locations, new systems or even from your existing computer (after a time threshold, usually 30 days or so) you will need to verify yourself through an additional token authentication. This may be an app in your phone or a physical device you carry with you.

    2. Strong, Account-Specific Passwords

    [Image: xkcd comic on password strength]

    Brute force attacks against well protected services such as your Google account are unlikely to succeed, thanks to the rate limiting protections those services employ. A weak password could, however, be easy to guess without the need for brute force or any other hacking method.

    If you use the same password for every online account, then when one of those services is compromised and its unencrypted user database leaked, your password is exposed everywhere. There are many examples of accounts being hacked because of password dumps sourced from other services.

    The best way to protect against this is to have an auto generated password for each individual account. And because it’s impossible to remember hundreds of strong passwords, we suggest you employ a password manager such as 1Password or LastPass.

    Assisted by browser extensions and shortcuts, password managers can often speed up your workflow. All you need to remember is a single password to unlock the password manager itself. What’s more, most password managers now have good integrations with the popular mobile OSs, which means your mobile workflow is just as fast.

    Note that, while most password managers are cloud based, it’s prudent to keep your own backup of your password database.

    3. Secure Connections – SSL

    Accessing web sites and email over encrypted connections sounds like overkill, right? I mean, who is going to snoop on my email? Who cares enough to do that?

    This misconception provides a false sense of safety to many people. Here is the thing. You don’t need to be a celebrity or a VIP for your online identity to be valuable to hackers.

    Most Man In The Middle (MITM) attacks intercept your connection in order to inject malware, whose sole purpose is to recruit your computer into a botnet (or harvest data using keyloggers). Perhaps less common in Western countries, this is a widely exploited attack vector in China.

    4. System Updates

    There is a good reason your phone and computer keep pestering you for updates. Staying current with the latest OS and app releases helps protect against known vulnerabilities. Google’s Chrome browser has, of course, spoiled us with its silent, automatic updates.

    Make sure you keep all your software, including core OSs, up to date. If you’re not ready for all the new features of new software versions, then at least install all patches and point releases. That way you avoid being targeted by malware that takes advantage of known security holes.

    5. Travel Securely with a VPN

    When connecting to any wi-fi network outside your control (airport, cafe, library), you open yourself up to a vast range of possible attacks. The classic Firesheep extension is a great example, as are the more recent drive-by downloads via hotel wireless networks.

    The only way to be certain your connection is secure is to connect via a virtual private network (VPN). As the name suggests, a VPN extends your private network across a public network, such as the Internet.

    It’s worth mentioning that VPN software is one of those products where you get what you pay for. Do not use a free service. This simply moves the vulnerability from the local network to the VPN provider, who is likely making money some other way (selling your data, injecting ads, serving malware, et cetera). When it comes to VPN packages, the saying “if you’re not paying for it, you are the product” definitely applies.

    When setting up your VPN make sure all traffic is blocked until the VPN connection is established. That ensures you don’t leak any data during those few seconds between connecting to the wi-fi and connecting to the VPN.

    Implement Those Now

    Security best practices are all about creating multiple layers of protection, each making it a little bit harder for someone to attack you.

    Setting up those 5 tools takes less than an hour, and what you get is solid protection against all but the most sophisticated of attacks.

    By the way, most hacks are opportunistic (unless you’re being specifically targeted), which means implementing those security practices will deter most hackers from even trying.

    [Image Credit: The great folks at xkcd.com]

  10. What’s in your Backpack? Modular vs. Monolithic Development


    While building version 2.0 of our Server Monitoring agent, we reached a point where we had to make a choice.

    We could either ship the new agent together with every supported plugin, in one single file. Or we could deploy just the core logic of the agent and let users install any further integrations, as they need them.

    This turned out to be a pivotal decision for us. And it was much more than technical considerations that advised it.

    Let’s start with some numbers.

    How Much Does Your File Weigh?

    Simple is better than complex.

    The Zen of Python

    The latest version of our agent allows Server Density to integrate with many applications and plugins. We made substantial improvements in the core logic and laid the groundwork for regular plugin iterations, new releases and updates.

    All that extra oomph comes with a relatively small price in terms of file size. Version 2.0 has a 10MB on-disk footprint.

    If we were to take the next step and push every compatible plugin into a single package, our agent would become ten times “heavier”. And it would only keep growing every time we support something new.

    Moving is Living

    Question: But agent footprint is not a real showstopper, is it? Why worry about file sizes when I can get everything I need in one go?

    Sure.

    There is something to be said about the convenience of the monolithic approach. You get everything you need in one serving.

    And yet, it is the nature of this “component multiplicity” that makes iterations of monolithic applications slower.

    For example, when a particular item (say, the Python interpreter or Postgres library) is updated by the vendor, our users would have to wait for us to update our agent before they get those patches. Troubleshooting and responding to new threats would therefore—by definition—take longer. This delay creates potential attack vectors and vulnerabilities.

    Even if we were on-the-button with every possible plugin update (an increasingly impossible feat as we continue to broaden our plugin portfolio), the majority of our users would then be exposed to more updates than they actually need.

    Either way, it’s a lousy user experience.

    To support all those new integrations—without introducing security risks or needless headaches for our users—was not easy. It took a significant amount of development time to come up with an elegant, modular solution that is simple—and yet functional—for our customers.

    The result is a file that includes the bare minimum: agent code plus some specific Python modules.

    Flexibility and Ease of Use

    To take advantage of all the new supported integrations, users may choose to install additional plugins as needed.

    Question: Doesn’t that present challenges in larger / diverse server environments?

    Probably not.

    Sysadmins continue to embrace Puppet manifests, Chef configuration deployment and Ansible automation—tools designed to keep track of server roles and requirements. It’s easier than ever to stay on top of what plugin goes to what server. Automation and configuration utilities can remove much of that headache. Since we tie into standard OS package managers (deb or RPM packages), we simply work with the existing tools everyone is already used to.

    By packaging the plugins separately we get to focus on what we control: the logic inside our agent. Users only ever download what they need, and enjoy greater control of what’s sitting on their servers. The end-result is a flexible monitoring solution that adapts to our users (rather than the other way around).

    The 1.x to 2.0 agent upgrade is not automatic. Existing installations will need to opt-in. We’ve made it easy to upgrade with a simple bash script. Fresh installs will default to version 2.0. The 1.x agent will still be available (but deprecated). All version 1.x custom plugins will continue to work with the new agent too.

    Summary

    Truth is ever to be found in simplicity, and not in the multiplicity and confusion of things.

    Isaac Newton

    The modular vs. monolithic debate has been going on for decades. We don’t have easy answers and it’s not our intention to dismiss the monolithic approach. There are plenty of examples of closed monolithic systems that work really well for well-defined target users.

    Knowing our own target users (professional sysadmins), we know we can serve them better by following a modular approach. We think it pays to keep things small and simple for them, even if it takes significantly more development effort.

    As we continue improving our back-end, our server monitoring agent will support more and more integrations. Employing a modular approach means prompt updates with fewer security risks. That’s what our customers expect, and that’s what drives our decisions.

    What about you? What approach do you follow?

