Author Archives: David Mytton

About David Mytton

David Mytton is the founder of Server Density. He has been programming in PHP and Python for over 10 years, regularly speaks about MongoDB (including running the London MongoDB User Group), co-founded the Open Rights Group and can often be found cycling in London or drinking tea in Japan. Follow him on Twitter and Google+.
  1. AWS Outage Teaches Us Monitor Cloud Like It’s Your Data Center


    At the beginning of the month, AWS suffered a major outage of its S3 service, the original storage product which launched Amazon Web Services back in 2006.

    Reliance on this service was highlighted by the vast number of services which suffered downtime or degraded service as a result. The root cause turned out to be human error followed by cascading system failures.

    With a growing dependence on the cloud for computing and with no signs of demand for cloud resources abating, we really need to treat those resources like the on-premises data center that we relied on for so many years.

As the article “3 Steps to Ensure Cloud Stability in 2017” points out, “it’s critical to ensure the stability of your cloud ecosystem”, and that starts with monitoring. The article offers the following advice: “Ensure that you have access to reports which can give you actionable, predictive analytics around your cloud so that you can stay ahead of any issues. This goes a long way in helping your cloud be stable.”

    Of course, I couldn’t agree more! Server Density even built an app to send notifications when cloud providers have outages.

The cloud might provide “unlimited” scalability and instant provisioning, but its SLAs and reliability guarantees are often mistaken for a promise of 100% uptime and complete reliability. Note that S3 itself guarantees 99.99% uptime every year, which equates to just under an hour of expected downtime (0.01% of 8,760 hours is roughly 53 minutes).

Note, too, that the outage only affected the US East region. Other regions were unaffected, yet the fact that many services suffered outages indicates they rely on a single region for their deployments. AWS runs multiple zones within each region; these are equivalent to individual data centers but still sit within a logical group and a small geographical area. Cross-region deployment is typically reserved for mitigating geographic events such as storms, but it should also be used to mitigate software and system failures. Good systems practice means code changes get rolled out gradually, and indeed AWS states that regions are entirely isolated and operated independently.

    S3 itself has a feature which automates cross region replication. Of course, this doubles your bill because you have data in two regions, but it does allow you to switch over in the event an entire region is lost. Whether that cost is worth it depends on the type of service you’re running. Expecting an hour a year of downtime is the starting point for the cost benefit calculation, but this particular outage took the service offline for more than that.
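
    To make this concrete, here is a minimal sketch of enabling cross region replication with boto3. The bucket names and IAM role ARN are placeholders; both buckets must already exist in different regions, and replication requires versioning on both.

    import boto3

    s3 = boto3.client("s3")

    # Versioning is a prerequisite for replication on both source and destination.
    for bucket in ("my-source-bucket", "my-replica-bucket"):
        s3.put_bucket_versioning(
            Bucket=bucket,
            VersioningConfiguration={"Status": "Enabled"},
        )

    # Replicate every new object written to the source bucket into the replica bucket.
    s3.put_bucket_replication(
        Bucket="my-source-bucket",
        ReplicationConfiguration={
            "Role": "arn:aws:iam::123456789012:role/s3-replication-role",
            "Rules": [
                {
                    "ID": "replicate-everything",
                    "Prefix": "",
                    "Status": "Enabled",
                    "Destination": {"Bucket": "arn:aws:s3:::my-replica-bucket"},
                }
            ],
        },
    )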

    Human error can never be eliminated, but the chances can be reduced. Using automation, checklists and ensuring teams practice incident response all contribute to good system design. Having a plan when things go wrong is crucial, just as crucial as testing the plan actually works on a regular basis! And when the incident is resolved, following up with a detailed (and blameless) post mortem will provide reassurance to customers that you are working to prevent the same situation from happening again.

Outages will always throw up something interesting, such as the AWS Status Dashboard itself being hosted on S3. The key is knowing when something is going wrong, having a plan, and closing it out with a post mortem.

  2. Time series data with OpenTSDB + Google Cloud Bigtable


    For the last 6 years, we’ve used MongoDB as our time series datastore for graphing and metrics storage in Server Density. It has scaled well over the years and you can read about our setup here and here, but last year we came to the decision to replace it with Google Cloud Bigtable.

As of today, the migration is complete and all customers are now reading from our new time series setup running on Google Cloud. We successfully migrated a workload of over 100,000 writes per second into a new architecture, with a new database, on a new cloud vendor, with no downtime. Indeed, the only thing customers should notice is even faster graphing performance!

    I presented this journey at Google’s Cloud Next conference last week, so this post is a writeup of that talk, which you can watch below:

    The old architecture

    Our server monitoring agent posts data back from customer servers over HTTPS to our intake endpoint. From here, it is queued for processing. This was originally based on MongoDB as a very lightweight, redundant queuing system. The payload was processed against the alerting rules engine and any notifications sent. Then it was passed over to the MongoDB time series storage engine, custom written in PHP. Everything ran on high spec bare metal servers at Softlayer.

    The old Server Density time series architecture

    Scaling problems

    Over the years, we rewrote every part of the system except the core metrics service. We implemented a proper queue and alerts processing engine on top of Kafka and Storm, rewriting it in Python. But MongoDB scaled with us until about a year ago, when the issues that had been gradually growing began to cause real pain.

• Time sink. Although MongoDB is an off-the-shelf product, it is designed as a general purpose database, so we had to implement a custom API and schema to make it handle time series data efficiently. This was taking a lot of time to maintain.
• We wanted to build more. The metrics service was custom built and, as a small team, we didn’t have the time to add basic time series features like aggregation and statistics functions. We were focused on other areas of the product, without time to enhance basic graphing.
    • Unpredictable scaling. Contrary to popular belief, MongoDB does scale! However, getting sharding working properly is complex and replica sets can be a pain to maintain. You have to be very careful to maintain sufficient overhead so when you add a new shard, migrations can take place without impacting the rest of the cluster. It’s also difficult to estimate resource usage and predict what is needed to continue to maintain performance.
    • Expensive hardware. To ensure queries are fast, we had to maintain huge amounts of RAM so that commonly accessed data is in memory. SSDs are needed for the rest of the data – tests we did showed that HDDs were much too slow.

    Finding a replacement

    In early 2016 we decided to evaluate alternatives. After extensive testing and evaluation of a range of options including Cassandra, DynamoDB, Redshift, MySQL and even flat files, we picked OpenTSDB running on Google Cloud Bigtable as the storage engine.

    • Managed service. Google Cloud Bigtable is fully managed. You simply choose the storage type (HDD or SSD) and how many nodes you want, and Google deals with everything else. We would no longer need to worry about hardware sizing, component failures, software upgrades or any other infrastructure management tasks.
• OpenTSDB is actively maintained. All the features we need right now, plus those we want to build into the product later, are available as standard with OpenTSDB (a quick example of writing to its HTTP API follows this list). It is actively developed, so new things are regularly released, which means we can add features with minimal effort. Because it is open source, we have also contributed fixes back to the project.
    • Linear scalability. When you buy a Bigtable node, you get 10,000 reads/writes per second at 6ms 99th percentile latency. We can easily measure our throughput and calculate it on a per customer basis, so we know exactly when to scale the system. Deploying a new node takes 1 click and will be online within minutes. Contrast this with ordering new hardware, configuring it, deploying MongoDB replica sets, adding the shard and then waiting for data to rebalance. Bigtable gives us linear scalability of both cost and performance.
    • Specialist datastore. MongoDB is a good general purpose database, but Bigtable is optimised specifically for our data format. It learns usage patterns, distributing data around the cluster to optimise performance. It’s much more efficient for this type of data so we can see significant performance and cost improvements.
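
    As promised in the list above, here is a minimal sketch of writing a data point to OpenTSDB’s standard HTTP API. The host, port, metric and tag names are illustrative rather than our production values; OpenTSDB listens on port 4242 by default and returns 204 No Content on success.

    import time
    import requests

    datapoint = {
        "metric": "system.cpu.load",    # illustrative metric name
        "timestamp": int(time.time()),  # seconds since the epoch
        "value": 0.42,
        "tags": {"host": "web-1", "account": "example-customer"},
    }

    resp = requests.post("http://opentsdb.example.com:4242/api/put", json=datapoint)
    resp.raise_for_status()  # raises if OpenTSDB rejected the write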

    The migration

    The first challenge for the migration was that it needed to communicate across providers – moving from Softlayer to Google Cloud. We tested a few options but since Server Density is built using HTTP microservices and every service is independent, we decided to implement it entirely on Google Cloud, exposing the APIs over HTTPS restricted to our IPs. Payloads still come into Softlayer and are queued in Kafka, but they are then posted over the internet from Softlayer to the new metrics service running on Google Cloud. Client reads are the same.

    We thought this might cause performance problems but in testing, we only saw a slight latency increase because we picked a Google region close to our primary Softlayer environment. We are in the process of migrating all our infrastructure to Google Cloud so this will only be a temporary situation anyway.

    Our goal was to deploy the new system with zero downtime. We achieved this by implementing dual writes so Kafka queues up a write to the new system as well as the old system. All writes from a certain date went to both systems and we ran a migration process to backfill the data from the old system into the new one. As the migration completed, we flipped a feature flag for each customer so it gradually moved everyone over to the new system.
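
    The dual write pattern itself is simple. The sketch below is an illustration only, not our actual code (which runs as Kafka consumers): the helper functions and the flag name are hypothetical.

    def handle_metrics_payload(customer_id, payload):
        # During the migration every payload is written to both systems, so the
        # new datastore stays in sync from the cut-over date onwards.
        write_to_mongodb(customer_id, payload)   # old time series engine
        write_to_opentsdb(customer_id, payload)  # new OpenTSDB + Bigtable engine

    def read_metrics(customer_id, query):
        # Reads switch per customer once their historical backfill has completed.
        if flag_enabled("metrics_on_bigtable", customer_id):
            return read_from_opentsdb(customer_id, query)
        return read_from_mongodb(customer_id, query)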

    The new system looks like this:

    The new Server Density time series architecture

    Using the Google load balancers, we expose our metrics API which abstracts the OpenTSDB functionality so that it can be queried by our existing UI and API. OpenTSDB itself runs on Google Container Engine, connecting via the official drivers to Google Cloud Bigtable deployed across multiple zones for redundancy.
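
    For illustration, a direct query against OpenTSDB’s standard /api/query endpoint looks roughly like the sketch below; in our setup the metrics API wraps this behind its own HTTPS interface, and the host and metric names are placeholders.

    import requests

    query = {
        "start": "1h-ago",
        "queries": [
            {
                "metric": "system.cpu.load",
                "aggregator": "avg",
                "downsample": "1m-avg",
                "tags": {"host": "web-1"},
            }
        ],
    }

    resp = requests.post("http://opentsdb.example.com:4242/api/query", json=query)
    for series in resp.json():
        print(series["metric"], series["dps"])  # "dps" maps timestamp -> value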

    What did we end up with?

    A linearly scalable system, high availability across multiple zones, new product features, lower operational overhead and lower costs.

    As a customer, you should notice faster loading graphs (especially if you’re using our API) right away. Over the next few months we’ll be releasing new features that are enabled by this move, the first you may have already noticed as unofficially available – our v2 agent can backfill data when it loses network connectivity or cannot post back for some reason!

  3. Comparing Server Density vs Datadog


    In 2009 I wrote the original version of the Server Density monitoring agent – sd-agent – designed to be lightweight and quick and easy to deploy. It was released under the FreeBSD open source license because I thought that if I was installing software onto my systems, I would at least want to have the ability to examine the source code! There weren’t any good SaaS server monitoring products, so I decided to build one.

    In 2010, Datadog forked sd-agent into dd-agent and started building their company around the agent. Since then, they have grown very quickly, raised a huge amount of investment and added a lot of functionality to their product.

At the end of 2015 we released sd-agent v2, which brought many of the improvements from the Datadog fork back into the Server Density agent (although we decided to package plugins independently, rather than bundling everything into one distribution, so you can maintain a lightweight installation and update components separately).

    We think Server Density is a great alternative to Datadog, and there are a few features in particular which make Server Density vs Datadog an interesting comparison.

    Mobile apps for iOS and Android

Server Density has native server monitoring apps available for iPhone on the Apple App Store and for Android on the Google Play Store. These allow you to receive push notifications directly to your device so you can stay up to date on the move. One of our customers, SugarCube, even uses this as a selling point to show their customers that they’re monitoring things all the time!

    Server Density vs Datadog: iPhone server monitoring

    We also have a free app for iOS called Cloud Status App which allows you to monitor the status of all the major cloud providers, and get push notifications if they post any status updates.

    Tag based user permissions

Server Density uses tags to let you choose which users have access to specific resources. You can add as many users as you wish and, by using tags, control whether they can see particular servers and availability monitors. Uses for this range from reselling monitoring to your customers, through to giving development teams access to particular servers while the operations team maintains a complete overview.

    Server Density vs Datadog: Tag based user permissions

    TV NOC dashboards with Apple TV

Displaying dashboards on a TV in your office or in a NOC is a common use case we see from customers like Tooplay. However, it is a pain having to set up a TV with a standalone computer, and attempting to use the browsers built into TVs is never a good experience. An Apple TV is a quick, easy and relatively cheap way to get your dashboards onto a big screen, made possible by our native Apple TV app.

    Server Density vs Datadog: Apple TV server monitoring

    Slackbot for chatops

    Our Slackbot allows you to ask questions about the state of your systems. Request graphs and check alerts from within Slack.

    Server Density vs Datadog: Server monitoring Slackbot

    Website and API availability monitoring

    By running monitoring nodes in locations all over the world, you can use Server Density to quickly configure HTTP and TCP availability checks to monitor website, application and API response time and uptime. We run the monitoring locations for you, so you can get an external perspective of your customer experience.
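
    Conceptually, each check is a timed request made from a remote location. The sketch below is not the Server Density implementation, just an illustration of what an HTTP availability check records.

    import requests

    def check_http(url, timeout=10):
        """Return whether the URL responded OK and how long it took."""
        try:
            resp = requests.get(url, timeout=timeout)
            return {
                "up": resp.ok,
                "status": resp.status_code,
                "response_time_ms": resp.elapsed.total_seconds() * 1000,
            }
        except requests.RequestException as exc:
            return {"up": False, "error": str(exc)}

    print(check_http("https://example.com"))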

    We have monitoring locations in: Australia, Brazil, Chile, China, France, Germany, Hong Kong, Iceland, Ireland, Italy, Japan, The Netherlands, New Zealand, Russia, Singapore, South Africa, Spain, Sweden, UK and USA.

    HumanOps

    We started the HumanOps community in 2016 to encourage the operations community to discuss the human aspects of running infrastructure. This has resulted in events around the world, including the UK, US, France, Germany, Poland and more. Companies such as Spotify, PagerDuty, Yelp and Facebook have contributed to sharing ideas and best practices for life on call, dealing with technical debt, fatigue and stress.

    Not only that, but we’re building features inspired by HumanOps, such as our Alert Costs functionality that reports on how much time alerts are wasting for your team.

    We’re building more functionality to help teams implement HumanOps principles in their own company – a journey unique to Server Density.

    Try Server Density

These are the key features we think make us stand out when comparing Server Density vs Datadog, but the best way to decide is to try the product yourself!

Whether you’re after a less complex alternative, as Firebox was, or you don’t want to deal with managing your own open source monitoring, like furryLogic, Server Density is a great choice.

    Sign up for a free trial.

  4. Saving $500k per month buying your own hardware: cloud vs co-location


    Editor’s note: This is an updated version of an article originally published on GigaOm on 07/12/2013.

    A few weeks ago we compared cloud instances against dedicated servers. We also explored various scenarios where it can be significantly cheaper to use dedicated servers instead of cloud services.

But that’s not the end of it. Since you are still paying on a monthly basis, projecting the costs out over 1 to 3 years shows you end up paying much more than it would have cost to purchase the hardware outright. This is where buying and co-locating your own hardware becomes a more attractive option.

    Putting the numbers down: cloud vs co-location

Let’s consider the case of a high throughput database hosted on suitably specified machines: cloud instances, rented dedicated servers, and a purchased, co-located server. For dedicated instances, Amazon has a separate fee structure, and on Rackspace you effectively have to get their largest instance type.

    So, calculating those costs out for our database instance on an annual basis would look like this:

    Amazon EC2 c3.4xlarge dedicated heavy utilization reserved
    Pricing for 1-year term
    $4,785 upfront cost
    $0.546 effective hourly cost
    $2 per hour, per region additional cost
    $4,785 + ($0.546 + $2.00) * 24 * 365 = $27,087.96

    Rackspace OnMetal I/O
    Pricing for 1-year term
    $2.46575 hourly cost
    $0.06849 additional hourly cost for managed infrastructure
    Total Hourly Cost: $2.53424
    $2.53424 * 24 * 365 = $22,199.94

    Softlayer

Given the annual cost of these instances, it makes sense to consider dedicated hardware, where you rent the resources and the provider is responsible for upkeep. Here at Server Density we use Softlayer, now owned by IBM, and have dedicated hardware for our database nodes. IBM is becoming very competitive with Amazon and Rackspace, so let’s add a similarly spec’d dedicated server from SoftLayer, at list prices. To match a similar spec we can choose the Monthly Bare Metal Dual Processor (Xeon E5-2620 – 2.0Ghz, 32GB RAM, 500GB storage), which costs $491/month, or $5,892/year.

    Dedicated servers summary

Rackspace Cloud: $22,199.94/year
    Amazon EC2: $27,087.96/year
    Softlayer Dedicated: $5,892/year

    Let’s also assume purchase and colocation of a Dell PowerEdge R430 (two 8-core processors, 32GB RAM, 1TB SATA disk drive).

    The R430 one-time list price is $3,774.45 – some 36% off the price of the SoftLayer server at $5,892/year. Of course there might be some more usage expenses such as power and bandwidth, depending on where you choose to colocate your server. Power usage in particular is difficult to calculate because you’d need to stress test the server, figure out the maximum draw and run real workloads to see what your normal usage is.

    Running our own hardware

We have experimented with running our own hardware in London. To draw some conclusions, we used a 1U Dell server with specs very similar to the Dell R430 above. Under everyday usage, the server draws close to 0.6A; stress testing with everything maxed out pushed that to 1.2A.

    Hosting this with the ISP who supplies our office works out at $140/month or $1,680/year. This makes the total annual cost figures look as follows:

Rackspace Cloud: $22,199.94/year
    Amazon EC2: $27,087.96/year
    Softlayer Dedicated: $5,892/year
    Co-location: $5,454.45 in year one, then $1,680/year

    With Rackspace, Amazon and SoftLayer you’d have to pay the above price every year. With co-location, on the other hand, after the first year the annual cost drops to $1,680 because you already own the hardware. What’s more, the hardware can also be considered an asset yielding tax benefits.
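
    To see how quickly the numbers diverge, here is the same arithmetic as a short script, using the figures above (hardware purchase plus hosting in year one, then hosting only, versus the same rental fee every year):

    hardware = 3774.45      # Dell R430 one-time list price
    colo_hosting = 1680.00  # $140/month with the ISP who supplies our office
    softlayer = 5892.00     # rented dedicated server, per year

    for years in (1, 2, 3):
        colo_total = hardware + colo_hosting * years
        rented_total = softlayer * years
        print(f"{years} year(s): colo ${colo_total:,.2f} vs rented ${rented_total:,.2f}")

    # 1 year(s): colo $5,454.45 vs rented $5,892.00
    # 2 year(s): colo $7,134.45 vs rented $11,784.00
    # 3 year(s): colo $8,814.45 vs rented $17,676.00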

    Large scale implementation

    While we were still experimenting on a small scale, I spoke to Mark Schliemann, who back then was VP of Technical Operations at Moz.com. They’d been running a hybrid environment and they had recently moved the majority of their environment off AWS and into a colo facility with Nimbix. Still, they kept using AWS for processing batch jobs (the perfect use case for elastic cloud resources).

    Moz worked on detailed cost comparisons to factor in the cost of the hardware leases (routers, switches, firewalls, load balancers, SAN/NAS storage & VPN), virtualization platforms, misc software, monitoring software/services, connectivity/bandwidth, vendor support, colo, and even travel costs. Using this to calculate their per server costs meant that on AWS they would spend $3,200/m vs. $668/m with their own hardware. Their calculations resulted in costs of $8,096 vs. $38,400 at AWS, projecting out 1 year.

    Optimizing utilization is much more difficult on the cloud because of the fixed instance sizes. Moz found they were much more efficient running their own systems virtualized because they could create the exact instance sizes they needed. Cloud providers often increase CPU allocation alongside memory whereas most use cases tend to rely on either one or the other. Running your own environment allows you to optimize this balance, and this was one of the key ways Moz improved their utilization metrics. This has helped them become more efficient with their spending.

    Here is what Mark told me: “Right now we are able to demonstrate that our colo is about 1/5th the cost of Amazon, but with RAM upgrades to our servers to increase capacity we are confident we can drive this down to something closer to 1/7th the cost of Amazon.”

    Co-location has its benefits, once you’re established

    Co-location looks like a winner but there are some important caveats:

    • First and foremost, you need in-house expertise because you need to build and rack your own equipment and design the network. Networking hardware can be expensive, and if things go wrong your team needs to have the capacity and skills to resolve any problems. This could involve support contracts with vendors and/or training your own staff. However, it does not usually require hiring new people because the same team that deals with cloud architecture, redundancy, failover, APIs, programming, etc, can also work on the ops side of things running your own environment.
    • The data centers chosen have to be easily accessible 24/7 because you may need to visit at unusual times. This means having people on-call and available to travel, or paying remote hands at the data center high hourly fees to fix things.
    • You have to purchase the equipment upfront which means large capital outlay (although this can be mitigated by leasing.)

    So what does this mean for the cloud? On a pure cost basis, buying your own hardware and colocating is significantly cheaper. Many will say that the real cost is hidden in staffing requirements but that’s not the case because you still need a technical team to build your cloud infrastructure.

    At a basic level, compute and storage are commodities. The way the cloud providers differentiate is with supporting services. Amazon has been able to iterate very quickly on innovative features, offering a range of supporting products like DNS, mail, queuing, databases, auto scaling and the like. Rackspace was slower to do this but has already started to offer similar features.

The flexibility of the cloud needs to be highlighted again too. Once you buy hardware, you’re stuck with it for the long term; the point of the example above is that you have a known workload.

    Considering the hybrid model

Perhaps a hybrid model makes sense, then? This, I believe, is a good middle ground, and I saw Moz making good use of such a model. You can service your known workloads with dedicated servers and then connect to the public cloud when you need extra flexibility. Data centers like Equinix offer Direct Connect services into the big cloud providers for this very reason, and SoftLayer offers its own public cloud to sit alongside dedicated instances. Rackspace is placing bets in all camps with public cloud, traditional managed hosting, a hybrid of the two, and support services for OpenStack.

And when should you consider switching? Nnamdi Orakwue, Dell VP of Cloud until late 2015, said companies often start looking at alternatives when their monthly AWS bill hits $50,000, but is even this figure too high?

  5. Datacenter efficiency and its effect on Humans


    Did you know?

About 2 percent of world energy expenditure goes into datacenters. That’s according to Anne Curie, co-founder of Microscaling Systems, who spoke at the most recent HumanOps event here in London.

That 2 percent is on par with the aviation industry which, as Curie points out, very publicly gets plenty of flak for being a serious polluter, even though aviation is far more efficient than the datacenter industry average.

Curie starts her talk with some good news. To a large extent, all the tech progress achieved over the last 20 years has gone into improving the lives of developers and ops people alike. The cloud takes away the pain of deploying new machines, while higher level languages like Ruby and Python make development much quicker and less painful.

    We optimize for speed of deployment and we optimize for developer productivity. We use an awful lot of Moore’s Law gains in order to do that.

    Anne Curie

    Enter datacenter efficiency

    But there is a caveat to all that progress. Suddenly all of that motivation you had for using your servers more efficiently is gone because somebody else is maintaining those servers for you. You don’t have to worry about where they are, you don’t have to lug them, you don’t even have to order them or find space for them.

    Anne Curie offers some fascinating insights on what all this progress means for humans, their systems, and the environment overall.

    Want to find out more? Watch Anne Curie’s talk. And if you want the full transcript (it’s a keeper), go ahead and use the download link right below this post.

    What is HumanOps again?

HumanOps is a collection of principles that shifts our focus away from systems and towards humans. It starts from a basic conviction, namely that technology affects the wellbeing of humans just as humans affect the reliable operation of technology.

    Alert Costs is one such feature. Built right into Server Density, Alert Costs measures the impact of alerts in actual human hours. Armed with this knowledge, a sysadmin can then look for ways to reduce interruptions, mitigate alert fatigue, and improve everyone’s on-call shift.

Find out more about Alert Costs, and see you at our next HumanOps event.

  6. Automatic timezone conversion in JavaScript


    Editor’s note: This is an updated version of an article originally published here on 21/01/2010.

It’s been a while since JavaScript charts and graphs became the go-to industry norm for data visualization. In fact, we decided to build our own graphing engine for Server Density several years ago, because we needed functionality that was not possible with the Flash charts we used previously. It also allowed us to customize the experience to better fit our own design.

Since then we’ve been revamping the entire engine. Our latest charts take advantage of modern JavaScript to offer features such as toggling line series, pinning extended info and more.


    Switching to a new graphing engine was no painless journey of course. JS comes with its own challenges, one of which is automatic timezone conversion.

    Timezones are a pain

    Timezone conversion is one of the issues you should always expect to deal with when building JS applications targeted at clients in varying timezones. Here is what we had to deal with.

    Our new engine supports user preferences with timezones. We do all the timezone calculations server-side and pass JSON data to the Javascript graphs, with the timestamps for each point already converted.

    However, it turns out that the JavaScript Date object does its own client-side timezone conversion based on the user’s system timezone settings. This means that if the default date on the graph is 10:00 GMT and your local system timezone is Paris, then JavaScript will automatically change that to 11:00 GMT.

    This only works when the timestamp passed is in GMT. So it presents a problem when we have already done the timezone conversion server-side, i.e. the conversion will be calculated twice – first on the server, then again on the client.

    We could allow JavaScript to handle timezones and perform all the conversions. However, this would result in messed up links, because we used data points to redirect the user to the actual snapshots.

    Snapshots are provided in Unix timestamp format, so even if the JS did the conversion, the snapshot timestamp would still be incorrect. To completely remove the server side conversion and rely solely on JS would require more changes and a lot more JS within the interface.

    UTC-based workaround

    As such, we modified our getDate function to return the values in UTC—at least it is UTC as far as JS is concerned but in reality we’d have already done the conversion on the server. This effectively disables the JavaScript timezone conversion.

    The following code snippet converts the Unix timestamp in JavaScript provided by the server into a date representation that we can use to display in the charts:

    getDate: function(timestamp)
    {
    // Multiply by 1000 because JS works in milliseconds instead of the UNIX seconds
    var date = new Date(timestamp * 1000);
    
    var year = date.getUTCFullYear();
var month = date.getUTCMonth() + 1; // getUTCMonth() is zero-indexed, so we'll increment to get the correct month number
    var day = date.getUTCDate();
    var hours = date.getUTCHours();
    var minutes = date.getUTCMinutes();
    var seconds = date.getUTCSeconds();
    
    month = (month < 10) ? '0' + month : month;
    day = (day < 10) ? '0' + day : day;
    hours = (hours < 10) ? '0' + hours : hours;
    minutes = (minutes < 10) ? '0' + minutes : minutes;
    seconds = (seconds < 10) ? '0' + seconds: seconds;
    
    return year + '-' + month + '-' + day + ' ' + hours + ':' + minutes;
    }

    So this is how we handle timezone with JavaScript for the Server Density graphing engine. What is your experience with timezones in JavaScript?

  7. How GOV.UK Reduced their Incidents and Alerts


    Did you watch last week’s HumanOps video—the one with Spotify? How about the one with Barclays?

Keep reading, gentle reader; this is not some Friends episode potboiler joke. We just can’t help getting pumped up by all the amazing HumanOps work that’s happening out there. Independent third-party events are now taking place around the world (San Francisco and Poznan most recently).

    So we decided to host another one closer to home in London.

    The event will take place at the Facebook HQ (get your invite). And for those of you who are not around London in November, fear not. We’ll fill you in right here at the Server Density blog.

    In the meantime, let’s take a look at the recent GOV.UK HumanOps talk. GOV.UK is the UK government’s digital portal. Millions of people access GOV.UK every single day whenever they need to interact with the UK government.

Bob Walker, Head of Web Operations, spoke about their recent efforts to reduce incidents and alerts (a core tenet of HumanOps). What follows are the key take-aways from his talk. You can also watch the entire video or download it in PDF format to read in your own time (see right below the article).

    GOV.UK does HumanOps

    After extensive rationalisation, GOV.UK have reached a stage where only 6 types of incidents can alert (wake them up) out of hours. The rest can wait until next morning.

    GOV.UK mirrors their website across disparate geographical locations and operates a managed CDN at the front. As a result, even if parts of their infrastructure fail, most of their website should remain available.

Once issues are resolved, GOV.UK carries out incident reviews (their own flavour of postmortems), and Bob reiterated the importance of keeping them blameless.

    Every Wednesday at 11:00AM they test their paging system. The purpose of this exercise is to not only test their monitoring system but also to ensure people have configured their phones to receive alerts!

    Want to find out more? Watch Bob Walker’s talk. And if you want the full transcript, go ahead and use the download link right below this post.

See you at a HumanOps event!

  8. Spotify Engineering: Making Ops Human


If you’ve read us for a while, then you’ve probably heard us sing the praises of HumanOps: a set of principles that shifts our focus away from systems and towards humans, in equal measure.

    As it turns out, Server Density is not the only team out there getting excited about HumanOps. We recently wrote about Portia Tung from Barclays and all the exciting things she’s been working on.

    Today we’d like to shift our gaze to Spotify and Francesc Zacarias, one of their lead site availability engineers.

What follows are the key take-aways from his HumanOps talk. You can watch the entire video (scroll down) or download it in PDF format to read in your own time (see below the article).

    Spotify Engineering goes HumanOps

    According to Francesc, Spotify Engineering is a cross-functional organisation. What this means is that each engineering team includes members from disparate functions. What this also means is that each team is able to fully own the service they run in its entirety.

Spotify is growing fast. From 100 services running on 1,300 servers in 2011, they now have 1,400 services running on 10,000 servers.

    In the past, the Spotify Ops team was responsible for hundreds of services. Given how small their team was (a handful of engineers) and how quickly new services were appearing, their Ops team was turning into a bottleneck for the entire organisation.

    While every member of the Ops team was an expert in their own specific area, there was no sharing between Ops engineers, or across the rest of the engineering organisation.

    You were paged on a service you didn’t know existed because someone deployed and forgot to tell you.

    Francesc Zacarias, Spotify Engineering

    Under the new Spotify structure, developers now own their services. In true devops fashion, building something is no longer separate from running it. Developers control the entire lifecycle including operational tasks like backup, monitoring and, of course, on call rotation.

    This change required a significant cultural shift. Several folks were sceptical about this change while others braced themselves for unmitigated disaster.

    In most instances however it was a case of “trust but verify.” Everyone had to trust their colleagues, otherwise the new structure wouldn’t take off.

    Now both teams move faster.

    Operations are no longer blocking developers as the latter handle all incidents pertaining to their own services. They are more aware of the pitfalls of running code in production because they are the ones handling production incidents (waking up to alerts, et cetera).

    Want to find out more? Check out the Spotify Labs engineering blog. And if you want to take the Spotify talk with you to read at your own pace, just use the download link below.

  9. Cloud vs dedicated pricing – which is cheaper?


    Editor’s note: This is an updated version of an article originally published on GigaOm on 29/11/2013.

Using cloud infrastructure is the natural starting point for any new project because it covers one of the two ideal use cases for the cloud: unknown requirements. The other is where you need elasticity, whether to run workloads for short periods at large scale or to handle traffic spikes. The problem comes months later, when you know your baseline resource requirements.

    As an example, let’s consider a high throughput database like the one we use here at Server Density. Most web applications have a database storing customer information behind the scenes but whatever the project, requirements are very similar – you need a lot of memory and high performance disk I/O.

    Evaluating pure cloud

    Looking at the costs for a single instance illustrates the requirements. In the real world you would need multiple instances for redundancy and replication but for now, let’s just work with a single instance.

    Amazon EC2 c3.4xlarge (30GB RAM, 2 x 160GB SSD storage)

    Pricing:

    $4,350 upfront cost
    
    $0.497 effective hourly cost

    Rackspace I/O1-30 (30GB RAM, 300GB SSD Storage)

    Pricing:

    $0.96/hr + $0.15/hr for managed infrastructure = $1.11/hr

    Databases also tend to exist for a long time and so don’t generally fit into the elastic model. This means you can’t take advantage of the hourly or minute based pricing that makes cloud infrastructure cheap in short bursts.

    So extend those costs on an annual basis:

    Amazon EC2 c3.4xlarge

    $4,350 + ($0.497 * 24 * 365) = $8,703.72

    Rackspace I/O1-30

    $1.11 * 24 * 365 = $9,723.60

    Dedicated Servers/Instances

    Another issue with databases is they tend not to behave nicely if you’re contending for I/O on a busy host, so both Rackspace and Amazon let you pay for dedicated instances. On Amazon this has a separate fee structure and on Rackspace you effectively have to get their largest instance type.

So, calculating those costs out for our database instance on an annual basis would look like this:

    Amazon EC2 c3.4xlarge dedicated heavy utilization reserved. Pricing for 1-year term:

    $4,785 upfront cost
    
    $0.546 effective hourly cost
    
    $2 per hour, per region additional cost
    
    $4,785 + ($0.546 + $2.00) * 24 * 365 = $27,087.96

    Rackspace OnMetal I/O Pricing for 1-year term:

    $2.46575 hourly cost
    
    $0.06849 additional hourly cost for managed infrastructure
    
    Total Hourly Cost: $2.53424
    
    $2.53424 * 24 * 365 = $22,199.94

    Consider the dedicated hardware option…

    Given the annual cost of these instances, the next logical step is to consider dedicated hardware where you rent the resources and the provider is responsible for upkeep. Here at Server Density, we use Softlayer, now owned by IBM, and have dedicated hardware for our database nodes. IBM is becoming very competitive with Amazon and Rackspace so let’s add a similarly spec’d dedicated server from SoftLayer, at list prices:

    To match a similar spec we can choose the Monthly Bare Metal Dual Processor (Xeon E5-2620 – 2.0Ghz, 32GB RAM, 500GB storage). This costs $491/month or $5,892/year. This is 78.25 percent cheaper than Amazon and 73.46 percent cheaper than Rackspace before you add data transfer costs – SoftLayer includes 500GB of public outbound data transfer per month which would cost extra on both Amazon and Rackspace.
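
    The arithmetic behind those percentages is straightforward. Here it is as a short script using the list prices above:

    HOURS_PER_YEAR = 24 * 365

    ec2 = 4785 + (0.546 + 2.00) * HOURS_PER_YEAR      # upfront + hourly + per-region dedicated fee
    rackspace = (2.46575 + 0.06849) * HOURS_PER_YEAR  # hourly + managed infrastructure
    softlayer = 491 * 12                              # monthly bare metal server

    print(f"Amazon EC2: ${ec2:,.2f}")        # $27,087.96
    print(f"Rackspace:  ${rackspace:,.2f}")  # $22,199.94
    print(f"SoftLayer:  ${softlayer:,}")     # $5,892
    print(f"Saving vs Amazon:    {1 - softlayer / ec2:.2%}")        # 78.25%
    print(f"Saving vs Rackspace: {1 - softlayer / rackspace:.2%}")  # 73.46%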

    … or buy your own

    There is another step you can take as you continue to grow — purchasing your own hardware and renting datacenter space i.e. colocation. But that’s the subject of a different post altogether so make sure you subscribe.

  10. Handling timezone conversion with PHP DateTime


    Editor’s note: This is an updated version of an article originally published on 21/03/2009.

Back in 2009 we introduced a new location preference feature for Server Density. Users could now specify their desired location, and all dates/times were automatically converted to their timezone (including handling of DST). We did that by using the DateTime class that was introduced with PHP 5.2.

Your very first challenge when dealing with timezones is how they are calculated relative to the server’s default timezone setting. Since PHP 5.1, all the date/time functions create times in the server’s timezone, and as of PHP 5.2 you can set that timezone programmatically using the date_default_timezone_set() function.

    So, if you call the date() function without specifying a timestamp as the second parameter, and the timezone is set to GMT, the date will default to the +0000 timezone. Equally, if you set the timezone to New York, the offset will be -0500 in winter (-0400 in summer).

    The ins and outs of handling timezone conversion

    If you want the date in GMT, you need to know the offset of the date you’re working with so you can convert it to +0000, if necessary. When would you need to do this? Well, the MySQL TIMESTAMP field type stores the timestamp internally, using GMT (UTC), but always returns it in the server’s timezone. So, for any SELECT statements you will need to convert the timestamp you pass in your SQL to UTC.

    This might sound complicated but you can let the DateTime class do most of the hard work. You first need to get the user to specify their timezone. This will be attached to any DateTime object you create so the right offset can be calculated. The PHP manual provides a list of all the acceptable timezone strings.

    There is also a PHP function that outputs the list of timezones. Server Density uses this to generate a list of timezones as a drop-down menu for the user to select from.

    DateTimeZone Object

    Once you have the user’s timezone, you can create a DateTimeZone object from it. This will be used for all the offset calculations.

    $userTimezone = new DateTimeZone($userSubmittedTimezoneString);

    To convert a date/time into the user’s timezone, you simply need to create it as a DateTime object:

    $myDateTime = new DateTime('2016-03-21 13:14');

    This will create a DateTime object which has the time specified. The parameter accepts any format supported by strtotime(). If you leave it empty it will default to “now”.

    Note that the time created will be in the default timezone of the server. This is relevant because the calculated offset will be relative to that timezone. For example, if the server is on GMT and you want to convert to Paris time, it will require adding 1 hour. However, if the server is in the Paris timezone then the offset will be zero. You can force the timezone that you want $myDateTime to be in by specifying the second parameter as a DateTimeZone object. If,  for example, you wanted it to be 13:14 on 21st March 2016 in GMT, you’d need to use this code or something similar:

    $gmtTimezone = new DateTimeZone('GMT');
    $myDateTime = new DateTime('2016-03-21 13:14', $gmtTimezone);

    To double check, you can run:

    echo $myDateTime->format('r');

    which would output Mon, 21 Mar 2016 13:14:00 +0000.

    The final step is to work out the offset from your DateTime object to the user’s timezone so you can convert it to that timezone. This is where the $userTimezone DateTimeZone object comes in (because we use the getOffset() method):

    $offset = $userTimezone->getOffset($myDateTime);

    This will return the number of seconds you need to add to $myDateTime to convert it into the user’s timezone. Therefore:

    $userTimezone = new DateTimeZone('America/New_York');
    $gmtTimezone = new DateTimeZone('GMT');
    $myDateTime = new DateTime('2016-03-21 13:14', $gmtTimezone);
    $offset = $userTimezone->getOffset($myDateTime);
    echo $offset;

This will print -14400, i.e. minus 4 hours (New York was on DST on that date).

    DateTime::add

As of PHP 5.3, you can also use the DateTime::add method to create the new date simply by adding the offset. So:

    $userTimezone = new DateTimeZone('America/New_York');
    $gmtTimezone = new DateTimeZone('GMT');
    $myDateTime = new DateTime('2016-03-21 13:14', $gmtTimezone);
    $offset = $userTimezone->getOffset($myDateTime);
$myInterval = DateInterval::createFromDateString((string)$offset . ' seconds');
    $myDateTime->add($myInterval);
    $result = $myDateTime->format('Y-m-d H:i:s');
    echo $result;

The above would output 2016-03-21 09:14:00, which is the correct conversion from 2016-03-21 13:14 London GMT to New York time.

    So that’s how we handle PHP timezones at Server Density. What’s your approach?
