Time series data with OpenTSDB + Google Cloud Bigtable
CEO & Founder of Server Density.
Published on the 15th March, 2017.
For the last 6 years, we’ve used MongoDB as our time series datastore for graphing and metrics storage in Server Density. It has scaled well over the years and you can read about our setup here and here, but last year we came to the decision to replace it with Google Cloud Bigtable.
As of today, the migration is complete and all customers are now reading from our time series setup running on Google Cloud. We successfully completed a migration with over 100,000 writes per second into a new architecture with a new database on a new cloud vendor with no downtime. Indeed, all customers should notice is even faster graphing performance!
I presented this journey at Google’s Cloud Next conference last week, so this post is a writeup of that talk, which you can watch below:
The old architecture
Our server monitoring agent posts data back from customer servers over HTTPS to our intake endpoint. From here, it is queued for processing. This was originally based on MongoDB as a very lightweight, redundant queuing system. The payload was processed against the alerting rules engine and any notifications sent. Then it was passed over to the MongoDB time series storage engine, custom written in PHP. Everything ran on high spec bare metal servers at Softlayer.
Over the years, we rewrote every part of the system except the core metrics service. We implemented a proper queue and alerts processing engine on top of Kafka and Storm, rewriting it in Python. But MongoDB scaled with us until about a year ago, when the issues that had been gradually growing began to cause real pain.
- Time sink. Whilst we used a product, MongoDB, it is designed as a general purpose database and we had to implement a custom API and schema to have it handle time series data efficiently. This was taking a lot of time to maintain.
- We want to build more. The metrics service was custom built and as a small team, we didn’t have the time to build basic time series features like aggregation and statistics functions. We were focused on other areas of the product without time to enhance basic graphing.
- Unpredictable scaling. Contrary to popular belief, MongoDB does scale! However, getting sharding working properly is complex and replica sets can be a pain to maintain. You have to be very careful to maintain sufficient overhead so when you add a new shard, migrations can take place without impacting the rest of the cluster. It’s also difficult to estimate resource usage and predict what is needed to continue to maintain performance.
- Expensive hardware. To ensure queries are fast, we had to maintain huge amounts of RAM so that commonly accessed data is in memory. SSDs are needed for the rest of the data – tests we did showed that HDDs were much too slow.
Finding a replacement
In early 2016 we decided to evaluate alternatives. After extensive testing and evaluation of a range of options including Cassandra, DynamoDB, Redshift, MySQL and even flat files, we picked OpenTSDB running on Google Cloud Bigtable as the storage engine.
- Managed service. Google Cloud Bigtable is fully managed. You simply choose the storage type (HDD or SSD) and how many nodes you want, and Google deals with everything else. We would no longer need to worry about hardware sizing, component failures, software upgrades or any other infrastructure management tasks.
- OpenTSDB is actively maintained. All the features we wanted right now, and also want to build into the product are available as standard with OpenTSDB. It is actively developed so new things are regularly released, which would mean we could add features with minimal effort. We have also contributed fixes back to the project because it is open source.
- Linear scalability. When you buy a Bigtable node, you get 10,000 reads/writes per second at 6ms 99th percentile latency. We can easily measure our throughput and calculate it on a per customer basis, so we know exactly when to scale the system. Deploying a new node takes 1 click and will be online within minutes. Contrast this with ordering new hardware, configuring it, deploying MongoDB replica sets, adding the shard and then waiting for data to rebalance. Bigtable gives us linear scalability of both cost and performance.
- Specialist datastore. MongoDB is a good general purpose database, but Bigtable is optimised specifically for our data format. It learns usage patterns, distributing data around the cluster to optimise performance. It’s much more efficient for this type of data so we can see significant performance and cost improvements.
The first challenge for the migration was that it needed to communicate across providers – moving from Softlayer to Google Cloud. We tested a few options but since Server Density is built using HTTP microservices and every service is independent, we decided to implement it entirely on Google Cloud, exposing the APIs over HTTPS restricted to our IPs. Payloads still come into Softlayer and are queued in Kafka, but they are then posted over the internet from Softlayer to the new metrics service running on Google Cloud. Client reads are the same.
We thought this might cause performance problems but in testing, we only saw a slight latency increase because we picked a Google region close to our primary Softlayer environment. We are in the process of migrating all our infrastructure to Google Cloud so this will only be a temporary situation anyway.
Our goal was to deploy the new system with zero downtime. We achieved this by implementing dual writes so Kafka queues up a write to the new system as well as the old system. All writes from a certain date went to both systems and we ran a migration process to backfill the data from the old system into the new one. As the migration completed, we flipped a feature flag for each customer so it gradually moved everyone over to the new system.
The new system looks like this:
Using the Google load balancers, we expose our metrics API which abstracts the OpenTSDB functionality so that it can be queried by our existing UI and API. OpenTSDB itself runs on Google Container Engine, connecting via the official drivers to Google Cloud Bigtable deployed across multiple zones for redundancy.
What did we end up with?
A linearly scalable system, high availability across multiple zones, new product features, lower operational overhead and lower costs.
As a customer, you should notice faster loading graphs (especially if you’re using our API) right away. Over the next few months we’ll be releasing new features that are enabled by this move, the first you may have already noticed as unofficially available – our v2 agent can backfill data when it loses network connectivity or cannot post back for some reason!