MongoDB performance: SSDs vs Spindle SAS Drives

By David Mytton,
CEO & Founder of Server Density.

Published on the 6th August, 2012.

For the storage of the historical time series data for our server monitoring service, Server Density, we have a cluster of MongoDB servers running across 2 data centres with Softlayer. There are 8 dedicated servers split into 4 shards with 2 nodes per shard (one per data centre). All 8 are of identical specification: Intel Xeon-SandyBridge E3-1270-Quadcore, 16GB RAM and 2Gbps networking running Ubuntu 10.04 LTS and MongoDB 2.0.6.
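
For anyone unfamiliar with how such a topology is assembled, the shards are registered with the cluster through mongos. Below is a minimal pymongo sketch of that step; the hostnames, shard names, database and collection names, and the shard key are hypothetical placeholders for illustration, not our production configuration.

    # Hypothetical sketch: registering a 4-shard topology (each shard a
    # 2-node replica set, one member per data centre) via mongos.
    # All names here are made up; the shard key choice is workload-specific.
    from pymongo import MongoClient

    mongos = MongoClient("mongodb://mongos.example.com:27017")
    admin = mongos.admin

    for i in range(4):
        admin.command(
            "addShard",
            "shard%d/dc1-node%d.example.com:27017,dc2-node%d.example.com:27017" % (i, i, i),
        )

    admin.command("enableSharding", "metrics")
    admin.command("shardCollection", "metrics.data_20120806", key={"_id": 1})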

We initially deployed these machines with 100GB SSDs, specifically the Micron RealSSD P300 (MTFDDAC100SAL-1N1AA). However, with data volume increasing significantly, we needed to decide how to scale our storage: add new shards, or upgrade the disks on the existing machines. We wanted to see what kind of MongoDB performance the SSDs were actually providing.

Larger SSDs are expensive, so we wanted to measure the performance impact of replacing the SSDs with spinning disks. This would give us more room on the cheaper vertical scaling option before we needed to add more machines/shards and scale horizontally (more expensive). We obviously expected some performance hit, but wanted actual metrics to understand the tradeoffs and make an informed decision.

So, we reprovisioned our MongoDB secondaries with Seagate Cheetah 15k SAS drives in RAID0 (speed, not redundancy) so we could test against real data (benchmarks against test/dummy data would not answer our questions). We chose these because they're the fastest spinning disks Softlayer offers, and they have the same interface speed as the SSDs (6Gb/s). These drives start at 73.4GB but go up to 600GB, which would allow us to keep the vertical scaling plan. We also set up another test with SSDs in RAID0 so we were comparing equivalent setups, both using an Adaptec RAID controller.

Note that RAID0 doesn’t provide redundancy so a disk failure would take the node offline. We achieve redundancy through the MongoDB replica set instead.

Our RAM configuration is such that we are able to hold both the indexes and the last 24 hours of data in memory. Most customers query the most recent data most often, so we're set up to return that fastest. Queries for time ranges before the last 24 hours page to disk, so disk performance is important to ensure users wait the minimum amount of time for their graphs to plot.
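
A rough way to sanity check that this working set fits in RAM is to compare index and recent data sizes against available memory. A minimal pymongo sketch, assuming a hypothetical "metrics" database with one collection per day:

    # Rough working-set check: do all indexes plus the most recent day's
    # data fit within RAM? Database and collection names are hypothetical.
    # Note: via mongos these stats aggregate across all shards, whereas RAM
    # is per node, so treat this as an approximation.
    from pymongo import MongoClient

    RAM_BYTES = 16 * 1024 ** 3  # 16GB per node

    client = MongoClient("mongodb://mongos.example.com:27017")
    db = client.metrics

    db_stats = db.command("dbStats")                    # includes total index size
    today = db.command("collStats", "data_20120806")    # today's collection

    working_set = db_stats["indexSize"] + today["size"]
    print("indexes + last 24h data: %.1f GB (RAM: %d GB)"
          % (working_set / 1024.0 ** 3, RAM_BYTES // 1024 ** 3))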

Since we wanted to test page fault query speed, and we split our data into a collection per day, we wrote some scripts to run queries against older collections to force the page faults. The script reads every document out of the collection through a mongos (so the query hits all shards), around 300,000 documents in total. In all cases, the single SSD was fastest, followed by the SSDs in RAID0, then the SAS disks.
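
We haven't published the exact scripts, but the approach was roughly the following; the connection string and collection name are placeholders, not the actual values we used:

    # Sketch of the test: read every document from an older daily collection
    # through mongos, forcing page faults because that data is no longer in
    # memory, and time how long the full iteration takes.
    import time
    from pymongo import MongoClient

    client = MongoClient("mongodb://mongos.example.com:27017")
    collection = client.metrics["data_20120701"]  # a day well outside the last 24 hours

    start = time.time()
    count = 0
    for doc in collection.find():  # full read, hits every shard via mongos
        count += 1
    elapsed = time.time() - start

    print("iterated %d documents in %.2f seconds" % (count, elapsed))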

[Chart: SSD vs 15k SAS for MongoDB]

The chart shows how long it took to iterate through every document (lower is better). With this data we can then ask the question: are the cost savings gained from not using SSDs worth making users wait an extra ~6 seconds for their data to load? If we were to make that cost saving, it also tells us where we can get quick performance improvements in the future, just by spending some more money (the easiest way, but not necessarily an option depending on the stage of your company!).


  • ahofmann

    Nice article!
    Just for clarity: 300,000k documents? Do you mean 300 thousand docs or 300 million docs? On which hardware was the RAID0 built? Or was it software RAID?

    • Sorry, that’s a typo – fixed to 300,000.

      It was hardware RAID, built using an Adaptec RAID controller.

  • David, nice article, thanks for the wealth of information.

    As we are in very similar shoes (just moving to SoftLayer and running a MongoDB cluster in production ourselves), I have some questions that I’m hoping you can answer:

    * Are you running the config servers on the same servers or separately?
    * Are you considering putting more RAM into the machines, or as long as your working set (indexes + data for last 24hrs) fits, is 16GB fine?
    * In case of a disk failure, wouldn't it be faster and easier to swap the drive and resync a RAID1 array instead of bringing up a new node? Wouldn't it be stuck in RECOVERING trying to catch up for a long time? If that takes hours, during that time you have a single point of failure and losing the other node means data loss, which sounds pretty risky. How fast are you able to add a new member?
    * Why choose the more expensive HW RAID controllers instead of SW RAID (mdadm) for RAID0?
    * Given that you are running in RAID0 and you only achieve data redundancy through the replica set, are you using safe writes? If yes, do you also read from secondaries for better read performance?
    * What was the reason you went with 2Gbps (bonding) instead of 1Gbps? Uplink redundancy? Additional speed? Do you think it is worth it? We’re currently 1Gbps on all dedicated and virtual instances.

    That’s about it for now, but I’m sure as we go along in the migration there will be a few other ones so if you don’t mind, I would love to post them here.

    Thanks,
    Gabor

    • * Are you running the config servers on the same servers or separately?

      We run these separately on Softlayer virtual servers. This is for ease of management, and so they can be spread across multiple data centres for redundancy.

      * Are you considering putting more RAM into the machines, or as long as your working set (indexes + data for last 24hrs) fits, is 16GB fine?

      We will probably upgrade the machines until we hit the limit of the hardware then scale out horizontally by adding more shards. However, it’s not just RAM that needs to be monitored as disk i/o is important too. No point adding more RAM if your disks are saturated.

      * In case of a disk failure, wouldn't it be faster and easier to swap the drive and resync a RAID1 array instead of bringing up a new node? Wouldn't it be stuck in RECOVERING trying to catch up for a long time? If that takes hours, during that time you have a single point of failure and losing the other node means data loss, which sounds pretty risky. How fast are you able to add a new member?

      It depends on the data volume and the type of disks you have. Rebuilding a RAID array always takes time and has a performance hit just like a resync. The way to completely mitigate this is to have at least 3 nodes per replica set. Then if one fails, you can bring it back up and it will sync from the other secondary, not impacting the master.
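
      To illustrate, a 3-member replica set is just a matter of listing a third member in the config. A minimal pymongo sketch, with made-up hostnames rather than our real ones:

      # Hypothetical 3-member replica set config: with a third member, a
      # rebuilt node can perform its initial sync from the other secondary
      # instead of the primary. Hostnames are illustrative only.
      from pymongo import MongoClient

      # connect directly to one member (pymongo 3.12+)
      node = MongoClient("mongodb://dc1-node0.example.com:27017", directConnection=True)
      config = {
          "_id": "shard0",
          "members": [
              {"_id": 0, "host": "dc1-node0.example.com:27017"},
              {"_id": 1, "host": "dc2-node0.example.com:27017"},
              {"_id": 2, "host": "dc1-node1.example.com:27017"},
          ],
      }
      node.admin.command("replSetInitiate", config)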

      * Why choose the more expensive HW RAID controllers instead of SW RAID (mdadm) for RAID0?

      These were included by Softlayer as standard so we didn't have to pay extra for them. I'd rather use a dedicated hardware device than have to configure software RAID myself. As for the actual advantages vs disadvantages, I didn't look into this because we didn't originally plan on using RAID at all. Some potential answers can be found here:

      http://www.centos.org/docs/4/html/rhel-sag-en-4/s1-raid-approaches.html
      http://www.cyberciti.biz/tips/raid-hardware-vs-raid-software.html
      http://backdrift.org/hardware-vs-software-raid-in-the-real-world-2
      http://www.adaptec.com/nr/rdonlyres/14b2fd84-f7a0-4ac5-a07a-214123ea3dd6/0/4423_sw_hwraid_10.pdf

      * Given that you are running in RAID0 and you only achieve data redundancy through the replica set, are you using safe writes? If yes, do you also read from secondaries for better read performance?

      We don’t use safe writes for the time series data because missing a few points doesn’t matter. This gets us very fast write performance at the cost of some durability. In practice, with MongoDB journaling enabled, this means there’s a 100ms window for data loss.
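
      In pymongo terms the trade-off looks roughly like this (using the current write-concern API; database and collection names are made up for illustration):

      # Sketch: unacknowledged writes (w=0) for high-volume time series
      # points we can afford to lose, acknowledged + journaled writes for
      # data we cannot. All names are illustrative.
      from pymongo import MongoClient
      from pymongo.write_concern import WriteConcern

      client = MongoClient("mongodb://mongos.example.com:27017")
      db = client.metrics

      # Fire-and-forget: fastest, but a crash can lose the most recent points.
      fast = db.get_collection("data_20120806", write_concern=WriteConcern(w=0))
      fast.insert_one({"host": "web1", "metric": "load", "value": 0.42})

      # Safe write: waits for acknowledgement and for the journal to be flushed.
      safe = db.get_collection("alerts", write_concern=WriteConcern(w=1, j=True))
      safe.insert_one({"host": "web1", "alert": "load high"})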

      * What was the reason you went with 2Gbps (bonding) instead of 1Gbps? Uplink redundancy? Additional speed? Do you think it is worth it? We’re currently 1Gbps on all dedicated and virtual instances.

      We wanted to remove the network as a potential bottleneck in all cases (by going for the maximum link speed offered). However, this is important specifically because we use replica sets for redundancy so when we need to do a resync, we want it to complete as quickly as possible. Data transfer rate is the limiting factor for most of the resync so having the fastest network connectivity helps here.

  • Gabor Ratky

    David,

    are you running with default syncdelay and/or dirty_expire_centisecs / dirty_writeback_centisecs? Do you see any performance impact when MongoDB or the OS flushes the dirty pages to disk?

    • It’s using all default settings and I didn’t monitor things like flushing dirty pages.

      • Gabor Ratky

        Did you ever see MongoDB locked for over a second, not accepting reads or writes? Basically a line in mongostat with 0’s everywhere… It happened to us when MongoDB was maxing out the HDDs while writing the data out.

        • Yeah, this can happen if you hit the capacity of your disks. You can confirm it by checking your disk i/o stats and seeing whether % utilisation hits 100% (Server Density will report this, for example; you can also use iostat at the command line, or see the sketch below).

          This should be improved in MongoDB 2.2 i.e. it won’t lock the whole server, just the database, but if you’re hitting disk capacity you need to look at upgrading the disks or sharding to spread the load.
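
          If you'd rather script the utilisation check than eyeball iostat, it can be derived from /proc/diskstats. A minimal sketch, assuming a hypothetical device name:

          # Minimal sketch: % disk utilisation over a sampling interval, read
          # from /proc/diskstats (the 10th value after the device name is the
          # milliseconds spent doing I/O). "sda" is an assumption; use your
          # RAID device's name.
          import time

          DEVICE = "sda"
          INTERVAL = 5.0  # seconds

          def io_ms(device):
              with open("/proc/diskstats") as f:
                  for line in f:
                      fields = line.split()
                      if fields[2] == device:
                          return int(fields[12])  # total ms spent doing I/O
              raise ValueError("device not found: %s" % device)

          before = io_ms(DEVICE)
          time.sleep(INTERVAL)
          after = io_ms(DEVICE)

          utilisation = (after - before) / (INTERVAL * 1000.0) * 100.0
          print("%s utilisation: %.1f%%" % (DEVICE, utilisation))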

  • Eh, how is it even possible for a single SSD to be faster than the SSDs in RAID0?

  • Roy Bellingan

    Thank you!
    Finally some REAL data about the 15K RPM vs SSD debate!
