On shortened field names in MongoDB

Last night we got almost 10k unique visitors in just a few hours because of a post disputing a minor point in a post I wrote in July 2009. The point in dispute was our shortening of the field names in our MongoDB documents to save disk space:

Schema-less

This means things are much more flexible for future structure changes, but it also means that every row records the field names. We had relatively long, descriptive names in MySQL such as timeAdded or valueCached. For a small number of rows, this extra storage only amounts to a few bytes per row, but when you have 10 million rows, each with maybe 100 bytes of field names, you quickly eat up disk space unnecessarily. 100 * 10,000,000 = ~900MB just for field names!

We cut down the names to 2-3 characters. This is a little more confusing in the code, but the disk storage savings are worth it. And if you use sensible names then it isn’t that bad, e.g. timeAdded -> tA. A reduction to about 15 bytes per row at 10,000,000 rows means ~140MB for field names – a massive saving.
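
To make that arithmetic concrete, here is a rough sketch in Python (the document and its field names are hypothetical, not our actual schema):

    # A hypothetical monitoring document with descriptive field names...
    long_doc = {"timeAdded": 1280664000, "valueCached": 0.72, "serverName": "web1"}

    # ...and the same document with the names shortened (timeAdded -> tA, etc.)
    short_doc = {"tA": 1280664000, "vC": 0.72, "sN": "web1"}

    def key_bytes(doc):
        # Bytes spent storing just the field names, once per document
        return sum(len(k) for k in doc)

    saving = key_bytes(long_doc) - key_bytes(short_doc)
    print(saving)                       # 24 bytes saved per document
    print(saving * 10_000_000 / 2**20)  # ~229MB saved across 10 million rows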

The author disputed the value of the savings by calculating the cost of the disk space (using commodity hardware) and comparing it to the cost of the developer time spent, at a typical salary:

Let me do the math for a second, okay?

A two terabyte hard drive now costs 120 USD. By my math, that makes:

1 TB = 60 USD
1 GB = 0.058 USD

In other words, that massive saving that they are talking about? 5 cents!

Setting aside the trollish tone of the post, its maths errors, and its typos, this is a valid point, assuming we purchase our own commodity hardware for a single server. However, there are some important things that were missed:

  • My post is 14 months old. At the time we were using a single slice from Slicehost under their old pricing, where disk space was limited and expensive.
  • The product had launched 1 month prior; we were entirely bootstrapped, and June was our first month of revenue from converting beta customers. As such, we had limited funds to buy upgrades.
  • Even having been live for only a few months, we were storing millions of documents. MongoDB stores the field names in every document because it is non-relational and has to in order to allow for its schema-less design. Now we’re storing over a billion documents a month, so even a saving of 1 byte per document is almost 1GB. We knew we’d be scaling over the coming months, so we needed to account for that.

This post got a huge number of comments, both on the post itself and on Proggit, where it’s still in the top few posts after a day! There were lots of good points made, and perhaps the best one is that while at the time I was concerned with disk space (because that’s what we were bound by), the real saving is RAM – MongoDB’s indexes should fit in memory for optimal performance.

Other good points were made:

  • Commodity hardware is cheap. Server grade hardware is not. After Slicehost we moved to Rackspace and had five 15k RPM 300GB SAS drives in RAID 6. These are expensive.
  • We have now moved to a private cloud with Terremark and use Fibre Channel SANs. Pricing for these is around $1,000 per TB per month.
  • We are not using a single server – we have 4 servers per shard, so the data is stored 4 times. See why here. Each shard holds 500GB of data in total, so that’s 2TB = $2,000 per month.
  • You can use a low level field mapper to avoid developer mistakes and ambiguities (see the sketch after this list). At small volumes it is extra work, but when you reach larger scales it could be worth it.
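
As a sketch of what such a mapper might look like, here is a minimal version in Python (the mapping table is hypothetical):

    # Hypothetical mapping from descriptive names to the short names stored on disk
    FIELD_MAP = {"timeAdded": "tA", "valueCached": "vC"}
    REVERSE_MAP = {short: full for full, short in FIELD_MAP.items()}

    def shorten(doc):
        # Translate descriptive keys to short keys just before writing to MongoDB
        return {FIELD_MAP.get(k, k): v for k, v in doc.items()}

    def expand(doc):
        # Translate short keys back to descriptive ones for application code
        return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

Application code then only ever refers to timeAdded; the mapper is the single place that knows about tA, which removes the ambiguity.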

Anyway, it was certainly fun to see the sudden spike when I checked our blog stats. I enjoyed how a tiny part of an old post was picked out, and how the ensuing discussion provided some great points about database server storage.

MongoDB and NoSQL are new (relative to older databases like MySQL), so these discussions are important so everyone can share knowledge and figure out the best ways to deploy these new types of databases into production.

  • Gabriel Lozano-Moran

    In all honesty, I believe Ayende is a d*ck. In most blog posts, all he does is bash other developers, pointing out how superior he is compared to the rest of us. He has been like that for years.

  • Marcus

    It would be interesting if you talked more about your architecture, number of servers monitored, and stuff like that. I assume you (actively) monitor about 1,000 servers. The intervals are every 5 minutes. So, the DB would only need to do about 17 writes a second. So, why do you need 15 servers? And why so much storage space?

    • http://www.serverdensity.com David Mytton

      We don’t comment on customer or server numbers but it’s far more complex than you’re assuming.

      • Marcus

        Understandable on the customers, but how about the writes needed to monitor a single server?

      • Joe

        Here is what I’m assuming: every new “ping” from my server to Server Density results in just one write, and all that data can easily be inserted into MongoDB with one query.

        Now the big question is: do all servers ping SD at the same point in time? Or does the sd-agent (upon install) pick a random time between 00:00:00 and 00:05:00 to send the data? I’m assuming that’s what they do; otherwise, if they have 1,000,000 servers monitored, that would result in a million queries at once, which could easily crash the server.

  • http://blogs.xingular.net/santiago Santiago Basulto

    Totally agree. Have you considered moving to something like Google App Engine? It has good prices.

    • http://www.serverdensity.com David Mytton

      We want to be able to completely control our infrastructure down to the server level.

  • http://blogs.xingular.net/santiago Santiago Basulto

    David, have you tested CouchDB? I need to implement a NoSQL DB, and MongoDB seems good. But reading some docs and articles, I’ve found that CouchDB could suit my needs well. The thing is: if you have tested it, why didn’t you use it, and what have you found to be good and bad about it?

    Thanks!

  • http://jseed.org Chip Kaye

    Hi David – thanks for your various posts on MongoDB, helpful and illuminating stuff for newcomers. Could you say something about how a “low level field mapper” might work? Giving up descriptive field names is a pain I’d want to avoid if at all possible, so I’d like to consider strategies for bridging the gap between field name length and optimised storage size. Thanks.

    • http://www.serverdensity.com David Mytton

      You would have an index of the names used in code and their corresponding MongoDB field names. Then you’d reference the full name in your code and do a replace at the time the query is sent to Mongo. It’d probably require some kind of wrapper over the top of the driver methods so you don’t execute them directly.
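
      A minimal sketch of that idea, using the modern PyMongo API (the mapping and database/collection names here are hypothetical):

          from pymongo import MongoClient

          FIELD_MAP = {"timeAdded": "tA", "valueCached": "vC"}  # hypothetical mapping
          REVERSE_MAP = {short: full for full, short in FIELD_MAP.items()}

          class MappedCollection:
              # Wrapper so code uses full names while MongoDB stores short ones
              def __init__(self, collection):
                  self._collection = collection

              def insert_one(self, doc):
                  # Replace full names with short names as the document is written
                  return self._collection.insert_one(
                      {FIELD_MAP.get(k, k): v for k, v in doc.items()})

              def find(self, query):
                  # Map the query fields, then map each result back to full names
                  short_query = {FIELD_MAP.get(k, k): v for k, v in query.items()}
                  for doc in self._collection.find(short_query):
                      yield {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

          # Usage (hypothetical database and collection):
          # metrics = MappedCollection(MongoClient().monitoring.metrics)
          # metrics.insert_one({"timeAdded": 1280664000})  # stored as {"tA": ...}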

  • nifan

    It’s not so much about the disk space; it’s about the caching impact that overhead has.

    Sure, disk space is relatively cheap, but, like you mentioned, it gets a lot more expensive per MB when you buy professional equipment or want good performance; a couple of SATA disks are just not going to get you many IOPS.

    Far more important is the caching impact of these saved bytes. Memory is expensive (for decent server hardware it’s doubly expensive), but it’s also very, very fast.

    These small key names mean more values will fit in the various caches (query cache / OS cache / RAID cache / disk cache, or SSD read caches if you have them), meaning faster queries.

    P.S. I am wondering if MongoDB is considering optimisations in this area.

  • Tilo