On shortened field names in MongoDB
Last night we got almost 10k unique visitors in just a few hours because of a post disputing a minor point in a post I wrote in July 2009. The point in question was our shortening of field names in MongoDB documents to save disk space:
This means things are much more flexible for future structure changes but it also means that every row records the field names. We had relatively long, descriptive names in MySQL such as timeAdded or valueCached. For a small number of rows, this extra storage only amounts to a few bytes per row, but when you have 10 million rows, each with maybe 100 bytes of field names, then you quickly eat up disk space unnecessarily. 100 * 10,000,000 = ~900MB just for field names!
We cut down the names to 2-3 characters. This is a little more confusing in the code but the disk storage savings are worth it. And if you use sensible names then it isn’t that bad e.g. timeAdded -> tA. A reduction to about 15 bytes per row at 10,000,000 rows means ~140MB for field names – a massive saving.
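As a rough check of the arithmetic in the quoted passage, the following Python sketch recomputes the totals. The function names are illustrative, and it assumes the "MB" figures are binary megabytes (1024 * 1024 bytes), which is why 100 bytes across 10 million rows comes out a little under 1,000.

```python
def field_name_bytes(bytes_per_row: int, row_count: int) -> int:
    """Total bytes spent on field names across all rows."""
    return bytes_per_row * row_count


def to_mib(byte_count: int) -> float:
    """Convert bytes to binary megabytes (assumed unit for the quoted figures)."""
    return byte_count / (1024 * 1024)


ROWS = 10_000_000

long_total = field_name_bytes(100, ROWS)   # descriptive names, ~100 bytes per row
short_total = field_name_bytes(15, ROWS)   # shortened names, ~15 bytes per row

print(f"long names:  {to_mib(long_total):.0f} MiB")
print(f"short names: {to_mib(short_total):.0f} MiB")
print(f"saved:       {to_mib(long_total - short_total):.0f} MiB")
```

The results land close to the quoted ~900MB and ~140MB, so the original estimates hold up as approximations.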
The author disputed the value of the savings by calculating the cost of the disk space (using commodity hardware) and comparing it to the cost of the time spent by a developer on a certain salary:
Let me do the math for a second, okay?
A two terabyte hard drive now costs 120 USD. By my math, that makes:
1 TB = 60 USD
1 GB = 0.058 USD
In other words, that massive saving that they are talking about? 5 cents!
Regardless of the trollish nature of the post, the maths errors and typos, this is a valid point assuming we purchase our own commodity hardware in a single server. However, there are some important things that were missed:
- My original post is 14 months old. At the time we were using a single slice from Slicehost, on their old pricing, where disk space was limited and expensive.
- The product had launched 1 month prior. We were entirely bootstrapped, and June was our first month of revenue from converting beta customers, so we had limited funds to buy upgrades.
- Even having only been live for a few months, we were storing millions of documents. MongoDB stores the field names in every document because its schema-less design means each document has to describe its own structure. Now we’re storing over a billion documents a month, so even a saving of 1 byte per document is almost 1GB a month. We knew we’d be scaling over the coming months so needed to account for that.
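To make the per-document cost concrete: BSON writes each field name as a null-terminated string inside every document, so a name costs its UTF-8 length plus one byte each time it appears. The sketch below is a minimal illustration under that assumption. The helper name `key_bytes` is hypothetical, and the one-billion figure is the monthly document volume mentioned above.

```python
def key_bytes(field_names):
    """Bytes spent on field names in one document.

    Assumes BSON's encoding of each name as UTF-8 bytes plus a trailing NUL.
    """
    return sum(len(name.encode("utf-8")) + 1 for name in field_names)


DOCS_PER_MONTH = 1_000_000_000  # "over a billion documents a month"
GIB = 1024 ** 3

long_cost = key_bytes(["timeAdded"])   # 9 characters + NUL
short_cost = key_bytes(["tA"])         # 2 characters + NUL
saved_per_doc = long_cost - short_cost

# Saving from renaming this single field across a month of documents.
saved_gib = saved_per_doc * DOCS_PER_MONTH / GIB

# Saving from trimming just one byte per document.
one_byte_gib = 1 * DOCS_PER_MONTH / GIB

print(f"per-document saving: {saved_per_doc} bytes")
print(f"monthly saving for one renamed field: {saved_gib:.1f} GiB")
print(f"monthly saving for one byte per document: {one_byte_gib:.2f} GiB")
```

This only counts the names themselves. Real documents carry many fields, so the total depends on how many names are shortened and by how much.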
The critique attracted a huge number of comments on the post itself, and also on Proggit, where it’s still in the top few posts after a day! Lots of good points were made. Perhaps the best is that I was concerned with disk space at the time (because that’s what we were bound by), but the real saving is RAM – MongoDB’s indexes should fit in memory for optimal performance.
Other good points were made:
- Commodity hardware is cheap. Server grade hardware is not. After Slicehost we moved to Rackspace and had five 15k RPM 300GB SAS drives in RAID 6. These are expensive.
- We have now moved to a private cloud with Terremark and use Fibre SANs. Pricing for these is around $1000 per TB per month.
- We are not using a single server – we have 4 servers per shard so the data is stored 4 times. See why here. Each shard holds 500GB of data, so that’s 4 x 500GB = 2TB = $2000 per month.
- You can use a low level field mapper to avoid any developer mistakes and ambiguities. At small volumes it is extra work but when you reach larger scales it could be worth it.
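A low level field mapper of the kind mentioned in the last point could look roughly like the sketch below. It is an illustration, not the code we run: `FIELD_MAP`, `to_storage` and `from_storage` are hypothetical names, the `valueCached` -> `vC` abbreviation is an assumed example (only `timeAdded` -> `tA` comes from the original post), and it only translates top-level keys.

```python
# Hypothetical mapping from the descriptive names used in application code
# to the short names actually stored in MongoDB documents.
FIELD_MAP = {
    "timeAdded": "tA",
    "valueCached": "vC",  # assumed abbreviation, for illustration only
}

# Reverse lookup for reading documents back out of the database.
REVERSE_MAP = {short: long for long, short in FIELD_MAP.items()}

# Two long names mapping to the same short name would silently lose data.
if len(REVERSE_MAP) != len(FIELD_MAP):
    raise ValueError("short field names must be unique")


def to_storage(doc: dict) -> dict:
    """Rename top-level keys to their short form before writing."""
    return {FIELD_MAP.get(key, key): value for key, value in doc.items()}


def from_storage(doc: dict) -> dict:
    """Rename top-level keys back to their descriptive form after reading."""
    return {REVERSE_MAP.get(key, key): value for key, value in doc.items()}


stored = to_storage({"timeAdded": 1250000000, "valueCached": 42, "_id": 1})
print(stored)
print(from_storage(stored))
```

Keeping the mapping in one place means application code only ever sees the descriptive names, and unmapped keys such as `_id` pass through unchanged. Nested documents and query filters would need the same translation, which this sketch does not attempt.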
Anyway, it was certainly fun to see the sudden spike when I checked our blog stats. I enjoyed how a tiny part of an old post was picked out and the ensuing discussion provided some great points about database server storage.
MongoDB and NoSQL are new (relative to older databases like MySQL), so these discussions are important: they let everyone share knowledge and figure out the best ways to deploy these new types of databases in production.