Choosing a non-relational database; why we migrated from MySQL to MongoDB

Choosing a Non-Relational Database

By David Mytton,
CEO & Founder of Server Density.

Published on the 25th July, 2009.

Update 23 Oct 2010: A discussion about the value of shortened field names has generated a lot of traffic to this post over the last 24 hours. A response has been posted here.

Update 28 Feb 2010: A followup was published here.

Until recently, our server monitoring application, Server Density, was running using MySQL for the backend. Although we primarily provide it as a hosted service, it has been written to work as a standalone application for customers that wish to install on their own servers. This means each customer had their own MySQL database.

We collect a lot of data – the monitoring agent reports back every 60 seconds and includes various statistics, of which the server snapshot has the most data (because it is collecting details on every running process). Over time, this results in millions of rows in the database, even for just 1 month of data, per server monitored.

Despite this huge amount of data, performance was not a problem. We were able to tune our queries and servers to ensure that results were returned quickly. The majority of queries are inserts with a minimal number of reads. We are also not doing any caching, yet. Once we reach that stage in our scaling plan there will be even fewer reads directly from the DB because the latest metric values will be stored in memory.

The problem we encountered was administrative. We wanted to scale using replication but found that MySQL had a hard time keeping up, especially with the initial sync. As such, backups became an issue, but we solved that. However, scaling MySQL onto multiple clustered servers as we plan to do in the future is difficult. You can either do this through replication, which is only really suited to read-heavy applications, or by using MySQL Cluster. The cluster looks very good but I have read about some problems with it and was unsure of its suitability for our needs.

The current fashion is using key/value stores, also known as non-relational database management systems (non-RDBMS) or schema-less databases. As such I did some research into the options available.

The different options

Our requirements were a relatively stable product with a decent access interface, persistent disk-based storage (many are in-memory only), a good community and some documentation. Based on that, the projects I reviewed were:

  • Cassandra
  • CouchDB
  • Hypertable
  • MongoDB
  • Tokyo Cabinet
  • Project Voldemort

There are several good blog posts around that go into more detail for each project.

I did not consider “cloud” databases such as Amazon SimpleDB because I thought the latency would be too high.

Why we chose MongoDB

I tested most of these with real data and eventually chose MongoDB for a number of reasons:

Implementation details

Switching from relational to non-relational is time consuming but it is not difficult. That said, there are differences that you won’t necessarily be expecting. Some of these are specific to MongoDB but some will apply generally too:


MongoDB is schema-less, so every document (row) stores its own field names. This makes things much more flexible for future structure changes, but it also means the names are repeated in every row. We had relatively long, descriptive names in MySQL such as timeAdded or valueCached. For a small number of rows, this extra storage only amounts to a few bytes per row, but when you have 10 million rows, each with maybe 100 bytes of field names, you quickly eat up disk space unnecessarily. 100 * 10,000,000 = ~900MB just for field names!

We cut down the names to 2-3 characters. This is a little more confusing in the code but the disk storage savings are worth it. And if you use sensible names then it isn’t that bad e.g. timeAdded -> tA. A reduction to about 15 bytes per row at 10,000,000 rows means ~140MB for field names – a massive saving.
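To keep the code readable despite the short names, the mapping can live in one place and be applied at the driver boundary. A minimal sketch in Python – the map and helper names are hypothetical, not our actual code:

```python
# Hypothetical long -> short key map. MongoDB stores the keys with every
# document, so shorter keys shrink every row on disk.
FIELD_MAP = {"timeAdded": "tA", "valueCached": "vC"}
REVERSE_MAP = {v: k for k, v in FIELD_MAP.items()}

def shorten(doc):
    """Translate descriptive names to short ones before inserting."""
    return {FIELD_MAP.get(k, k): v for k, v in doc.items()}

def expand(doc):
    """Translate short names back after reading, for readable code."""
    return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}

def key_bytes_saved(doc, rows):
    """Rough saving: bytes trimmed from the keys of one row * row count."""
    per_row = sum(len(k) - len(FIELD_MAP.get(k, k)) for k in doc)
    return per_row * rows

doc = {"timeAdded": 1248480000, "valueCached": 42}
assert expand(shorten(doc)) == doc  # round-trips losslessly
# 16 key bytes trimmed per row -> 160,000,000 bytes over 10M rows
assert key_bytes_saved(doc, 10_000_000) == 160_000_000
```

The same idea works in PHP: one array of long-to-short names, applied in the data access layer, so only that layer ever sees the cryptic keys.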

The database-per-customer method doesn’t work

MongoDB stores data in flat files using its own binary storage objects. This means that data storage is very compact and efficient, perfect for high data volumes. However, it allocates a set of files per database and pre-allocates those files on the filesystem for speed:

Each datafile is preallocated to a given size. (This is done to prevent file system fragmentation, among other reasons.) The first file for a database is .0, then .1, etc. .0 will be 64MB, .1 128MB, etc., up to 2GB. Once the files reach 2GB in size, each successive file is also 2GB.

Thus if the last datafile present is say, 1GB, that file might be 90% empty if it was recently reached.

This was a problem because MongoDB was frequently pre-allocating files whose space would almost never be needed – the data would rarely “flow” into the next file, or only into a tiny part of it. This is particularly the case with free accounts, where we clear out data after a month. Such pre-allocation caused large amounts of disk space to be used up.

We therefore changed our data structure so that we had a single DB, thus making the most efficient use of the available storage. There is no performance hit for doing this because the files are split out, unlike MySQL which uses a single file per table.
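The pre-allocation pattern from the quote above can be modelled to see where the space goes. A sketch, assuming each file doubles in size up to the 2GB cap as the quoted docs describe:

```python
MB = 1024 * 1024

def datafile_size(index):
    """Size of datafile <db>.<index>: 64MB, 128MB, ..., capped at 2GB."""
    return min(64 * MB * 2 ** index, 2048 * MB)

def preallocated_total(num_files):
    """Disk reserved once the first num_files files of a database exist."""
    return sum(datafile_size(i) for i in range(num_files))

# A per-customer database holding a few MB of data still reserves at
# least the 64MB first file; hundreds of small databases multiply that
# fixed overhead, which a single shared database amortises away.
assert datafile_size(0) == 64 * MB
assert datafile_size(5) == datafile_size(6) == 2048 * MB
```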

Unexpected locking and blocking

In MongoDB, removing rows locks and blocks the entire database. Adding indexes does the same. When we imported our data this caused problems: on large data sets the locks were held for some time until the indexing had completed. This is not a problem when you first create the “collection” (table, in MySQL terms) because there are only a few (or no) rows, but creating indexes later will cause problems.

Previously in MySQL we would delete rows by using a wide ranging WHERE clause, for example to delete rows by date range or server ID. Now in MongoDB we have to loop through all the rows and delete them individually. This is slower, but it prevents the locking issue.
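The pattern looks like this – sketched with an in-memory list standing in for a collection; with the real driver you would iterate a find() cursor and remove each document by its _id:

```python
def delete_individually(collection, predicate):
    """Delete matching rows one at a time rather than in one bulk
    operation, so no single call holds the database-wide lock for long."""
    # Snapshot the matches first (the find()), then remove one by one.
    matching = [doc for doc in collection if predicate(doc)]
    for doc in matching:
        collection.remove(doc)  # one short-lived delete per document
    return len(matching)

rows = [{"sId": 1, "tA": 10}, {"sId": 2, "tA": 20}, {"sId": 1, "tA": 30}]
deleted = delete_individually(rows, lambda d: d["sId"] == 1)
assert deleted == 2 and rows == [{"sId": 2, "tA": 20}]
```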


In MySQL, if a database (or more likely a few tables) becomes corrupt, you can repair the tables individually. In MongoDB, you have to repair at the database level. There is a command to do this, but it reads all the data and re-writes it to a new set of files. This means all data is checked, and you will probably have some disk space freed up as the files are compacted, but it also means the entire database is locked and blocked while it runs. With our database being around 60GB, this operation takes several hours.

Corruption will only really occur if you kill the database process whilst it is in the middle of an operation.


Our reasons for moving to MongoDB were not about performance; however, it has turned out that in many cases query times are significantly faster than with MySQL. This is because MongoDB keeps as much data in RAM as possible, so for cached data it becomes as fast as using something like memcached. Even non-cached data is very fast.

We don’t have any precise numbers but in some cases are seeing cached query times around 7ms and non-cached around 50-200ms, depending on the query. Indexes help speed up queries in many cases but where our data is write intensive, indexes can slow things down.

Having a native C PHP module also helps with performance and means that all interactions are optimised at the code level. Other drivers are available for Python, Java, Ruby, C++ and Perl.

Community / commercial support

MongoDB is open source but is developed by a New York company, 10gen. This is useful because we can be sure that development will continue and bugs fixed. Indeed, the mailing list has been a very useful source of help for us during the migration. The documentation is good but some things are still unclear or not documented and being able to get a response from the mailing list from the developers within hours is very helpful.

MongoDB is quite a new project compared to the likes of MySQL and so there are fewer experienced people in the community. As such, we have also taken up a support contract with 10gen for guaranteed 24/7 phone & e-mail coverage so that should we have a problem, we will be able to get help quickly.

Test, and choose what is right for your application

The guys at FriendFeed are using MySQL and they have a lot more data than us. However, they use it like a key/value store and have a different access ratio. Every application is different. Whilst MySQL is suitable for FriendFeed, we found a better solution. You need to test each one to discover its suitability for your needs.

Indeed, whilst Server Density is now running entirely on MongoDB, our accounts, invoicing and billing systems remain on MySQL. MongoDB has no multi-operation transactions. This doesn’t matter for our general application code – it’s not critical if a few rows do not get written – but it does matter for our billing system. We use transactions to ensure everything runs correctly (e.g. we don’t bill customers twice) and so are still using MySQL InnoDB for that.
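The guarantee we keep InnoDB around for is the classic all-or-nothing write. A sketch using Python’s bundled SQLite as a stand-in for MySQL – the tables are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (customer TEXT, amount INTEGER)")
conn.execute("CREATE TABLE ledger (customer TEXT, balance INTEGER)")
conn.execute("INSERT INTO ledger VALUES ('acme', 0)")

def bill(conn, customer, amount):
    """Record the invoice and update the balance atomically: either both
    statements commit or, on any error, both roll back."""
    with conn:  # transaction: commit on success, rollback on exception
        conn.execute("INSERT INTO invoices VALUES (?, ?)", (customer, amount))
        conn.execute(
            "UPDATE ledger SET balance = balance + ? WHERE customer = ?",
            (amount, customer))

bill(conn, "acme", 50)
balance = conn.execute(
    "SELECT balance FROM ledger WHERE customer = 'acme'").fetchone()[0]
assert balance == 50  # invoice row and balance update landed together
```

Without the transaction, a failure between the two statements could leave an invoice with no matching balance change – exactly the double-billing class of bug described above.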

Our move to MongoDB has been interesting and we have encountered problems, but nothing that we were unable to work around. Performance has increased, our disk usage has decreased and we are now in a much better position to continue our scaling plans.

  • Interesting article. It’s always nice to read about other people’s experiences, because that way you also learn something new. I see that your needs are quite different from usual database use, where the read/write ratio is quite different, but it would be nice to see some tests of how MongoDB fares searching a bigger dataset.

    Because I work mostly with web frameworks, I will wait to see if some of them support non-relational databases before I try out MongoDB or another.

  • name

    Mongo??! lol :)

  • hk

    Any comment on why you ended up choosing MongoDB over CouchDB?

    • Mongo has very simple querying, much like SQL. CouchDB requires map/reduce style queries which is much more complicated to work with. Mongo also has the PHP module which is a big advantage.

      • Robin Mehner

        CouchDB does not need an extra PHP module, because it communicates via HTTP and PHP has several possibilities to talk via HTTP (cURL, streams, sockets etc.). Also there are some classes around that wrap the access to CouchDB (like phpillow and some more).

        Don’t want to convince you, as stand on your side with your approach, but the “php module” argument isn’t really one against Couch I think :)

      • Indeed, the PHP module argument is not the only reason we decided to go with MongoDB, but it is a plus that is worth mentioning.

      • Tom Dean

        Unfortunately it’s not possible to achieve the performance of a C/C++ extension through PHP alone. I’m sure for many a library written in PHP is more than adequate, but for anything with serious throughput & volume, a C/C++ extension is the only responsible option (especially when that extension is Boost enhanced).

      • Actually it’s not so simple. You must measure. For example, there are several .erb template language implementations for Ruby, and the fastest is one written in pure Ruby, not C++. Why? Because it’s carefully written to avoid allocating intermediate results.

        C and C++ do not assure high performance, nor are they always necessary for high performance, particularly for IO-heavy workloads. There is no “only responsible option”. You must profile for your workload.

        Boost has some excellent features in its libraries, but is frequently bemoaned for poor performance. I find the phrase “Boost enhanced” rather silly. The libraries are tools, not an automatic go-faster button.

  • Was thought out as an option?
    Also, might have been a good option considering the large amount of data being handled.

    • Both were considered but MongoDB had more advantages. We also didn’t really want to work with map/reduce for querying.

  • nathan

    when you outgrow your mongo take a look at SenSage or Vertica. Look up recent performance comparisons between Hadoop & Vertica … complex queries are 10x+ faster w/ Vertica and the same w/ SenSage.

    • Having looked quickly, these both seem like expensive enterprise products unsuitable for a startup.

      • Unofficially Vertica is about $25k per TB for production databases (development and testing are free). It is enterprise sales though (bleh) so you may be able to negotiate a better rate.

        Something to bear in mind is column stores such as Vertica compress data very effectively, with real world ratios of 10:1 being typical for transaction data and even higher rates for monitoring/logging data. So your dollar stretches a bit more than it might seem at first. These databases execute against the compressed data directly which also maximizes scanning throughput through the cpu/memory bus.

        If your queries involve large scans, compressed column stores give dramatically higher performance than a star schema in a general-purpose database. On the other hand, if the majority of your workload is fetching single objects, the advantage is reduced to just the extra effective capacity you get out of your RAM.

        I know of at least one startup using Vertica on ec2 and they’re quite happy with it.

        I personally have not used it. Just wanted to pass along what I’ve heard and gleaned from the research literature.

        It’d be great if an open source alternative to Vertica/Teradata/etc existed. Mysql Cluster may get there someday.

  • What kind of replication are you looking for? How many nodes are you replicating to? I’m wondering, would Keyspace be good for you?

    • Keyspace was released after we’d completed most of the work for the migration – I remember seeing it on HN. MongoDB seems a more mature project, even though both are relatively new. There is also a PHP module for MongoDB, which is a big advantage for us.

  • Jim

    Excellent article! We have been using MongoDB with Ruby since February and have also been very pleased.

    Since you are not using 1 database per customer, how are you segregating your data?

    Do you add a collection per customer for all of your customer specific collections?
    Did you add a “customer_id” field to your collections?

    • We add a customer ID prefix to each collection that belongs to them.

      • Chris

        Great write-up! Thanks for sharing this information.

        Are there any particular reasons why y’all chose to use a collection prefix vs. a ‘customer_id’ field per document?

        Trying to figure out a similar situation. I reckon the only downside to using a collection prefix is the (soft?) collection limit per database.

        • Server Density has been developed so it can be deployed in separate installations in addition to our hosted service e.g. if a customer wishes to deploy onto their own servers. The logic for handling the hosted service is therefore outside the core application, which it wouldn’t be if we had identifiers in a single collection.

          It also makes it easier to shard based on user.
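The prefix scheme described in this thread can be sketched as a pair of tiny helpers – the names are illustrative, not Server Density’s actual code:

```python
def collection_name(customer_id, base):
    """Per-customer collection name, e.g. customer 42's checks data
    lives in the collection '42_checks'."""
    return "%s_%s" % (customer_id, base)

def customer_for(name):
    """Recover the owning customer and base name from a collection name."""
    prefix, _, base = name.partition("_")
    return prefix, base

assert collection_name(42, "checks") == "42_checks"
assert customer_for("42_checks") == ("42", "checks")
```

Because the customer identity is in the collection name rather than a field of every document, the core application never needs to filter by customer, and whole customers can be moved or sharded by collection.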

  • Tayssir John Gabbour

    Is there any reason MongoDB can’t collapse identical field names into a reference to a single string in shared memory? This is a very common optimization, so you’re not driven to illegible 2-3 character names.

    I suppose an application-level workaround is to define something like what does — a scheme which replaces big tablenames with smaller ones…

    • This mapping can also be handled at the code level with very little overhead.

  • Pingback: Daily Links for Sunday, July 26th, 2009

  • Ross

    Wondering why you didn’t consider staying relational with PostgreSQL. If you’re not going to go map/reduce with the queries, key/value stores seem a poor fit to me.

    • MongoDB isn’t a key/value store; it’s a document store – exactly what we need.

      Have a read through the post above and you’ll see the reasons why we switched.

  • Pingback: Top Posts «

  • Pingback: 451 CAOS Links (caostheory)’s status on Monday, 27-Jul-09 08:19:26 UTC

  • James


    Thanks for the overview, three questions

    * Locks and blocks: can you expand on the impact this might have for client applications – for instance, a web app feeding data into MongoDB?

    * The PHP module argument is pretty moot – for instance, with CouchDB you can access the database via standard socket methods as it’s HTTP – you don’t have to use a PHP wrapper class; they just wrap up common operations, but with the obvious performance drag.
    It’d be interesting to compare CouchDB implementations using sockets vs MongoDB implementations using the Mongo extension.

    Obviously a CouchDB php extension similar to the Mongo extension would be a useful way to avoid PHP wrappers around socket handlers.

    BTW: “pecl install mongo” installs the extension using pecl

    * “Mongo has very simple querying, much like SQL” – aggregated queries, for example, seem to use a reduce method.
    Do you have an example of aggregated SQL-like queries in Mongo?


    • Locking/blocking would mean that no data can be inserted or read and the client application would hang until it times out or the lock/block is removed.

      The PHP module argument is one of many that add to the overall decision to use MongoDB. In theory, connecting via the PHP module should be faster than setting up an HTTP connection in the PHP code itself because the connection setup will be done in native C rather than having to go through another layer (PHP code).

      We don’t use the MongoCollection Group method, just the query method. This was an advantage because we wanted to avoid the complexity of building map/reduce functions to access our data.

    • And by query method I really mean “find” e.g.

      $checksMemSwapCol->find(array('sId' => (int)$serverId, 'tA' => array('$gte' => $timeStartMongo, '$lte' => $timeEndMongo), 'tAG' => $granularity), array('tA', 'vU', 'vF'))->sort(array('tA' => 1))->limit(-$limit);

  • Valentin Kuznetsov

    I’m in the process of evaluating MongoDB and I did compare it with CouchDB. I think the main difference is support for SQL-like queries (which I consider a big plus); it’s easier to understand and more flexible, in my mind. I also found that CouchDB has one big disadvantage: view creation forces data indexing. If you insert a lot of data, you have to wait to access it until the view has indexed it all; that’s not required in Mongo. Just to give you an idea, 1M records required more than an hour of indexing time. We tried bulk insertion with parallel clients, retrieving one doc after each bulk insert, and that caused crashes in CouchDB – an instability I have not yet solved. What I’m also interested in with Mongo is bulk insertion. Did you try inserting N docs at a time, where N is of the order of 1K or more? What is the best strategy when you need to insert a lot of data? For those who are interested: we’re looking for a cache solution with DB features (we want queries placed against data in the cache), so we must trigger big updates from back-end RDBMSs.

  • Pingback: links for 2009-07-27 « Donghai Ma

  • What’s the reason for not choosing Tokyo Cabinet?

    • None of the client libs I tested worked.

      • Maurício Linhares

        Tokyo is also horrible for complex querying. But it’s a nice key/value store anyway.

      • It’s not designed for complex querying, it’s a k/v store. You query the key and get the value and that’s it.

  • Pingback: Dev Blog AF83 » Blog Archive » Veille technologique : OS, Navigateurs, HTML, CSS, Ruby, Rails, Javascript, Langages, VCS, DB, Geo

  • Pingback: links for 2009-07-29 | burningCat

  • Excellent Article David, thank you for sharing your experience with us.

    I hope that in the future we will read another excellent article like this maybe with tips, solutions that you might find after using mongoDB for a long time.

    Again thank you and good luck.

  • Pingback: Some links about rdbms vs distributed key stores | Tof://

  • A quite interesting analysis. I’m quite used to the non-relational, document-oriented database paradigm, having worked with Lotus Notes since 1993 (and before that, with more conventional relational DBs).

    Just to pick you up on one small point, though. In describing how you saved disk space by restricting field names to a screamingly small 2 to 3 characters, you gave the example of saving 140 MB of disk space for field names. You called this “a massive saving” and I would disagree. This amount of disk space is trivial these days, when terabyte drives only cost in the order of 100 dollars or so. If you were saving 140 GB then I might agree, but not when the unit of storage is MB.

    I would seriously question whether saving maybe 10 or 20 dollars worth of disk space is worth the trouble caused by your having to battle with cryptically-short field names, a coding nightmare and even more so when it comes to subsequent code maintenance. I thought that we were well past such penny-pinching in these days of commodity hardware pricing.

    Tony Austin
    CEO, Asia/Pacific Computer Services

    • The hypothetical saving was 760MB for 10m documents, and that is per collection. Although we would only have that many documents for paying customers, spread over all our free customers the saving is quite large. Although the per GB disk usage cost is relatively low nowadays, any saving where it can be realistically achieved is worth it, particularly as a startup. It could be the difference between having to upgrade to large disks and not, which can be quite a big saving.

      • Fred

        Thanks for your insights and experiences with MongoDB.

        Being used to RDBMSes, I’m surprised at these “problems” in MongoDB, as it’s touted to be a “superior”, more scalable system.

        Of course hundreds of MBs of unnecessary data are both wasteful, and potential I/O drains. Any good DBA will try to minimize unnecessary allocations, especially on file. Who knows when the DB will read or write these bytes?

        If you ask me, storing fieldnames per row sounds very very lame..

        Which is why I’m surprised to hear about the entire database being locked, and naming-strategies like it was the 80s all over again. I too would do the same thing of course, until the real problem in MongoDB is fixed.

        It sounds like too little to sacrifice my RDBMS comfort for, but I cherish the spirit of the search and hope to see viable alternatives in the future.

      • I have to agree with Fred; Mongo, Couch etc. are more hype than anything else at this point. It reminds me of Data Gateway vs Active Record, where AR is far superior and DG assumes you will only work with single views and 1 table at a time.

        There is a very important fundamental relating to data, called normalization, which fails completely in these hyped data stores – not to mention the enormous integrity and security issues involved when a programmer screws up a data policy in his code and there is no database model to enforce any rules.

        These schema-less DBs are more like a magician’s hat: chuck all your data in and later stick your hand in and see what you get.

        But I also find it very interesting and would like to see how Mongo, Couch etc. evolve to solve the majority of real-world scenarios.

      • Ron

        I would not call these databases a mess, magic, or anything of the sort. I have been working on databases for over 15 years now. The problem was that we tried to go down a single route with the RDBMS model: it was our hammer and every DB application was a nail. I have worked as a developer (and DBA with some) with Oracle, DB2, MSSQL, Postgres, MySQL, UniVerse, and Cache. All of these systems have their pluses and minuses. However, my experience has been that the non-relational databases were faster and easier to work with as a developer. It’s amazing what marketing and money can do for perceptions! ;-) An interesting side note on this: IBM has incorporated some of the fundamentals of Informix (NoSQL) into DB2; they also purchased UniVerse.

        Normalization is good, but from what I can tell most people either over-normalize or under-normalize. The best short answer I have found on the heart of the problem was a diagram that showed that in an RDBMS the data has to be gathered together into one cohesive structure to give to a user, whereas in most of the NoSQL databases it does not. The same problem occurs on writes.

        It is a great discussion, but there is no panacea for database choice. To say RDBMS or NoSQL doesn’t have its place is unfair to either.

  • JP

    Hi, did you try other MySQL engines besides MyISAM before moving to MongoDB?

    • MyISAM was the most suitable for the type of usage we were experiencing – many writes and few reads. We used InnoDB (and still do) for the billing and customer systems where we need transactions.

      • morrel

        If your main concern was the need for fast loading, why didn’t you go the Oracle route, then?

  • david

    awesome i am loving mongodb too :)

  • For a future project, you guys may want to keep an eye on MckoiDDB. It’s an open source (GPLv3) distributed DB system – we just made a first public release a couple of days ago. It’s also a ‘schemaless’ key/value database at its roots and we have plans to support some highly structured data models over the base API. The first release contains a hybrid file system and table data model. The idea is, if a developer wants the efficiency of semi-structured data or the expressiveness of structured data models then our API is there to support both.

    The system is distributed, transactional and available and handles those things fairly transparently to the client side developer. Internally it borrows ideas about storing data from modern file systems and the distributed storage parts came from GoogleFS with some of my own ideas thrown in there too :) (it’s basically a log-structured file system with a BTree used to represent each instance snapshot). MckoiDDB is in its infancy but you may find the software interesting nonetheless.

  • Thanks for sharing! I’m similarly investigating a better data storage technology than MySQL for our non-relational data in big tables (40mil+ rows). Your information here has proven to be valuable indeed!

  • Pingback: The Server Density storage backend – utility storage from Rackspace « Boxed Ice Blog

  • You can also look at Redis.
    It is a very fast persistent key-value database with a built-in network interface.

  • Pingback: Scheduled Maintenance 13th Oct 2009 12:00 GMT « Boxed Ice Blog

  • Pingback: Preventing false positives « Boxed Ice Blog

  • It would be cool to understand what kind of data you are storing in MongoDB (all of the Server Density stuff?) and how you deal with it – how you denormalize it, how you aggregate it, how you query it in real time.

    • All the Server Density data is stored in MongoDB – from the list of users through to all the log data. We don’t do any de-normalisation, and MongoDB allows very flexible querying, so we have no issues querying against massive datasets – it’s all handled by MongoDB.

  • Pingback: NoSQL数据库探密 « 老纪博客

  • Pingback: links for 2009-12-16 « Bloggitation

  • Matthew

    Great article! How do you do relations (because the idea that any database has no relations is ludicrous)? For example, if you have a Toy Store franchise with many Toy Stores and many Toys within those Toy Stores, is a “document” an individual Toy Store, with Toys as a hash within the Toy Store document? And if so, can you index those “subtables”? Or do you make Toys a different collection, and store Toy Store ids within the Toy documents?

    I THINK that’s how MongoMapper does it, but that contradicts the non-relational paradigm. If you do it the first way, I’m not sure how many layers down you can go and continue to index, and get good performance. Thoughts?

    • There are no relations, hence the name “non-relational database”. Of course you’re free to create your own associations by including IDs in fields, but the database doesn’t know anything about them other than that they’re integer fields.

      I’m not sure what you mean by “Toys” being a hash. MongoDB indexes the content of the field, so if it’s a hash you’d have to pass that hash as the key when doing queries.

      • Matthew

        Relations, associations – we’re just playing word games here. The point is that no database can be without “associations” (whether the DB knows about them or not is a moot point). My question is: how do you retrieve information from associated collections without having access to joins? Do you query what you are looking for in one collection and then get its associated document in another collection?

        I guess I’m one of the few people trying to use document-oriented DBs in read-heavy environments, because I almost never see much written about querying when it comes to NoSQL.

      • You would need to handle this yourself in your own code. Without joins you’d need to do individual queries on each collection to get the associated data. Where possible you should denormalise your data (i.e. duplicate fields) so you can reduce the number of queries needed, but the query cost in MongoDB is very low anyway (assuming you have appropriate indexes, of course).

        This is one of the big advantages of MongoDB over something like CouchDB as you can do ad-hoc queries rather than having to construct map reduce syntax (thereby knowing your query in advance).

        NB By “association” I didn’t mean in any technical sense.
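The two-query pattern described in this exchange can be sketched with lists of dicts standing in for collections – the field names are illustrative:

```python
def manual_join(servers, metrics, server_id):
    """Fetch the parent document, then query the second collection for
    its children: two round trips replacing one SQL JOIN."""
    server = next(s for s in servers if s["_id"] == server_id)  # query 1
    rows = [m for m in metrics if m["sId"] == server_id]        # query 2
    return {"server": server, "metrics": rows}

servers = [{"_id": 1, "name": "web01"}]
metrics = [{"sId": 1, "tA": 10}, {"sId": 2, "tA": 20}]
result = manual_join(servers, metrics, 1)
assert result["server"]["name"] == "web01"
assert result["metrics"] == [{"sId": 1, "tA": 10}]
```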

  • ywarnier

    Great article, thanks for sharing, particularly the “unexpected differences”.

  • Alex


    You reviewed these products:

    * Cassandra
    * CouchDB
    * Hypertable
    * MongoDB
    * Tokyo Cabinet
    * Project Voldemort

    but didn’t give any reasons for not selecting the others. It would be valuable to share that part of your experience. Personally my interest goes to Mongo, but I would be curious to see your “study” of Voldemort and Cassandra, for example.

  • Great! Nice to hear about your experience.
    As David says, the PHP module is a good point.

    When we use Couch we go over HTTP,
    but when we use Mongo we go over TCP/IP directly – and HTTP itself sits on top of TCP/IP.

    MongoDB also has a shell which uses JavaScript for its operations; that really reduces the learning curve for administering the DB.

    And there is a podcast about MongoDB, done at FLOSS Weekly by one of its developers.

    Aubowan! Thanks!

  • Barbara

    This may seem like a strange request, but I am a college student studying internet applications and web programming. In our SQL course we have been asked to find 2 examples of digital or electronic databases: 1. a relational database (schools, libraries and banks were given as samples – I have chosen the Canadian Automobile Association); and 2. a non-relational database. This last one has me totally stumped (remember, I am very new to all this – first year – so be kind!). Can anyone help me out using simple, simple terminology in explaining how it works?

  • Pingback: Migrating Mnesia Records to MongoDB Documents using Erlang emongo driver «

  • Pingback: NoSQL数据库探讨之一 - 为什么要用非关系数据库? | AKi Cong

  • Great article David.

    Since posting this, do you have any additional thoughts to share? I am working on a project where we have decided to use MongoDB and I am also looking to use it side by side with a very large MySQL db on another project.

    Since your post is around 6 months old, I was wondering if you have learned anything new? I know that the MongoDB dev team is very active and has been working on a lot of new features and bug fixes.

    • We’ve learned a fair bit about the best way to run MongoDB, mostly around dealing with large data sets. Our primary database is currently just under 800GB. The most important thing is that there is no single-server durability – you must use replication across data centres. See

      • Thanks – and thanks for the follow up too :)

      • I am testing MongoDB on our dev system. I have read that there is no way to specify an upper limit on the amount of RAM MongoDB takes (as it uses memory-mapped files), and that virtual memory usage will be roughly equal to the data size. Could you share some details on how the 800GB is distributed (in terms of the number of servers and their configuration), and whether you are encountering any swapping issues?

  • Thanks for sharing your experience. I am a little surprised at the choice to move to a non-relational database based on the specific issues you identified with MySQL. Did you consider using a better relational database that did not suffer from those weaknesses? I ask because there are many very good RDBMSs that could have addressed those issues without requiring you to completely re-architect your application. Just curious.

    • Such as?

      • morrel

        We’ve got over 2TB of data in an Oracle database, with users loading between 10GB and 50GB of data per day while others are querying the database.

        For scaling purposes, we’ve got RAC.

        Works fine, and no need to separate loading and querying, as locks are row-based.

        • Oracle is not a serious choice for startups due to the cost.

      • PostgreSQL?

  • Pingback: Non-relational data persistence in Java with MongoDB | SemanticBlog

  • Pingback: Notes from a production MongoDB deployment « Boxed Ice Blog

  • Pingback: Living Cosmos » Blog Archive » Choosing a non-relational database; why we migrated from MySQL to MongoDB « Boxed Ice Blog

  • Pingback: Caffeine Driven Development » Blog Archive » L33t Links #83

  • Pingback: Notes from a production MongoDB deployment « Monterail Devblog

  • Illan

    Q: Why do you think that the Amazon SimpleDB latency would be too high? Did you do any comparisons?

    • There’s latency introduced by the fact that you’d be connecting over the internet (compared to running our own database servers on a LAN).

  • Pingback: MongoDB monitoring « Boxed Ice Blog

  • Pingback: First project with Python, Django and MongoDB - Christiano Anderson

  • Excellent article. I’ve been playing with the idea of creating a postcode geocoding web service, and now that Heroku has MongoDB support in beta I’m definitely going to try this out.


  • Pingback: MongoDB Webinar: Approaching 1 Billion Documents slides and audio « Boxed Ice Blog

  • Pingback: MongoUK MongoDB London conference discount « Boxed Ice Blog

  • jay greasley

    Very interesting and useful post. Another difference between Couch and Mongo is that Mongo stores documents as BSON (effectively binary JSON) while Couch uses JSON.

  • Pingback: Map reduce and MongoDB « Boxed Ice Blog

  • Pingback: MongoDB vs CouchDB « Newsicare

  • Thanks for the knowledgeable article.

    We just wrote a short article comparing MongoDB to CouchDB, based on your well-written article and others.


  • Can you tell me how much data bloat you saw when you shifted from MyISAM to MongoDB? Since MongoDB stores the redundant field names with every record, it should be huge? I have been looking at MongoDB but am reluctant to try it because the disk space requirements would be huge for large data sets.


    • I don’t have any figures for this but you can easily calculate it based on the length of each field name * the number of documents.
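
      The back-of-the-envelope maths David describes can be sketched in a few lines. The field names and document count below are hypothetical; the calculation assumes BSON’s element encoding, where each field name is stored once per document as a C string (the name plus a NUL byte):

```python
# Rough estimate of how much disk space field names consume in MongoDB.
# Field names and document count are made up for illustration.
long_names = ["timeAdded", "serverId", "cpuLoad", "memFree"]
short_names = ["tA", "sId", "cL", "mF"]
num_docs = 10_000_000

def name_bytes(names, docs):
    # Each BSON element stores its name as a C string: the name plus a NUL byte.
    return sum(len(n) + 1 for n in names) * docs

saving = name_bytes(long_names, num_docs) - name_bytes(short_names, num_docs)
print(f"Estimated saving: {saving / 1024 / 1024:.0f} MB")
```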

  • Thomas Heywood

    > We cut down the names to 2-3 characters. This is a little more confusing in the code but the disk storage savings are worth it. And if you use sensible names then it isn’t that bad e.g. timeAdded -> tA. A reduction of about 15 bytes per row at 10,000,000 rows means ~140MB for field names – a massive saving.

    A two-terabyte hard drive goes for about $150. A developer costs you between $50k and $80k per year. And you make your code unreadable to save what – the size of two .mp3 albums?

    That’s something I’d expect to read in Dilbert.

    • Right, but we don’t buy our own hardware, and at the time this was written we were using Slicehost, who had limited disk space (and we had limited funds).

      It doesn’t make the code unreadable because it’s easy to understand what the field names mean, although of course it’s better if they’re verbose. If you use a mapping layer it doesn’t matter.

      And in Mongo the field names are stored per document.
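
      The mapping-layer idea could look something like this – a minimal, hypothetical sketch, not the code Server Density actually uses:

```python
# Sketch of a field-name mapping layer: readable names in application code,
# abbreviated names on disk. The names themselves are hypothetical.
FIELD_MAP = {"timeAdded": "tA", "serverName": "sN", "cpuLoad": "cL"}
REVERSE_MAP = {v: k for k, v in FIELD_MAP.items()}

def to_db(doc):
    # Translate readable field names to their short stored form.
    return {FIELD_MAP.get(k, k): v for k, v in doc.items()}

def from_db(doc):
    # Translate stored short names back to readable ones.
    return {REVERSE_MAP.get(k, k): v for k, v in doc.items()}
```

      With something like this in place, the rest of the code only ever sees `timeAdded`, while the documents on disk carry `tA`.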

      • Henley Wing

        I agree with you. It’s easier to say “it’s just a few TB” when you’re an employee in a startup/enterprise and not paying the costs, or if you’re a well established company with funds to spend. In the early stages, space and costs are critical things to control.

    • Russ

      A 1TB disk may well be $100 at your local PC shop, but an enterprise 1TB disk costs a hell of a lot more than that.

      Plus we haven’t even mentioned server room space, energy consumption, RAID redundancy etc etc.

      • Yep. And like David said, mapping is a simple, trivial task. You can have your cake and eat it too.

  • Hey David.

    I’ve been reading a lot about Mongo, trying to implement it. In “MongoDB: The Definitive Guide” (O’Reilly), page 35, it says: “MongoDB leaves some padding around a document to allow for changes in size”. Don’t know if you knew that, but it might be useful. Maybe you can find some sort of configuration that avoids this extra space (if your documents don’t change).

  • William

    How are you backing up your 800GB MongoDB database?
    What happens when you exceed a 1TB disk file or a tape?
    Have you found a space-saving solution for dumping the database to BSON files?

    • We don’t back up the whole DB, only the critical collections we’d need to restore for disaster recovery.

      Dumping to BSON would take less storage than the DB itself, as the dumps don’t include indexes and are compacted compared to the live data files.

      • william

        Please say more about backing up “only critical collections”. Do you do this with a custom app and queries? My need is for archival data, so everything except the indexes is critical. Can you recommend Mongo for this type of app?

  • I’m approaching the noSQL theme for the first time, and this post and all the comments added many interesting points of view – and some more points in favour of MongoDB. The comparison with CouchDB stops here for the moment; I’ll do some code experiments now… Thanks to all for sharing :-)

  • KB

    Hi all,
    I happened to come across this discussion and found it very enriching. Thanks to all of you. Has anyone tried the Brainwave Platform?

    Also, I would like to know some performance benchmarks for MongoDB – if anyone can help, that would be great.


  • ram

    Did you consider sharding your MySQL database, say by customer id? Each database could contain info about N customers and live on a single server. You could then have a mapping which identifies where a customer’s info is, given the customer id. This really helps in scaling relational databases.
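
    The scheme ram describes could be sketched like this – a minimal illustration with made-up server names, using simple modulo placement in place of the explicit mapping table he suggests:

```python
# Sketch of sharding customers across MySQL servers by customer id.
# Server hostnames are invented for illustration.
SHARDS = ["db1.example.com", "db2.example.com", "db3.example.com"]

def shard_for(customer_id):
    # Modulo placement; a real deployment would keep an explicit
    # customer-to-server mapping table so customers can be rebalanced.
    return SHARDS[customer_id % len(SHARDS)]
```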

  • Jignesh

    As you mentioned that you have more read operations than write operations – my question is, how are you populating the MongoDB collections? I think there must be sufficient write operations as well.

  • Jignesh

    The audio in this presentation is barely audible.

    Please can you provide an alternative?


  • Excellent write-up, thanks. I see there is a lot of interest on this topic…

    I have written a new administrative front-end for WordPress with Mass Publishing in mind. I had looked at the flexibility of Postgres because some of the interactive objects I’d like to include in the system architecture (in the social media realm) aren’t typical data.

    Then when I explained what I had to one of my mentors, he said “it sounds like you really need to take the RDBMS out of the equation.” I immediately saw the usefulness in that based on the needs of my application – speed being one of them. Like everything, technology should be fit to the task. You wouldn’t take a Cadillac limo offroad, would you?

    Thanks for this complete perspective on your mission and your journey to an effective solution for your company. It has been valuable as I have started making decisions about the database I will use before I develop a new framework and break out of WordPress altogether.

  • gm

    > in some cases are seeing cached query times around 7ms and non-cached around 50-200ms

    That doesn’t sound any faster than PostgreSQL; when data is in memory, responses are often sub-1ms. Is MySQL really so slow that this seems fast?

  • I have realized I can’t drill a hole with a hammer and I have switched to a drill, OMG :DDD