MongoDB schema design pitfalls


By David Mytton,
CEO & Founder of Server Density.

Published on the 21st February, 2013.

One of the things that makes MongoDB easy to get started with is that you don’t have to think about schema design – just shove data in and it’ll let you query it. That helps initial development and has benefits down the line when you want to change your document structure. That said…

…so just like any database, to improve performance and make things scale, you still have to think about schema design. This has been well covered elsewhere (here, here and here), so here are some more in-depth considerations to help you avoid the pitfalls of MongoDB schema design:

1. Avoid growing documents

If you add new fields, or the document (field names + field values) grows past its allocated space, the document will be written elsewhere in the data file. This hurts performance because the whole document has to be rewritten. If this happens a lot, Mongo will adjust its padding factor so documents are given more space by default, but in-place updates are still faster.

You can find out if your documents are being moved by using the profiler output and looking at the moved field. If this is true then the document has been rewritten and you can get a performance improvement by fixing that (see below).
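As a rough sketch of that check, the snippet below filters profiler entries for the moved flag. The helper name `moved_operations` and the sample entries are hypothetical; only the `moved` field comes from the profiler output described above.

```python
from typing import Any, Dict, Iterable, List


def moved_operations(profile_entries: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the profiler entries whose `moved` flag is true."""
    return [entry for entry in profile_entries if entry.get("moved") is True]


# Hypothetical sample entries shaped like profiler output (illustrative values only).
sample_entries = [
    {"op": "update", "ns": "metrics.hourly", "millis": 12, "moved": True},
    {"op": "update", "ns": "metrics.hourly", "millis": 1},
    {"op": "query", "ns": "metrics.hourly", "millis": 3},
]

for entry in moved_operations(sample_entries):
    print(entry["ns"], entry["op"], entry["millis"])
```

With a driver you would more likely filter on the server instead, for example by querying the `system.profile` collection with `{"moved": True}`, assuming profiling is enabled.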

2. Use field modifiers

One way to avoid rewriting a whole document, and to modify fields in place instead, is to specify only the fields you wish to change and use modifiers where possible. Instead of sending a whole new document to update an existing one, you can set or remove specific fields. And if you’re doing certain operations like increment, you can use their dedicated modifiers. These are more efficient both in the communication with the database and in the operation on the data file itself.
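A minimal sketch of building such an update, using the `$set`, `$unset` and `$inc` modifiers. The helper `build_update` and the field names are hypothetical; it only assembles the update document and does not talk to a database.

```python
from typing import Any, Dict, Iterable, Mapping, Optional


def build_update(
    set_fields: Optional[Mapping[str, Any]] = None,
    unset_fields: Optional[Iterable[str]] = None,
    inc_fields: Optional[Mapping[str, float]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Assemble an update document that touches only the named fields."""
    update: Dict[str, Dict[str, Any]] = {}
    if set_fields:
        update["$set"] = dict(set_fields)
    names_to_unset = list(unset_fields or [])
    if names_to_unset:
        # The value given to $unset is ignored; an empty string is conventional.
        update["$unset"] = {name: "" for name in names_to_unset}
    if inc_fields:
        update["$inc"] = dict(inc_fields)
    return update


# Hypothetical fields: change a status and bump a counter without resending the document.
print(build_update(set_fields={"status": "ok"}, inc_fields={"checks": 1}))
```

Assuming a PyMongo-style collection object, the result would be passed as the second argument of `collection.update_one({"_id": doc_id}, update)`.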

3. Pay attention to BSON data types

A document can be moved just by changing the data type of a field. Consider what format you want to store your data in: for example, if you rewrite (float)0.0 to (int)0, that is actually a different BSON data type, and it can cause the document to be moved.
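To illustrate, here is a small sketch that flags a write which would change a field's BSON type. It assumes a driver that encodes native integers as BSON int32/int64 and floats as doubles, which is PyMongo's default mapping; the helper names are hypothetical and only a few scalar types are handled.

```python
from typing import Any


def bson_type_name(value: Any) -> str:
    """Name the BSON type a simple Python scalar would be encoded as."""
    if value is None:
        return "null"
    # bool must be checked before int because bool is a subclass of int.
    if isinstance(value, bool):
        return "bool"
    if isinstance(value, int):
        return "int32" if -2**31 <= value < 2**31 else "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"type not covered by this sketch: {type(value).__name__}")


def changes_bson_type(old_value: Any, new_value: Any) -> bool:
    """True when overwriting old_value with new_value switches BSON type."""
    return bson_type_name(old_value) != bson_type_name(new_value)


print(changes_bson_type(0.0, 0))    # float replaced by int
print(changes_bson_type(0.0, 1.5))  # float replaced by float
```

A check like this could sit in front of your write path during testing to catch accidental type switches before they reach the database.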

4. Preallocate documents

If you know you are going to add fields later, preallocate the document with placeholder values, then use the $set field modifier to change the actual value later. As noted above, be sure to preallocate the correct data type – beware: null is a different type!

However, trigger the preallocation randomly, because suddenly creating a huge number of new documents will have an impact too. For example, if you create a document for each hour, you want to create them ahead of that hour, spread over a period of time, rather than creating them all on the hour.
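A minimal sketch of both ideas, assuming one document per server per hour with a slot for each minute. The document layout, field names and helper names are hypothetical, not the schema used by the author.

```python
import random
from datetime import datetime
from typing import Any, Dict, Optional


def preallocated_hour_doc(server_id: str, hour: datetime) -> Dict[str, Any]:
    """Build an hourly document with a float placeholder for every minute."""
    hour = hour.replace(minute=0, second=0, microsecond=0)
    return {
        "server_id": server_id,
        "hour": hour,
        # Placeholders use 0.0 (a double), matching the type written later.
        "minutes": {str(minute): 0.0 for minute in range(60)},
    }


def should_preallocate(probability: float, rng: Optional[random.Random] = None) -> bool:
    """Decide randomly whether this write should also create the next document."""
    source = rng or random
    return source.random() < probability


doc = preallocated_hour_doc("srv7", datetime(2013, 2, 21, 14, 35))
print(len(doc["minutes"]), doc["hour"])
```

Later writes can then change a slot in place with an update such as `{"$set": {"minutes.17": 42.5}}`. Calling `should_preallocate` with a small probability on each write during the current hour spreads the creation of next hour's documents over time instead of doing them all at once.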

5. Field names take up space

This is less important if you only have a few million documents but when you get up to billions of records, they have a meaningful impact on your index size. Disk space is cheap but RAM isn’t, and you want as much in memory as possible.
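Field names are stored inside every document, so their length is paid for once per document. The sketch below computes the encoded size of a BSON document that contains only double fields, following the BSON layout (length prefix, then a type byte, null-terminated name and 8-byte value per field, then a terminator). The field names are hypothetical.

```python
from typing import Iterable


def bson_size_of_double_fields(field_names: Iterable[str]) -> int:
    """Encoded size in bytes of a BSON document holding one double per name."""
    size = 4 + 1  # int32 length prefix plus the trailing document terminator
    for name in field_names:
        # type byte + UTF-8 field name + name terminator + 8-byte double
        size += 1 + len(name.encode("utf-8")) + 1 + 8
    return size


long_names = ["cpu_utilisation_percent", "memory_used_bytes"]
short_names = ["c", "m"]

saved_per_document = (
    bson_size_of_double_fields(long_names) - bson_size_of_double_fields(short_names)
)
print("bytes saved per document:", saved_per_document)
print("bytes saved over a billion documents:", saved_per_document * 1_000_000_000)
```

Shorter names trade readability for space, so a mapping layer between descriptive names in code and short names in storage is one way to keep both.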

6. Consider using _id for your own purposes

Every collection gets _id indexed by default, so you can make use of this by storing your own unique value in it. For example, if you have a structure based on date, account ID and server ID, like we do with our server monitoring metrics storage for Server Density, you can use that combination as the _id rather than having each as a separate field. You can then query by _id with the single default index instead of using a compound index across multiple fields.
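One possible way to build such an _id is to join the parts into a single string. The separator, the hourly date format and the helper names below are assumptions for illustration, and the sketch assumes the IDs never contain the separator.

```python
from datetime import datetime
from typing import Tuple

ID_SEPARATOR = ":"          # assumed not to occur inside account or server IDs
HOUR_FORMAT = "%Y%m%d%H"    # hypothetical granularity: one document per hour


def make_metric_id(account_id: str, server_id: str, hour: datetime) -> str:
    """Combine account, server and hour into one value usable as _id."""
    return ID_SEPARATOR.join([account_id, server_id, hour.strftime(HOUR_FORMAT)])


def parse_metric_id(metric_id: str) -> Tuple[str, str, datetime]:
    """Recover the parts from an _id after the document has been fetched."""
    account_id, server_id, stamp = metric_id.split(ID_SEPARATOR)
    return account_id, server_id, datetime.strptime(stamp, HOUR_FORMAT)


metric_id = make_metric_id("acct42", "srv7", datetime(2013, 2, 21, 14))
print(metric_id)
print(parse_metric_id(metric_id))
```

A lookup then becomes a single exact match on `{"_id": metric_id}`. The trade-off is that the individual parts can be recovered after fetching a document but cannot be queried on their own without storing them in separate fields as well.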

7. Can you use covered indexes?

If you create an index which contains all the fields you would query and all the fields that will be returned by that query, MongoDB will never need to read the data because it’s all contained within the index. This significantly reduces the need to fit all data into memory for maximum performance. These are called covered queries. The explain output will show indexOnly as true if you are using a covered query.
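The rule above can be sketched as a simple check: a query is only covered when every queried and every returned field is in the index. Because _id is returned by default, the projection has to exclude it unless it is part of the index. The helper names and field names here are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def index_spec(fields: Iterable[str]) -> List[Tuple[str, int]]:
    """Ascending index specification as a list of (field, direction) pairs."""
    return [(name, 1) for name in fields]


def covered_projection(index_fields: Iterable[str]) -> Dict[str, int]:
    """Projection that returns only indexed fields, excluding _id if needed."""
    projection = {name: 1 for name in index_fields}
    if "_id" not in projection:
        projection["_id"] = 0
    return projection


def is_covered(
    index_fields: Iterable[str],
    query_fields: Iterable[str],
    returned_fields: Iterable[str],
) -> bool:
    """True when the index contains every queried and every returned field."""
    indexed = set(index_fields)
    return set(query_fields) <= indexed and set(returned_fields) <= indexed


fields = ["account_id", "server_id", "hour"]
print(index_spec(fields))
print(covered_projection(fields))
print(is_covered(fields, ["account_id"], ["server_id", "hour"]))
```

Assuming a PyMongo-style collection, `index_spec(fields)` is the shape accepted by `create_index`, and the projection would be passed as the second argument to `find`. The explain output remains the authoritative check.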

8. Use collections and databases to your advantage

You can split data up across multiple collections and databases:

  • Dropping a whole collection is significantly faster than doing a remove() on the documents within it. This can be useful for handling retention e.g. you could split collections by day. A large number of collections usually makes little difference to normal operations, but does have a few considerations such as namespace limits.
  • Database-level locking lets you split up workloads across databases to avoid contention, e.g. you could separate high-throughput logging from an authentication database.
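As a sketch of the per-day retention idea, the helpers below name a collection after its day and pick out the ones that have aged past a retention window. The naming scheme, prefix and helper names are hypothetical.

```python
from datetime import date, datetime, timedelta
from typing import Iterable, List

DAY_FORMAT = "%Y%m%d"  # hypothetical suffix format, e.g. logs_20130221


def collection_name_for_day(prefix: str, day: date) -> str:
    """Name of the collection that holds one day's documents."""
    return f"{prefix}_{day.strftime(DAY_FORMAT)}"


def expired_collection_names(
    names: Iterable[str], prefix: str, today: date, keep_days: int
) -> List[str]:
    """Collections with this prefix whose day is older than the retention window."""
    cutoff = today - timedelta(days=keep_days)
    expired = []
    for name in names:
        if not name.startswith(prefix + "_"):
            continue
        try:
            day = datetime.strptime(name[len(prefix) + 1:], DAY_FORMAT).date()
        except ValueError:
            continue  # same prefix but not a per-day collection
        if day < cutoff:
            expired.append(name)
    return expired


names = ["logs_20130201", "logs_20130220", "users", "logs_archive"]
print(expired_collection_names(names, "logs", date(2013, 2, 21), keep_days=7))
```

Assuming a PyMongo-style database object, the names could come from `db.list_collection_names()` and each expired one could be passed to `db.drop_collection(name)`, which replaces a large remove() with a single drop.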

Test everything

Make good use of the system profiler and explain output to check that your queries are doing what you think they are doing. And run benchmarks of your code in production, over a period of time. There are some great examples of problems uncovered this way in this schema design post.

  • Sean

    I like the idea of using ‘_id’ but then you cannot select data based on date for your example. Or am I missing something ?

    • Yes that is true. You can get the date after you return the doc but can’t query on it without another field.

  • The “Avoid growing documents” point impressed me a lot. In my production setup a lot of nmoved happens when updating, Mongo locks increase dramatically, and disk IO increases a lot.

  • I don’t think field names have much impact on index size. I think an index only stores the values of the indexed fields for each row, not the field names.

  • Great article indeed. I was just planning the data modeling in MongoDB for a project. A great help for me. Thanks

  • Alvaro

    I was thinking about point 6 and wondering if a compound index with account id, server id, and date, in this order, is more performant when looking for records, as there is some kind of hierarchy index search (first searching by account id, then server id, and then date), instead of looking in the whole records where you do not have this advantage…

  • Roberto Andrew

    Diagrams for MongoDB:

    MongoDB stores information based on JSON documents. The complexity of the documents increases as the number of sub-documents increases. Managing such documents using diagrams may make work much easier.
    This will be represented in DbSchema as three entities, one for each sub-document.
    DbSchema does discover the schema by scanning the data.
