MongoDB schema design pitfalls

By David Mytton, CEO & Founder of Server Density.
Published on 21st February 2013.
One of the things that makes MongoDB easy to get started with is that you don’t have to think about schema design – just shove data in and it’ll let you query it. That helps initial development and has benefits down the line when you want to change your document structure. That said…
Remember, "schemaless" doesn't mean you don't need to design your schema! #mongodb
— Derick Rethans (@derickr) February 11, 2013
…so just like any database, to improve performance and make things scale you still have to think about schema design. The basics have been well covered elsewhere (here, here and here), so here are some more in-depth considerations to avoid the pitfalls of MongoDB schema design:
1. Avoid growing documents
If you add new fields, or the size of the document (field names + field values) grows past the allocated space, the document will be rewritten elsewhere in the data file. This hurts performance because the whole document has to be moved and rewritten. If it happens a lot, MongoDB will adjust its padding factor so documents are given more space by default, but in-place updates are always faster.
You can find out if your documents are being moved by looking at the `moved` field in the profiler output. If this is `true` then the document has been rewritten, and you can get a performance improvement by fixing that (see below).
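For example, a quick check in the mongo shell might look like this (the profiling level and collection are illustrative):

```javascript
// Log slow operations to db.system.profile
db.setProfilingLevel(1)

// Look for recent operations where the document had to be moved
db.system.profile.find({ moved: true }).limit(5).pretty()
```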
2. Use field modifiers
One way to avoid rewriting a whole document is to modify fields in place: specify only those fields you wish to change, and use update modifiers where possible. Instead of sending a whole new document to replace an existing one, you can set or remove specific fields, and for operations like incrementing there are dedicated modifiers. These are more efficient both in the communication between the application and the database, and in the operation on the data file itself.
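A minimal sketch in the mongo shell (the `metrics` collection and its fields are made up for illustration):

```javascript
// Change only specific fields rather than sending a replacement document
db.metrics.update(
    { _id: "web1:2013-02-21" },
    { $set: { status: "ok" }, $inc: { checks: 1 }, $unset: { error: "" } }
)
```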
3. Pay attention to BSON data types
A document can be moved even by changing a field’s data type. Consider what format you want to store your data in, e.g. if you rewrite `(float)0.0` as `(int)0` then that is actually a different BSON data type, and can cause the document to be moved.
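In the mongo shell, for example, numeric literals are stored as doubles unless you request an integer explicitly (collection name is illustrative):

```javascript
// Plain shell numbers are stored as BSON doubles
db.metrics.insert({ value: 0.0 })

// NumberInt stores a 32-bit integer – a different BSON type, so this
// update can cause the document to be moved
db.metrics.update({ value: 0.0 }, { $set: { value: NumberInt(0) } })
```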
4. Preallocate documents
If you know you are going to add fields later, preallocate the document with placeholder values, then use the `$set` field modifier to change the actual value later. As noted above, be sure to preallocate the correct data type – beware: `null` is a different type!
However, spread the preallocation out over time, because suddenly creating a huge number of new documents has an impact of its own. E.g. if you create a document for each hour, create them in advance of that hour, balanced over a period of time, rather than creating them all on the hour.
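A rough sketch of that preallocation, assuming a per-hour metrics document (the collection, key format and helper function are hypothetical):

```javascript
// Preallocate an hourly metrics document with correctly-typed placeholders,
// so later $set updates happen in place instead of moving the document
function preallocateHour(serverId, hourKey) {
    var doc = { _id: serverId + ":" + hourKey, values: {} };
    for (var min = 0; min < 60; min++) {
        doc.values[min] = NumberInt(0);  // typed placeholder, not null (null is a different type)
    }
    db.metrics.insert(doc);
}

// Call this ahead of time, spread over the hour rather than all at once
preallocateHour("web1", "2013-02-21T15");
```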
5. Field names take up space
This is less important if you only have a few million documents, but when you get up to billions of records, field names – which are stored inside every document – have a meaningful impact on your data size. Disk space is cheap but RAM isn’t, and you want as much in memory as possible.
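A quick illustration (field names invented):

```javascript
// Verbose field names are repeated in every single document
db.metrics.insert({ timestamp: new Date(), serverName: "web1", cpuLoadAverage: 0.91 })

// The same data with compact names saves bytes per document, which adds up at scale
db.metrics.insert({ t: new Date(), s: "web1", c: 0.91 })
```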
6. Consider using `_id` for your own purposes
Every collection gets an index on `_id` by default, so you can make use of this by supplying your own unique values instead of the generated ObjectId. For example, if you have a structure based on date, account ID and server ID, like we do with our server monitoring metrics storage for Server Density, you can combine those into the `_id` rather than having them each as separate fields. You can then query by `_id` using the single default index instead of maintaining a compound index across multiple fields.
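A sketch of the pattern (the key format is invented for illustration, not Server Density’s actual scheme):

```javascript
// Encode account, server and date into _id so the default unique index
// serves the main query pattern directly
db.metrics.insert({ _id: "acct42:web1:2013-02-21", values: {} })

// Exact lookup hits the single _id index
db.metrics.find({ _id: "acct42:web1:2013-02-21" })

// An anchored regex prefix match can also use the _id index
db.metrics.find({ _id: /^acct42:web1:/ })
```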
7. Can you use covered indexes?
If you create an index which contains all the fields you would query and all the fields that will be returned by that query, MongoDB never needs to read the documents themselves because everything is contained within the index. This significantly reduces the need to fit all data into memory for maximum performance. These are called covered queries. The explain output will show `indexOnly` as `true` if you are using a covered query.
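For example (index and field names are illustrative; note that `_id` must be excluded from the projection, otherwise the query is not covered):

```javascript
// The index contains every field the query filters on and returns
db.metrics.ensureIndex({ s: 1, c: 1 })

// Project only indexed fields and exclude _id, then check the plan
db.metrics.find({ s: "web1" }, { s: 1, c: 1, _id: 0 }).explain()
// indexOnly: true in the output confirms a covered query
```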
8. Use collections and databases to your advantage
You can split data up across multiple collections and databases:
- Dropping a whole collection is significantly faster than doing a `remove()` on the documents within it. This can be useful for handling retention, e.g. you could split collections by day and drop old ones, as sketched below. A large number of collections usually makes little difference to normal operations, but does have a few considerations such as namespace limits.
- Database-level locking lets you split up workloads across databases to avoid contention, e.g. you could separate high-throughput logging from an authentication database.
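A sketch of that drop-by-day retention approach (the `logs_YYYY_MM_DD` naming scheme is invented):

```javascript
// With one collection per day, retention is a cheap drop() per collection
// rather than an expensive remove() over millions of documents
var cutoff = "logs_2013_02_14";  // drop anything older than this
db.getCollectionNames().forEach(function (name) {
    if (/^logs_/.test(name) && name < cutoff) {
        db.getCollection(name).drop();
    }
});
```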
Test everything
People misinterpret @mongodb scalability. It's not easy per se, it's just easier. Still requires thought, understanding and testing
— David Mytton (@davidmytton) February 5, 2013
Make good use of the system profiler and explain output to check that you are really doing what you think you are doing, and run benchmarks of your code in production over a period of time. There are some great examples of problems uncovered this way in this schema design post.