Map reduce and MongoDB

By David Mytton,
CEO & Founder of Server Density.

Published on the 21st June, 2010.

MongoDB Logo

As a long time PHP developer and inactive member of the PHP documentation team, I am used to the excellent manual and clear examples for every aspect of the language. The lack of comparable documentation is always something I notice when using other languages; even if the documentation is good, it rarely compares to the PHP docs.

Of course, PHP has been around for a long time and expecting other, much newer projects to have the same level of coverage is unfair. So when I started to play around with the aggregation and map reduce functions within MongoDB, the database we use to power our hosted server monitoring tool, Server Density, it took some time and much trial and error to figure out how to use it. This was partly because I’d not used map reduce before, and partly because the documentation and examples seemed to assume a minimum level of experience.

But with a bit of searching and running queries myself, I was able to figure it out. This short post will provide a few examples of what I discovered, and point to a few useful resources elsewhere, with the hope of helping anyone else who wants to use map reduce and MongoDB.

The problem

We store a lot of numerical time series data for each server that reports into Server Density. I wanted to extract some statistics about the average values. For example, getting the average CPU load over a specific time period.


This problem can be solved quite easily using a group query. This allows you to do almost exactly what you can do with a map reduce function – perform operations on the returned data. In my case, grouping every document in a collection and then performing an average calculation on the results would look like this:

[sourcecode language=”javascript”]
initial: {count: 0, running_average: 0},
reduce: function(doc, out)
finalize: function(out)
out.average = out.running_average / out.count;

which generates output directly to the console:

[sourcecode language=”javascript”][
"count" : 204856,
"running_average" : 131204.59999999776,
"average" : 0.6404723317842669

You can also specify a cond parameter to the group() function and it will allow you to query the collection so it will only return documents matching that condition. This would allow you to specify a date range, for example

However, the group() function does not work in a sharded environment. Although not in production, we are planning to move to sharding when it is ready and so it made no sense to write code that made use of a function that will not be supported in the future.

Further, “the result is returned as a single BSON object and for this reason must be fairly small – less than 10,000 keys, else you will get an exception.” (from the docs) Map reduce is therefore the only option.

Map reduce

When we were evaluating database options this time last year, one of the reasons I dismissed CouchDB was its requirement that all queries be map reduce. MongoDB allows for ad-hoc queries which suited our requirements much better. But MongoDB added map reduce in v1.2 and is perfect for this kind of use case, even if it is a little more complicated to understand.

This is not a tutorial on the fundamentals of map reduce but it basically consists of two steps – get the data (map) then perform some intended function (reduce), with an optional one time end operation (finalize). The result is stored in a new collection so you can query it separately. This allows you to run jobs in the background whilst the older results are still being used.

In our case, the direct translation of the previous group query is:

[sourcecode language=”javascript”]res = db.runCommand( {
mapreduce: ‘loadAverages’,
map: function() {
emit( { }, { sum: this.v, recs: 1 } );
reduce: function(key, vals) {
var ret = { sum: 0, recs: 0 };
for ( var i = 0; i < vals.length; i++ ) {
ret.sum += vals[i].sum;
ret.recs += vals[i].recs;
return ret;
finalize: function (key, val) {
val.avg = val.sum / val.recs;
return val;
out: ‘result1’,
verbose: true
} );[/sourcecode]

This drops the results directly into a new collection called result1. I can then output the results by querying that collection:

[sourcecode language=”javascript”]> db[res.result].find()
{ "_id" : { }, "value" : { "sum" : 131204.59999999776, "recs" : 204856, "avg" : 0.6404723317842669 } }[/sourcecode]

It this case it has executed over every document in the collection but I can provide many other parameters to the map reduce function to narrow the results down (e.g. by date range) – query, sort and limit are supported.

A few things to be aware of

As of the current stable version (1.4.x):

  • Map reduce seems quite slow over a lot of documents. This is because it runs using the Javascript engine which is single threaded. The query above was executed on 204,865 documents and took 24,385ms to complete.
  • Sharding will help with this problem because the job will be distributed across servers (if your data is sharded of course).
  • Map reduce “almost” blocks in MongoDB 1.4. It doesn’t technically block since it “yields” where necessary, but we saw some blocking-like effects during the time the query was running. The next release addresses some of this to make it friendlier to other queries. See here.
  • You can run the map reduce query on your slave by executing the command db.getMongo().setSlaveOk() before your query. However, the group query did not appear to work even with this flag set. I have reported this as a bug.
  • Group is usually faster the map reduce. It our examples above it completed in about 3 seconds. This is expected. (source).
  • There is still development work planned to improve map reduce. See here.

Translate SQL to MongoDB map reduce

Perhaps the most useful resource I found was a PDF describing how to directly translate a SQL query into a MongoDB map reduce query. This was created by Rick Osborne but I have uploaded a copy here as it’s too useful to risk it disappearing!

MongoDB map reduce PDF screenshot

Other useful resources

Recommendation to doc writers

Start very simple and build up to the advanced queries, clearly explaining each element and the expected result. There is a lot of documentation that explains the really cool advanced stuff you can do but skips over the basis, which are just as important (if not more so).

Articles you care about. Delivered.

Help us speak your language. What is your primary tech stack?

Maybe another time