Map reduce and MongoDB


As a long-time PHP developer and inactive member of the PHP documentation team, I am used to the excellent manual and clear examples for every aspect of the language. The lack of comparable documentation is something I always notice when using other languages; even if the documentation is good, it rarely compares to the PHP docs.

Of course, PHP has been around for a long time and expecting other, much newer projects to have the same level of coverage is unfair. So when I started to play around with the aggregation and map reduce functions within MongoDB, the database we use to power our hosted server monitoring tool, Server Density, it took some time and much trial and error to figure out how to use it. This was partly because I’d not used map reduce before, and partly because the documentation and examples seemed to assume a minimum level of experience.

But with a bit of searching and running queries myself, I was able to figure it out. This short post will provide a few examples of what I discovered, and point to a few useful resources elsewhere, with the hope of helping anyone else who wants to use map reduce and MongoDB.

The problem

We store a lot of numerical time series data for each server that reports into Server Density. I wanted to extract some statistics about the average values. For example, getting the average CPU load over a specific time period.

Group

This problem can be solved quite easily using a group query. This allows you to do almost exactly what you can do with a map reduce function – perform operations on the returned data. In my case, grouping every document in a collection and then performing an average calculation on the results would look like this:

db.sd_boxedice_checksLoadAvrg.group(
{
     initial: {count: 0, running_average: 0},
     reduce: function(doc, out)
     {
          // accumulate a document count and a running total of each value
          out.count++;
          out.running_average += doc.v;
     },
     finalize: function(out)
     {
          // called once at the end to turn the running total into an average
          out.average = out.running_average / out.count;
     }
}
);

which generates output directly to the console:

[
	{
		"count" : 204856,
		"running_average" : 131204.59999999776,
		"average" : 0.6404723317842669
	}
]

You can also specify a cond parameter to the group() function, which restricts the operation to documents matching that condition. This would allow you to specify a date range, for example.
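
For illustration, the condition might look something like this. This is only a sketch: it assumes each document carries a timestamp field, here called t, which is not part of the examples above, and the dates are just placeholders.

db.sd_boxedice_checksLoadAvrg.group(
{
     // only include documents whose (assumed) timestamp falls in this range
     cond: { t: { $gte: new Date(2010, 5, 1), $lt: new Date(2010, 5, 2) } },
     initial: {count: 0, running_average: 0},
     reduce: function(doc, out)
     {
          out.count++;
          out.running_average += doc.v;
     },
     finalize: function(out)
     {
          out.average = out.running_average / out.count;
     }
}
);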

However, the group() function does not work in a sharded environment. Although we are not running sharding in production yet, we plan to move to it when it is ready, so it made no sense to write code that relies on a function that will not be supported in the future.

Further, “the result is returned as a single BSON object and for this reason must be fairly small – less than 10,000 keys, else you will get an exception.” (from the docs) Map reduce is therefore the only option.

Map reduce

When we were evaluating database options this time last year, one of the reasons I dismissed CouchDB was its requirement that all queries be map reduce. MongoDB allows for ad-hoc queries, which suited our requirements much better. But MongoDB added map reduce in v1.2 and it is perfect for this kind of use case, even if it is a little more complicated to understand.

This is not a tutorial on the fundamentals of map reduce, but it basically consists of two steps – get the data (map), then perform some intended function on it (reduce) – with an optional one-time end operation (finalize). The result is stored in a new collection so you can query it separately. This allows you to run jobs in the background whilst the older results are still being used.

In our case, the direct translation of the previous group query is:

res = db.runCommand( {
     mapreduce: 'loadAverages',
     map: function() {
          // emit each document under a single (empty) key, with its value and a count of 1
          emit( { }, { sum: this.v, recs: 1 } );
     },
     reduce: function(key, vals) {
          // add up the values and record counts emitted for this key
          var ret = { sum: 0, recs: 0 };
          for ( var i = 0; i < vals.length; i++ ) {
               ret.sum += vals[i].sum;
               ret.recs += vals[i].recs;
          }
          return ret;
     },
     finalize: function (key, val) {
          // called once per key at the end to calculate the average
          val.avg = val.sum / val.recs;
          return val;
     },
     out: 'result1',
     verbose: true
} );

This drops the results directly into a new collection called result1. I can then output the results by querying that collection:

> db[res.result].find()
{ "_id" : { }, "value" : { "sum" : 131204.59999999776, "recs" : 204856, "avg" : 0.6404723317842669 } }

In this case it has executed over every document in the collection, but I can provide other parameters to the map reduce function to narrow the results down (e.g. by date range) – query, sort and limit are supported.
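
As a sketch of what that looks like, restricting the same job to a date range just means adding a query document. Again this assumes a timestamp field called t, which is not shown in the examples above, and the dates and output collection name are placeholders.

res = db.runCommand( {
     mapreduce: 'loadAverages',
     map: function() {
          emit( { }, { sum: this.v, recs: 1 } );
     },
     reduce: function(key, vals) {
          var ret = { sum: 0, recs: 0 };
          for ( var i = 0; i < vals.length; i++ ) {
               ret.sum += vals[i].sum;
               ret.recs += vals[i].recs;
          }
          return ret;
     },
     finalize: function (key, val) {
          val.avg = val.sum / val.recs;
          return val;
     },
     // only map documents whose (assumed) timestamp falls in this range;
     // sort and limit can be added alongside query in the same way
     query: { t: { $gte: new Date(2010, 5, 1), $lt: new Date(2010, 5, 2) } },
     out: 'result2',
     verbose: true
} );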

A few things to be aware of

As of the current stable version (1.4.x):

  • Map reduce seems quite slow over a lot of documents. This is because it runs using the JavaScript engine, which is single threaded. The query above was executed on 204,856 documents and took 24,385ms to complete.
  • Sharding will help with this problem because the job will be distributed across servers (if your data is sharded of course).
  • Map reduce “almost” blocks in MongoDB 1.4. It doesn’t technically block since it “yields” where necessary, but we saw some blocking-like effects during the time the query was running. The next release addresses some of this to make it friendlier to other queries. See here.
  • You can run the map reduce query on your slave by executing the command db.getMongo().setSlaveOk() before your query (see the snippet after this list). However, the group query did not appear to work even with this flag set. I have reported this as a bug.
  • Group is usually faster than map reduce. In our examples above it completed in about 3 seconds. This is expected. (source).
  • There is still development work planned to improve map reduce. See here.
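
For the slave point above, the flag is set on the shell connection before issuing the same runCommand({ mapreduce: ... }) call shown earlier:

// Mark this shell connection as allowed to read from a slave,
// then run the map reduce job exactly as before.
db.getMongo().setSlaveOk();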

Translate SQL to MongoDB map reduce

Perhaps the most useful resource I found was a PDF describing how to directly translate a SQL query into a MongoDB map reduce query. This was created by Rick Osborne but I have uploaded a copy here as it’s too useful to risk it disappearing!


Other useful resources

Recommendation to doc writers

Start very simple and build up to the advanced queries, clearly explaining each element and the expected result. There is a lot of documentation that explains the really cool advanced stuff you can do but skips over the basics, which are just as important (if not more so).

  • giltotherescue

    Interesting post. We decided upon Cassandra for our non-relational data store, which unfortunately does not have map reduce baked in.

    One point which caught me off guard is that in your tests, map reduce is single-threaded/slow. I suppose I had not considered the downside of relying upon Mongo’s built-in map reduce – the number of workers can only be as large as the number of Mongo servers in your cluster. Is that accurate? It seems that in Mongo’s case, map reduce is more of a query language than a framework for distributed processing. If Mongo allowed you to map those functions to outside workers, now that would be powerful.

    As for me, I’ll be racking a Hadoop cluster in the very near future.

    • http://www.serverdensity.com David Mytton

      Yeh it’s single threaded which is a limitation of the JS based engine they use.

  • http://www.23divide.com Trystan Kosmynka

    Thanks for the article David.

    This is a bit out of context but I am sure you guys have some insight.

    I am curious what your approach has been when storing calculated data. Is it stored in separate report type documents?

    Calculated data could also be considered time series data: is each snapshot considered a document in a calculated values collection? Or is each value and its timestamp stored in a values array in an “Avg CPU load” document?

    The beauty of document stores is that you can organize the data in many ways. This also creates a bit of a learning curve in that no way seems completely correct and no way incorrect. Just curious what you guys have found and prefer.

    Thanks!

    • http://www.serverdensity.com David Mytton

      We store with a document per value, with a key so we can pull the same values from across time ranges (each doc has a timestamp). I’m not sure it really matters how you store so long as you can query and get the results you want. You can easily adjust the structure at any time.
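
      For illustration, a document along those lines might look something like the following; the field names other than v are just guesses, not our real schema:

      // hypothetical per-value document: v is the recorded value (as in the
      // queries in the post), k and t are assumed key and timestamp fields
      db.sd_boxedice_checksLoadAvrg.insert( {
           k: 'loadAvrg',
           v: 0.64,
           t: new Date()
      } );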

  • http://gravatar.com/anzenews andrew

    If you emit all the values under the same key in Map phase, the whole bunch of data goes to a single reducer, right? So the Reduce function is called just once, on one machine. Correct me if I’m wrong, but this is not a scalable way to perform this operation.

    If you want to make it scalable you need to group the values and emit 10, 20, 100 or so keys.

    Hope it makes sense. I come from the world of Hadoop, not MongoDB, so please correct me if I’m wrong. But MR concepts should be the same.
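
    As a sketch of that bucketing idea (assuming a timestamp field called t), the map function could emit one key per hour so the reduce work is split across many groups instead of one, giving per-hour averages that can be combined afterwards:

    map = function() {
         // truncate the (assumed) timestamp to the hour and use it as the key,
         // so each hour becomes its own reduce group; the reduce and finalize
         // functions from the post can stay the same
         var bucket = new Date(this.t);
         bucket.setMinutes(0); bucket.setSeconds(0); bucket.setMilliseconds(0);
         emit( { hour: bucket }, { sum: this.v, recs: 1 } );
    };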

  • Matt

    Hi David, regarding the comment that “the group() function does not work in a sharded environment” …

    I’m assuming this problem is still in effect? I’m part of a team that is implementing Mongo, and we are seeing slower than expected result sets for mapReduce. Our data store is going to need to facilitate close to real-time data aggregation of user behavior throughout the day, and M/R seems like a necessary evil. I’m finding it hard to escape the use of M/R, though we are still in the early stages and have not yet put sharding in place. Also, we are not yet using a data caching system like memcached, though I don’t know how useful that can be when you need real-time data. I’ve used it previously on more static data sets.

    Would you expect sharding to greatly/not so greatly improve the speed of mapReduce? Any thoughts would be appreciated.

    • http://www.serverdensity.com David Mytton

      m/r is slow because it’s single threaded. Sharding speeds things up because it can be run in parallel but it’s still slow.

  • Andrew

    M/R is slow – period. It is meant for scalable processing of huge amounts of data, not for fast processing. And never for real-time processing.
    That said, I haven’t tried MongoDB’s implementation, but I wouldn’t hold my breath. Their strength is in indexed access to sharded data. Use Hadoop for M/R, they have provided the drivers so that should be no problem…

  • Matt

    David and Andrew, thank you for the feedback, and the PDF is priceless. Have not yet taken a look at Hadoop, so that is next.