
Automating partitioning, sharding and failover with MongoDB

Two of the most eagerly anticipated features for MongoDB, the database backend we use for our server monitoring service, Server Density, are auto sharding and replica sets. Sharding will allow us to let MongoDB handle distribution of data across any number of nodes to maximise use of disk space and dynamically load balance queries. Replica sets provide automated failover and redundancy so you can be sure your data exists on any number of servers across multiple data centres.

This functionality has been in development for some time and is finally entering stable in the upcoming v1.6.0 release, due out in a few days. This post will take you through the basics of setting up a MongoDB cluster using auto sharding and ensuring you have full failover using replica sets.

Starting up the replica set

You can have any number of members in a replica set and your data will exist in full on each member of the set. This allows you to have servers distributed across data centres and geographies to ensure full redundancy. One server is always the primary, to which reads and writes are sent, with the other members being secondaries that accept reads only. In the event of the primary failing, another member will take over automatically.

The video embedded below shows the setup process.

You need a minimum of 2 members in each set and they must both be started before the set becomes available. We will start the first one on server1A now:

./mongod --rest --shardsvr --replSet set1/server1B
  • --rest This enables the admin web UI which is useful for viewing the status of the set. It is publicly accessible via HTTP on port 28017 so ensure it is properly firewalled.
  • --shardsvr This enables sharding on this instance, which will be configured later.
  • --replSet This uses the form setname/serverList. You must give each set a name ("set1") and specify at least 1 other member for the set. You do not need to specify them all – if the instance gets terminated, it will re-read the config from the specified servers when it comes back up.

When in production you will want to use the --fork and --logpath parameters so that mongod spawns off into a separate process and continues running when you close your console. They’re not used here so we can see the console output. Further tips about running MongoDB in the real world can be found here.
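
For example, a production start-up might look something like the following (the log path is just an illustration; adjust it to your own layout):

./mongod --rest --shardsvr --replSet set1/server1B --fork --logpath /var/log/mongodb/mongod.log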

The naming convention we are using for the server hostname is server[set][server], so this is server A in set 1 that will be connecting to server B in set 1. This just makes it a little easier to explain, but in real usage these will need to be actual hostnames that resolve correctly.

If you are running the instances on different ports, you must specify the ports as part of the parameters e.g. --replSet set1/server1B:1234,server1C:1234

You will see mongod start up with the following output to the console:

Sun Aug  1 04:27:15 [initandlisten] waiting for connections on port 27017
Sun Aug  1 04:27:15 [initandlisten] ******
Sun Aug  1 04:27:15 [initandlisten] creating replication oplog of size: 944MB... (use --oplogSize to change)
Sun Aug  1 04:27:15 allocating new datafile data/local.ns, filling with zeroes...
Sun Aug  1 04:27:15 [startReplSets] replSet can't get local.system.replset config from self or any seed (yet)
Sun Aug  1 04:27:15 done allocating datafile data/local.ns, size: 16MB,  took 0.036 secs
Sun Aug  1 04:27:15 allocating new datafile data/local.0, filling with zeroes...
Sun Aug  1 04:27:15 done allocating datafile data/local.0, size: 64MB,  took 0.163 secs
Sun Aug  1 04:27:15 allocating new datafile data/local.1, filling with zeroes...
Sun Aug  1 04:27:16 done allocating datafile data/local.1, size: 128MB,  took 0.377 secs
Sun Aug  1 04:27:16 allocating new datafile data/local.2, filling with zeroes...
Sun Aug  1 04:27:19 done allocating datafile data/local.2, size: 1024MB,  took 3.019 secs
Sun Aug  1 04:27:25 [startReplSets] replSet can't get local.system.replset config from self or any seed (yet)
Sun Aug  1 04:27:35 [startReplSets] replSet can't get local.system.replset config from self or any seed (yet)
Sun Aug  1 04:27:43 [initandlisten] ******
Sun Aug  1 04:27:43 [websvr] web admin interface listening on port 28017
Sun Aug  1 04:27:45 [initandlisten] connection accepted from 127.0.0.1:43135 #1
Sun Aug  1 04:27:45 [startReplSets] replSet can't get local.system.replset config from self or any seed (yet)

Next, we start the second member of the set (server1B). This will be connecting to server1A, the instance we just set up.

./mongod --rest --shardsvr --replSet set1/server1A

You will see similar console output on both servers, something like:

Sun Aug  1 04:27:53 [websvr] web admin interface listening on port 28017
Sun Aug  1 04:27:55 [initandlisten] connection accepted from server1A:38289 #1
Sun Aug  1 04:27:56 [initandlisten] connection accepted from 127.0.0.1:48610 #2
Sun Aug  1 04:27:56 [startReplSets] replSet can't get local.system.replset config from self or any seed (EMPTYCONFIG)
Sun Aug  1 04:27:56 [startReplSets] replSet   have you ran replSetInitiate yet?

Initialising the replica set

Now that the two instances are communicating, you need to initialise the replica set. This only needs to be done on one of the servers (either one, it doesn’t matter), so from the MongoDB console on that server:

./mongo localhost:27017
MongoDB shell version: 1.5.7
connecting to: localhost:27017/test
> cfg = {         
... _id : "set1",
... members : [
... { _id : 0, host : "server1A:27017"},
... { _id : 1, host : "server1B:27017"}
... ] }
> rs.initiate(cfg)
{
	"info" : "Config now saved locally.  Should come online in about a minute.",
	"ok" : 1
}

Here I created the config object and specified the members manually because I wanted to specify the ports, but you can just execute

rs.initiate()

and the current server, plus any members you specified in the command line parameters when starting mongod, will be added automatically. You may also want to specify some extra options, all of which are documented here.

Perhaps the most important of these extra options is priority. Setting this for each host will allow you to determine in which order they become primary during failover. This is useful if you have 3 servers, 2 in the same data centre and 1 outside for disaster recovery. You might want the second server in the same DC to become primary first; setting its priority higher than the outside server allows this.

> cfg = {         
_id : "set1",
members : [
{ _id : 0, host : "server1A:27017", priority : 2},
{ _id : 1, host : "server1B:27017", priority : 1},
{ _id : 2, host : "server1C:27017", priority : 0.5}
] }
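
If the set is already running, you can change priorities later without re-initialising. A minimal sketch, assuming the rs.conf() and rs.reconfig() shell helpers (wrappers around the replSetReconfig command) are available in your shell version; it must be run on the current primary:

> cfg = rs.conf()
> cfg.members[2].priority = 0.5
> rs.reconfig(cfg)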

The web console enabled by the --rest parameter can be accessed at the standard port 28017 e.g. http://example.com:28017. This shows the live status of the mongod instance.

MongoDB REST UI - Replica Set Status

Adding another server to the set

Adding a new server (server1C) to the replica set is really easy. Start the instance up specifying any one of the other members (or all of them) as part of the parameters:

./mongod --rest --shardsvr --replSet set1/server1A

Then on the primary server, connect to the MongoDB console and execute the add command:

./mongo localhost:27017
MongoDB shell version: 1.5.7
connecting to: localhost:27017/test
> rs.add("server1C:27017")
{ "ok" : 1 }

This server will then become part of the set and will immediately start syncing with the other members.
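
You can watch the initial sync from the MongoDB console on any member; rs.status() lists each member along with its current state, e.g. PRIMARY, SECONDARY, or RECOVERING while the new server catches up (output omitted here):

> rs.status()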

Setting up sharding

Now that we have our 3-member replica set, we can configure sharding. This has 3 parts:

  • Shard servers – the mongod instances. We have already set these up with the --shardsvr parameter when starting each mongod.
  • Config servers – these are mongod instances run with the --configsvr parameter that store the metadata for the shard cluster. As per the documentation, “a production shard cluster will have three config server processes, each existing on a separate machine. Writes to config servers use a two-phase commit to ensure an atomic and replicated transaction of the shard cluster’s metadata.”
  • mongos – processes that your clients connect to, which route queries to the appropriate shards. They are self-contained and will usually be run on each of your application servers.

Sharding Diagram

The video embedded below shows the setup process.

Config servers

Having already set up the shard servers above, the next step is to set up the config servers. We need 3 of these; here they will run on each of our shard servers, but they can be on their own, lower-spec machines if you wish. They do not require powerful servers as they have relatively low load, but they should be positioned redundantly so a machine or data centre failure will not take them all offline.

./mongod --configsvr --dbpath config/
  • --configsvr This enables config server mode on this mongod instance.
  • --dbpath Since I am running this config server on a server that already has another mongod instance running, a separate data path is specified. This isn’t necessary if the config server is running on its own.

This is executed on each of our 3 servers already running the shards. The console output will be something like this:

Sun Aug  1 08:14:30 db version v1.5.7, pdfile version 4.5
Sun Aug  1 08:14:30 git version: 5b667e49b1c88f201cdd3912b3d1d1c1098a25b4
Sun Aug  1 08:14:30 sys info: Linux domU-12-31-39-06-79-A1 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41
Sun Aug  1 08:14:30 [initandlisten] diagLogging = 1
Sun Aug  1 08:14:30 [initandlisten] waiting for connections on port 27019
Sun Aug  1 08:14:30 [websvr] web admin interface listening on port 28019

Nothing else happens until you connect a mongos to the config servers.

Router processes

The router process is what you connect your clients to. It downloads the metadata from the config servers and then routes queries to the correct shard servers. Routers stay up to date and are independent of each other, so they require no redundancy per se. Specify each config server in comma-separated form:

./mongos --configdb server1A,server1B,server1C
Sun Aug  1 08:19:44 mongodb-linux-x86_64-1.5.7/bin/mongos db version v1.5.7, pdfile version 4.5 starting (--help for usage)
Sun Aug  1 08:19:44 git version: 5b667e49b1c88f201cdd3912b3d1d1c1098a25b4
Sun Aug  1 08:19:44 sys info: Linux domU-12-31-39-06-79-A1 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41
Sun Aug  1 08:19:44 SyncClusterConnection connecting to [server1A:27019]
Sun Aug  1 08:19:44 SyncClusterConnection connecting to [server1B:27019]
Sun Aug  1 08:19:44 SyncClusterConnection connecting to [server1C:27019]
Sun Aug  1 08:19:54 [websvr] web admin interface listening on port 28017
Sun Aug  1 08:19:54 [Balancer] SyncClusterConnection connecting to [server1A:27019]
Sun Aug  1 08:19:54 SyncClusterConnection connecting to [server1A:27019]
Sun Aug  1 08:19:54 waiting for connections on port 27017
Sun Aug  1 08:19:54 [Balancer] SyncClusterConnection connecting to [server1B:27019]
Sun Aug  1 08:19:54 SyncClusterConnection connecting to [server1B:27019]
Sun Aug  1 08:19:54 [Balancer] SyncClusterConnection connecting to [server1C:27019]
Sun Aug  1 08:19:54 SyncClusterConnection connecting to [server1C:27019]

Create the shard

Now that you have all the mongo server instances running, you need to create the shard. Connect to the mongos instance using the MongoDB console, switch to the admin database and then add the shard.

./mongo
MongoDB shell version: 1.5.7
connecting to: test
> use admin
switched to db admin
> db.runCommand( { addshard : "set1/server1A,server1B,server1C", name : "shard1" } );         
{ "shardAdded" : "shard1", "ok" : 1 }
> db.runCommand( { listshards : 1 } );
{
	"shards" : [
		{
			"_id" : "shard1",
			"host" : "server1A,server1B,server1C"
		}
	],
	"ok" : 1
}

Note that the list of server hostnames includes the replica set name in the form [setName]/[servers]. The set name is what you called the replica set when you started the mongod instance with --shardsvr above. In our case we called it set1.

There are a number of config options here including the ability to set a maximum size on the shard so you can control disk space usage.
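
For example, here is a sketch of the same addshard call with a size cap; maxSize is specified in MB (as also used in the comments below), so the 20000 figure here is purely illustrative:

> db.runCommand( { addshard : "set1/server1A,server1B,server1C", name : "shard1", maxSize : 20000 } );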

Shard the database

We can now finally use the shard by enabling sharding on a database and executing a couple of test queries. Here we will use the test database from the MongoDB console connected to the mongos instance:

> use admin
switched to db admin
> db.runCommand( { enablesharding : "test" } );    
{ "ok" : 1 }
> use test
switched to db test
> db.hats.insert({hats: 5})
> db.hats.find()           
{ "_id" : ObjectId("4c5568021fd8e7e6a0636729"), "hats" : 5 }

You can confirm this is working from the console output on each of the shards themselves:

Sun Aug  1 08:26:42 [conn6] CMD fsync:  sync:1 lock:0
Sun Aug  1 08:26:42 [initandlisten] connection accepted from 10.255.62.79:38953 #7
Sun Aug  1 08:26:42 allocating new datafile data/test.ns, filling with zeroes...
Sun Aug  1 08:26:42 done allocating datafile data/test.ns, size: 16MB,  took 0.046 secs
Sun Aug  1 08:26:42 allocating new datafile data/test.0, filling with zeroes...
Sun Aug  1 08:26:42 done allocating datafile data/test.0, size: 64MB,  took 0.183 secs
Sun Aug  1 08:26:42 [conn6] building new index on { _id: 1 } for test.hats
Sun Aug  1 08:26:42 [conn6] Buildindex test.hats idxNo:0 { name: "_id_", ns: "test.hats", key: { _id: 1 } }
Sun Aug  1 08:26:42 [conn6] done for 0 records 0secs
Sun Aug  1 08:26:42 [conn6] insert test.hats 248ms
Sun Aug  1 08:26:42 [conn6] fsync from getlasterror
Sun Aug  1 08:26:42 allocating new datafile data/test.1, filling with zeroes...
Sun Aug  1 08:26:42 done allocating datafile data/test.1, size: 128MB,  took 0.402 secs

Sharding the collection

The test database is now sharded, but the documents will only exist on a single shard. To actually use the automated partitioning of data, you need to shard at the collection level. For example, setting the shard key on a timestamp will cause MongoDB to partition the data across shards based on that timestamp, e.g. day 1 on shard 1, day 2 on shard 2, etc.
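
As a sketch of that timestamp case, assuming a hypothetical test.logs collection with a ts field holding the timestamp:

> use admin
> db.runCommand( { shardcollection : "test.logs", key : { ts : 1 } } )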

In our example above we only have 1 shard and the test document is very simple, but we could create a shard key on the number of hats:

> use admin
switched to db admin
> db.runCommand( { shardcollection : "test.hats", key : { hats : 1 } } )
{ "collectionsharded" : "test.hats", "ok" : 1 }

In the mongos console you will see

Mon Aug  2 22:10:55 [conn1] CMD: shardcollection: { shardcollection: "test.hats", key: { hats: 1.0 } }
Mon Aug  2 22:10:55 [conn1] enable sharding on: test.hats with shard key: { hats: 1.0 }
Mon Aug  2 22:10:55 [conn1] no chunks for:test.hats so creating first: ns:test.hats at: shard1:set1/server1A,server1B,server1C lastmod: 1|0 min: { hats: MinKey } max: { hats: MaxKey }

The default chunk size is 50MB so data will not start to be distributed to multiple shards until you hit that.
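
If you want data to spread sooner, for example while testing with small datasets, the chunk size can be lowered when starting mongos; a sketch, assuming the --chunkSize parameter (in MB) is available in your mongos version:

./mongos --configdb server1A,server1B,server1C --chunkSize 1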

Notes on failover

  • Terminating a mongod instance, either a config server or a shard server, will have no effect on the availability of the data and the ability to perform all operations on it. A detailed explanation of what happens when certain servers fail can be found here.
  • However, in our case, if 2 out of 3 of the members of a replica set fail, the set will become read only even though there is a server remaining online. (source).
  • As such, a 3 member replica set with 2 members in one data centre and 1 member in another has a point of failure if you want the set to remain fully operational in the event of a DC outage – the 1 server on its own. To protect against this you would need to have 4 members – 2 per data centre.
  • Multiple replica sets per shard are not supported (source).

Future expansion

Now that this is set up, adding additional shards is as easy as provisioning a new replica set and using the addShard command on that new set. The data will be balanced automatically, so you have real horizontal scaling.
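
A sketch of what that might look like, assuming a second replica set named set2 running on server2A, server2B and server2C, set up exactly as above; run it from the mongos console on the admin database:

> db.runCommand( { addshard : "set2/server2A,server2B,server2C", name : "shard2" } );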

We have not yet deployed sharding and replica sets into production with Server Density – this is on our roadmap so I’ll be reporting back in a couple of months when we have been using it for some time. Subscribe to stay up to date!

  • http://endlesspaths.com Phil Plante

    Great setup guide for these new features. Would you mind sharing the hardware/software setup you’ve got here?

    I am going to be setting up an environment similar to this on Rackspace once 1.6 is final. Are you on a cloud?

    • http://www.serverdensity.com David Mytton

      We currently use physical servers from Rackspace. Slide 12 at http://www.slideshare.net/boxedice/mongouk-1-billion-documents has the details. Let me know if you have any questions about it, happy to share.

      • http://endlesspaths.com Phil Plante

        Yeah I saw those slides. Are you seeing any significant performance increase with sharding and map reduce queries? For what I am setting up I hope to get sharding distributing the datasets enough to use map reduce in near real time. Most of the datasets are small so it should be possible with a shardSize of 1mb.

        What are you setting the shardSize to?

      • http://www.serverdensity.com David Mytton

        We’re not using sharding yet because 1.6 isn’t out as stable yet. Improved map reduce performance is something we’re hoping will happen once we move to sharding, but it will take a little time for us to migrate as there are some backend changes we need to make.

        I have yet to decide on the shard size. Our largest collection will be several hundred GB and the smallest probably under 50MB. In that case we’ll need to reduce the shardSize to ensure all collections get sharded.

  • Nick Bimpasis

    Thank you for such a nice guide! This makes me feel a lot more comfortable about using Mongo in the future.

  • Pang Tat Sean

    Nice guide. I tried the implementation as stated in the article using a single server, but somehow, every time I stop one of the shards, it fails. The following is my configuration; is anything wrong with that?

    1. execute StartShard_1.bat
    ——————————————————————————————————–
    mongod --shardsvr --dbpath D:/mongodb1.5.6/data/a --port 10000 --replSet foo/10.70.0.122:10001

    2. execute StartShard_2.bat
    ——————————————————————————————————–
    mongod --shardsvr --dbpath D:/mongodb1.5.6/data/b --port 10001 --replSet foo/10.70.0.122:10000

    3. Connect to "mongo 10.70.0.122:10000" and initiate the replication

    config = {_id: 'foo', members: [
    {_id: 0, host: '10.70.0.122:10000'},
    {_id: 1, host: '10.70.0.122:10001'}]
    }

    rs.initiate(config)

    4. execute StartConfig.bat
    ————————————————————————————————–
    mongod --configsvr --dbpath D:/mongodb1.5.6/data/config --port 20000

    5. execute StartRouter.bat
    ————————————————————————————————–
    mongos --configdb 10.70.0.122:20000

    6. execute mongo in dos window
    use admin
    db.runCommand( { addshard : "foo/10.70.0.122:10000,10.70.0.122:10001", maxSize:2000 } )
    db.runCommand( { enablesharding : "NormTests" } )
    db.runCommand( { shardcollection : "NormTests.WeatherData", key : {WeatherDate : 1} } )

    7. Insert a record into NormTests through the C# client. Everything works just fine.

    8. Now I closed one of the shards, and the router seems to start having problems connecting to the closed shard.

    Any clue will be much appreciated.

    • http://www.serverdensity.com David Mytton

      You have created a replica set with 2 members, and that replica set is part of 1 shard. If you terminate either of the replica set members then the remaining one will switch to read only mode. You must have a majority of replica set members up for them to operate normally. This is why our example above uses 3 members, so 1 can go down and the other 2 remain online. Even in this case, you’d need 4 members for complete redundancy, 2 per data centre.

  • Martin Karlsch

    Cool stuff … works fine for me until I try to query the db :-(

    Our setup looks like this:

    six servers (ed1,ed2,ed3,ed4,ed5,ed6)

    ed1+ed2, ed3+ed4, ed5+ed6 are replica sets (seems to work fine)

    db.runCommand( { listshards : 1 } );

    {
    "shards" : [
    {
    "_id" : "shard1",
    "host" : "set1/ed1:10000,ed2:10000"
    },
    {
    "_id" : "shard2",
    "host" : "set2/ed3:10000,ed4:10000"
    },
    {
    "_id" : "shard3",
    "host" : "set3/ed5:10000,ed6:10000"
    }
    ],
    "ok" : 1
    }

    use admin
    db.runCommand( { enablesharding : "test" } );

    If I enable sharding for a collection I lose the ability to query via find; count still works.

    use test
    db.b.insert({x:1})
    db.b.find()
    >> { "_id" : ObjectId("4c5acc499497bcffcdc740dc"), "x" : 1 }
    use admin
    switched to db admin
    db.runCommand( { shardcollection : "test.b", key : { _id : 1 } } )
    >> { "collectionsharded" : "test.b", "ok" : 1 }
    use test
    db.b.count()
    >> 1
    db.b.find()
    >> error: { "$err" : "assertion client/dbclientcursor.cpp:196", "code" : 0 }

    Log says ..

    [conn2] Assertion failure connector == conn->get() client/dbclientcursor.cpp 196
    [conn2] ~ScopedDBConnection: _conn != null
    [conn2] UserException: assertion client/dbclientcursor.cpp:196

    Have you seen anything like this before? I tried several things like compiling master and v1.6 branch for 32 and 64 bits, remove all dbs and recreate and so on … still no luck.

    (Debian, Linux ed1 2.6.26-1-amd64 #1 SMP Sat Jan 10 17:57:00 UTC 2009 x86_64 GNU/Linux)

  • Michael

    Can you please explain this limitation? “Multiple replica sets per shard are not supported”, does it mean I can either shard or replicate a specific collection but not both?

    • http://www.serverdensity.com David Mytton

      No, it means that a shard can only be on a single replica set. E.g. you can’t do

      db.runCommand( { addshard : "set1/server1A,server1B,set2/server2A,server2C", name : "shard1" } );

      If this worked then you’d have the shard existing on 2 replica sets; if one went down then the shard would still be accessible because it is on 2. You can’t do this.

      When you shard a collection, it will be distributed across multiple shards (assuming you have more than 1). Think of a shard as the top level with the replica set below it. You could have 2 shards on 2 separate replica sets, or 2 shards on the same replica set. The replica set handles redundancy, the shard handles splitting the data. A single document within a collection will only ever exist on 1 shard, but could be replicated on multiple members of a replica set.

  • http://组成 james

    Suppose a cluster consists of many shards.
    If a single shard goes down, what happens? Can I query the other shards?

  • Mark Brassard

    Excellent walk-through to setup a mongoDB with replication and sharding. Can’t find how to do this correctly anywhere on MongoDB.org and in fact, they point people to your blog if they want to know how to do it. Little lax on Mongo’s part but I digress.

    Having stated that Mongo doesn’t do great docs, they also have no steps on how to replace a failed server in the replication/sharding setup. I have only seen examples where they simply “restart” the old server. Have you had to replace any servers in your above setup? And if so, what were the steps you used to do it? I’ve tried reconfiguring the replication with replSetReconfig. Everything appears to be working. (I see the logs on the new server attempting to sync up the data from the other servers.) But when I try to start up a mongos process (mongos --configdb server1_master,server1_secondary,server1_new) on server1_new, it dies and complains that the config for server1_new and the config for server1_master are not in sync.

    I have scoured high and low for anyone to have solved this but to no avail.

    Thanks.

    Mark

  • http://www.jasimmk.com Jasim Muhammed

    Great one David :)

    Will you please share if there is some way to benchmark the performance of the entire MongoDB system, i.e. how many requests it can handle?

    • http://www.serverdensity.com David Mytton

      Not that I know of.

  • shashikant soni

    Thanks for the fantastic post. I was able to set this up very quickly with these guidelines.

  • Seidenthal Steve

    Thank you very much for this amazing, 100% working post.