Update: We hosted a live Hangout on Air with Paul Done from MongoDB discussing how to monitor MongoDB. The slides and video are embedded at the bottom of this post.
We use MongoDB to power many different components of our server monitoring product, Server Density. This ranges from basic user profiles all the way to high throughput processing of over 30TB/month of time series data.
All this means we keep a very close eye on how our MongoDB clusters are performing, with detailed monitoring of all aspects of the systems. This post will go into detail about the key metrics and how to monitor your MongoDB servers.
Key MongoDB monitoring metrics
There is a huge range of different things you should keep track of with your MongoDB clusters, but only a few that are critical. These are the monitoring metrics we have on our critical list:
Oplog replication lag
The replication built into MongoDB through replica sets has worked very well in our experience. However, by default writes only need to be acknowledged by the primary member and are replicated to the secondaries asynchronously, i.e. MongoDB is eventually consistent by default. This means there is usually a short window in which data might not be replicated should the primary fail.
This is a known property, so for critical data, you can adjust the write concern to return only when data has reached a certain number of secondaries. For other writes, you need to know when secondaries start to fall behind because this can indicate problems such as network issues or insufficient hardware capacity.
Replica secondaries can sometimes fall behind if you are moving a large number of chunks in a sharded cluster. As such, we only alert if the replicas stay behind for more than a certain period of time, e.g. if they recover within 30 minutes then we don’t alert.
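The grace-period rule above can be sketched as a small check over a series of lag samples. This is an illustrative sketch, not our production code: the `should_alert` helper and the sample format are invented for this post, and it assumes you already collect the lag, e.g. by comparing the primary’s and a secondary’s optimes from rs.status().

```python
from datetime import timedelta

# Hypothetical helper: decide whether sustained oplog lag warrants an alert.
# `samples` is a list of (sample_time, lag_seconds) pairs, oldest first.
def should_alert(samples, max_lag_seconds=60, grace=timedelta(minutes=30)):
    breach_start = None
    for ts, lag in samples:
        if lag > max_lag_seconds:
            if breach_start is None:
                breach_start = ts  # start of a continuous breach
            if ts - breach_start >= grace:
                return True  # lag has stayed high for the whole grace period
        else:
            breach_start = None  # secondary caught up; reset the window
    return False
```

With minute-by-minute samples, a secondary that falls behind during a chunk migration but catches up within 30 minutes never triggers the alert.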
Replica state
In normal operation, one member of the replica set will be primary and all the other members will be secondaries. This rarely changes, and if there is a member election, we want to know why. Usually an election completes within seconds and the condition resolves itself, but we want to investigate the cause right away because there could have been a hardware or network failure.
Flapping between states is not a normal working condition and should only happen deliberately, e.g. for maintenance, or during a genuine incident such as a hardware failure.
Lock % and disk i/o % utilization
As of MongoDB 2.6, locking is at the database level, with work ongoing for document-level locking in MongoDB 2.8. Writes take a database-level lock, so if this happens too often you will start seeing performance problems as other operations (including reads) get backed up in the queue.
We’ve seen a high effective lock % be a symptom of other issues within the database, e.g. poorly configured indexes, missing indexes, disk hardware failures and bad schema design. It’s important to know when the value stays high for a long time, because a sustained lock can slow the server down (or make it unresponsive, triggering a replica state change) or cause the oplog to start lagging behind.
However, this alert can trigger too often, so you need to be careful. Set long delays, e.g. alert only if the lock remains above 75% for more than 30 minutes. If you already have alerts on replica state and oplog lag, you can make this a non-critical alert.
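The effective lock % itself can be derived from two successive serverStatus samples. A minimal sketch, assuming the MongoDB 2.x-era globalLock.lockTime and globalLock.totalTime cumulative counters (both in microseconds); this mirrors the figure mongostat reports:

```python
# Sketch: effective lock % between two serverStatus samples, assuming the
# MongoDB 2.x globalLock.lockTime / globalLock.totalTime counters
# (cumulative, in microseconds).
def effective_lock_percent(prev, curr):
    lock_delta = curr["globalLock"]["lockTime"] - prev["globalLock"]["lockTime"]
    total_delta = curr["globalLock"]["totalTime"] - prev["globalLock"]["totalTime"]
    if total_delta <= 0:
        return 0.0  # no elapsed time between samples
    return 100.0 * lock_delta / total_delta
```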
Related to this is how much work your disks are doing, i.e. disk i/o % utilization. Utilization approaching 100% indicates your disks are at capacity and you need to upgrade them, e.g. from spinning disk to SSD. If you are already using SSDs, the options are to provide more RAM or to split the data across shards.
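Disk i/o % utilization can be computed the same way iostat does it, from the cumulative "milliseconds spent doing I/Os" counter for the device (field 13 of its line in /proc/diskstats on Linux). A sketch under that assumption:

```python
# Sketch: disk i/o % utilization over a sampling interval, iostat-style.
# io_ms_prev / io_ms_curr are two readings of the cumulative "milliseconds
# spent doing I/Os" counter (field 13 of the device's /proc/diskstats line).
def disk_util_percent(io_ms_prev, io_ms_curr, interval_ms):
    busy_ms = io_ms_curr - io_ms_prev
    # Cap at 100: the counter can overshoot the interval slightly on
    # devices that service requests in parallel.
    return min(100.0, 100.0 * busy_ms / interval_ms)
```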
Non-critical metrics to monitor MongoDB
There is a range of other metrics you should keep track of on a regular basis. Even though they are non-critical, investigating and dealing with them early will help prevent issues from escalating into critical production problems.
Memory usage and page faults
Memory is probably the most important resource you can give MongoDB and so you want to make sure you always have enough! The rule of thumb is to always provide sufficient RAM for all of your indexes to fit in memory, and where possible, enough memory for all your data too.
Resident memory is the key metric here – MongoDB provides some useful statistics to show what it is doing with your memory.
Page faults are related to memory because a page fault happens when MongoDB has to go to disk to find the data rather than memory. More page faults indicate that there is insufficient memory, so you should consider increasing the available RAM.
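Since serverStatus reports page faults as a cumulative counter (extra_info.page_faults on Linux), the useful number is the rate over time. A minimal sketch; the 10/sec threshold below is our own placeholder, not a MongoDB recommendation:

```python
# Sketch: page fault rate from two samples of the cumulative
# extra_info.page_faults counter in serverStatus (available on Linux).
def page_fault_rate(prev_faults, curr_faults, interval_sec):
    return (curr_faults - prev_faults) / interval_sec

# A sustained rate well above zero suggests the working set no longer
# fits in RAM; the threshold here is an arbitrary placeholder.
def memory_pressure(fault_rate, threshold=10.0):
    return fault_rate > threshold
```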
Connections
Every connection to MongoDB has an overhead that contributes to the memory required by the system. The number of connections is initially limited by the Unix ulimit settings, but beyond that it is limited by the server resources, particularly memory.
High numbers of connections can also indicate problems elsewhere, e.g. requests backing up due to a high lock %, or application code opening too many connections.
Shard chunk distribution
MongoDB will try to balance chunks equally across all your shards, but balancing can start to lag behind if there are constraints on the system, e.g. a high lock % slowing down moveChunk operations. You should regularly keep an eye on how balanced the cluster is.
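A quick balance check can apply the same migration thresholds the MongoDB balancer uses (the difference in chunk counts between the most- and least-loaded shard: 2 for collections under 20 chunks, 4 for 20–79, 8 for 80 or more, as documented for 2.x-era clusters). The chunk counts themselves come from sh.status(); the helper below is our own sketch:

```python
# Sketch: is a sharded collection balanced? `chunks_per_shard` maps shard
# name -> chunk count, e.g. built from sh.status() output. Thresholds are
# the documented 2.x balancer migration thresholds.
def is_balanced(chunks_per_shard):
    total = sum(chunks_per_shard.values())
    threshold = 2 if total < 20 else 4 if total < 80 else 8
    spread = max(chunks_per_shard.values()) - min(chunks_per_shard.values())
    return spread < threshold
```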
Tools to monitor MongoDB
Now that you know what to keep an eye on, you need to know how to actually collect those monitoring statistics!
Monitoring MongoDB in real time
MongoDB includes a number of tools out of the box. These are all run against a live MongoDB server and report stats in real time:
- mongostat – this shows key metrics like op counts, lock %, memory usage and replica set status, updated every second. It is useful for real-time troubleshooting because you can see what is going on right now.
- mongotop – whereas mongostat shows global server metrics, mongotop looks at the metrics on a collection level, specifically in relation to reads and writes. This helps to show where the most activity is.
- rs.status() – this shows the status of the replica set from the viewpoint of the member you execute the command on. It’s useful to see the state of members and their oplog lag.
- sh.status() – this shows the status of your sharded cluster, in particular the number of chunks per shard so you can see if things are balanced or not.
MongoDB monitoring, graphs and alerts
Although the above tools are useful for real time monitoring, you also need to keep track of statistics over time and get notified when metrics hit certain thresholds – some critical, some non-critical. This is where a monitoring tool such as Server Density comes in. We can collect all these statistics for you, allow you to configure alerts and dashboards and graph the data over time, all with minimal effort.
If you already run your own on-premise monitoring using something like Nagios or Munin, there are a range of plugins for those systems too.
MongoDB themselves provide free monitoring as part of the MongoDB Management Service (MMS). This collects all of the above statistics with alerting and graphing, similar to Server Density but without the wider system, availability and application monitoring.