MongoDB, CPU cores and vmmemctl
Written by David Mytton
Having reduced our MongoDB connection overhead, we continued to see response time spikes that correlated with spikes in the MongoDB queues reported by mongostat. The spikes occurred at random; at all other times, performance was normal.
We soon discovered that during the spikes, load averages on the mongod nodes were very high. CPU usage as a percentage stayed low because we had 8 cores per mongod node, but mongod was still waiting for CPU time (indicated by the high load average). Our application is write-intensive and, unlike reads, MongoDB writes do not take advantage of multiple cores, so it is very sensitive to CPU wait.
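The pattern described above (high load average alongside low CPU usage) can be checked with a few lines of Python. This is a minimal sketch, not the tooling we used: the function names and the thresholds are illustrative assumptions, and on Linux the load average also counts tasks blocked on I/O, so a positive result is only a hint that processes are queuing for CPU time.

```python
import os


def load_per_core(load_1min: float, cores: int) -> float:
    """Return the 1-minute load average divided by the number of cores."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return load_1min / cores


def looks_cpu_starved(load_1min: float, cores: int, cpu_busy_pct: float,
                      load_threshold: float = 1.0,
                      busy_threshold: float = 50.0) -> bool:
    """Flag the 'high load, low CPU usage' pattern.

    Both thresholds are assumed values chosen for illustration only.
    """
    return (load_per_core(load_1min, cores) > load_threshold
            and cpu_busy_pct < busy_threshold)


if __name__ == "__main__":
    # os.getloadavg() is available on Unix-like systems only.
    load_1min, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"load per core: {load_per_core(load_1min, cores):.2f}")
```

For example, a load average of 16 on an 8-core node with CPU usage at 20% gives a load per core of 2 and is flagged, while a load average of 4 on the same node is not.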
Our infrastructure is virtualised on the Terremark Enterprise Cloud, which is based on VMware. As is normal for such an environment, the host machine reclaims unused resources for other guests and does this for memory using a kernel process called vmmemctl:
The purpose of vmmemctl is to trigger guest-level page out of blocks of memory which can then be provided back to the host for allocation to VMs which are requesting additional memory. This does involve some clock time on the CPUs but generally does not cause a noticeable performance impact.
We worked with Terremark support to see if there was a problem on the host and it turned out we were seeing issues with CPU Ready time.
CPU Ready is the amount of time a VM is sitting with active tasks pending awaiting a turn at the host CPU resources. Larger devices incur larger CPU Ready times due to the fact that all vCPUs would have to be scheduled against the host resources effectively together. VMware’s recommendation would be to reduce this device to a single CPU or at most 2 if processes running on the VM are capable of hyperthreading.
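vSphere performance charts report CPU Ready as a summation in milliseconds per sampling interval, which is easier to reason about as a percentage. The sketch below applies the commonly documented conversion; the function name is hypothetical, the 20-second default assumes the real-time chart interval, and the figures in the usage note are made up for illustration rather than taken from our environment.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float = 20.0,
                      vcpus: int = 1) -> float:
    """Convert a CPU Ready summation (ms) into a percentage of the interval.

    interval_s defaults to 20 seconds, assumed here to be the real-time
    chart sampling interval. Passing vcpus > 1 averages a VM-level
    summation across its virtual CPUs.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if vcpus < 1:
        raise ValueError("vcpus must be at least 1")
    return ready_ms * 100.0 / (interval_s * 1000.0 * vcpus)
```

With these assumptions, a summation of 1,000 ms over a 20-second interval works out to 5% ready time, and a VM-level summation of 8,000 ms spread over 8 vCPUs is also 5% per vCPU.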
This recommendation was implemented and we saw an immediate improvement – the spikes disappeared and vmmemctl stopped eating up CPU. It seems likely that the main cause was the CPU wait caused by the overhead of multiple cores that weren’t actually in use, with vmmemctl being a symptom of this. However, it’s worth noting how vmmemctl works in relation to MongoDB:
The balloon driver only reclaims memory that is not being actively used to provide it to VMs that are actively requesting utilization of memory. This is calculated through “idle tax,” which determines based on utilization how much of memory granted to a device is going unused. That memory is flagged as available in the case of memory pressure to be ballooned out by vmmemctl as needed. Normally, this is again not performance impacting aside from a brief CPU burn while memory pages in/out as relevant. This is greatly preferable to memory swap activity, which is when memory paging is written to disk (and in the case of this infrastructure, that means writing to SAN). As long as the memory balloon method can accommodate memory needs in a pool, swapping doesn’t come into play.
VMware’s vmmemctl is essentially querying the guest’s memory management for memory blocks not currently flagged to particular, active processes. If the guest OS considers a memory page to be in active use, it won’t be a valid target for vmmemctl to request. The vmmemctl process does not take action without reference to the guest OS; rather, it requests memory pages be assigned to it BY the guest OS, effectively causing the guest to treat those pages as though they are in active use and thus reserved for vmmemctl. Pages reserved for vmmemctl are lower priority than other active memory needs at the guest level, so if the memory pages are needed by other active processes, vmmemctl notifies the hypervisor to return the blocks and then releases them back to the guest OS.
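One way to see how much memory the balloon driver currently holds inside a guest is the VMware Tools command `vmware-toolbox-cmd stat balloon`. The sketch below wraps it in Python; it was not part of our investigation, the helper names are hypothetical, and the output format it parses (a number followed by "MB") is an assumption that may differ between VMware Tools versions.

```python
import re
import shutil
import subprocess
from typing import Optional


def parse_balloon_mb(output: str) -> int:
    """Extract the ballooned size in MB from text such as '512 MB'.

    The '<number> MB' format is an assumption about the tool's output.
    """
    match = re.search(r"(\d+)\s*MB", output)
    if match is None:
        raise ValueError(f"unrecognised balloon output: {output!r}")
    return int(match.group(1))


def current_balloon_mb() -> Optional[int]:
    """Return the ballooned memory in MB, or None if VMware Tools is absent."""
    if shutil.which("vmware-toolbox-cmd") is None:
        return None
    result = subprocess.run(
        ["vmware-toolbox-cmd", "stat", "balloon"],
        capture_output=True, text=True, check=True,
    )
    return parse_balloon_mb(result.stdout)
```

Sampling this value alongside mongostat output would show whether balloon growth lines up with queue spikes.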
The thing to note here relates to MongoDB’s memory-mapped files, which may appear unused but actually be in use. If that memory is paged out, performance can suffer. Since we solved the problem by adjusting the VM CPU configuration, it’s difficult to know whether the vmmemctl activity also contributed to the performance issues.
The advice from 10gen is to run at least the primary mongod nodes on physical machines where reclaiming memory and other resource contention is not an issue. This is something we’re now looking into as Terremark are able to deploy physical hardware and connect it to our cloud environment in a hybrid model.