Removing Memcached because it’s too slow

Update: There has been a lot of discussion around this post and whether the problem is with Memcached or something else. The post content is accurate but, in hindsight, “Memcached” should really have read “Membase + Moxi”. Membase provides additional tools on top of Memcached, so whilst Memcached itself wasn’t slow, the Moxi proxy we used to provide failover was. The point is that what MongoDB provides out of the box for failover is similar to what Membase + Moxi provides, so whilst it’s accurate that we replaced Memcached because using MongoDB directly was faster than relying on the Moxi proxy, this should have been clearer.

We’ll shortly be deploying some changes to the Server Density codebase to remove Memcached as a component in the system. We currently use it for two purposes:

  1. UI caching: the initial load of your account data e.g. server lists, alert lists, user lists, is taken directly from the MongoDB database and then cached until you make a change to the data, at which point we invalidate the cache.
  2. Throttling: the performance impact of the global lock in MongoDB 1.8 was such that we couldn’t insert our monitoring postback data directly into MongoDB – it had to be written into Memcached first and then throttled into MongoDB by a small number of processor daemons (rather than a much larger number of web clients writing directly); see the sketch below.
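
As a rough illustration of that second use case – a simplified sketch rather than our actual code; the key names, collection name and rate limit are made up, and it assumes the python-memcached and pymongo libraries:

    import time
    import memcache
    import pymongo

    mc = memcache.Client(["127.0.0.1:11211"])
    db = pymongo.MongoClient("localhost", 27017)["sdensity"]

    def accept_postback(payload):
        """Web tier: buffer the postback (a dict) in Memcached instead of writing to MongoDB."""
        seq = mc.incr("postbacks:head")  # counter seeded once with mc.set("postbacks:head", "0")
        mc.set("postbacks:%d" % seq, payload)

    def drain(batch=100, interval=1.0):
        """Processor daemon: pull buffered postbacks and insert them into MongoDB
        at a controlled rate, so only a few writers contend for the global lock."""
        tail = 0
        while True:
            head = int(mc.get("postbacks:head") or 0)
            for seq in range(tail + 1, min(head, tail + batch) + 1):
                doc = mc.get("postbacks:%d" % seq)
                if doc is not None:
                    db.postbacks.insert_one(doc)
                    mc.delete("postbacks:%d" % seq)
            tail = min(head, tail + batch)
            time.sleep(interval)  # throttle the write rate into MongoDB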

Performance map

This has worked well for over a year now but with the release of MongoDB 2.0, the impact of the global lock is significantly reduced because of much smarter yielding. This is only set to get better with database level locking in 2.2 and further concurrency improvements in future releases.

We’ve already removed throttling from other aspects of our codebase but our performance metrics show that we’re now finally able to remove Memcached completely, because directly accessing MongoDB is significantly faster. Indeed, our average database query response time is 0.43ms compared to 24.2ms from Memcached.
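
For the UI caching case this means dropping the cache-aside layer entirely – again just a sketch under assumptions, with hypothetical collection, field and key names, using python-memcached and pymongo:

    import memcache
    import pymongo

    mc = memcache.Client(["127.0.0.1:11211"])
    db = pymongo.MongoClient("localhost", 27017)["sdensity"]

    def server_list_cached(account_id):
        """Old path: cache-aside read through Memcached, invalidated on change."""
        key = "servers:%s" % account_id
        servers = mc.get(key)
        if servers is None:
            servers = list(db.servers.find({"account_id": account_id}))
            mc.set(key, servers)
        return servers

    def server_list_direct(account_id):
        """New path: with sub-millisecond query times, an indexed read replaces the cache."""
        return list(db.servers.find({"account_id": account_id}))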

Database throughput

Response time

We have a number of MongoDB clusters and these figures are for our primary data store where all application data lives (separate from our historical time series data). There are 2 shards, each made up of 4 data nodes, with 2 nodes per data centre (Washington DC and San Jose in the US). They are dedicated servers running Ubuntu 10.04 LTS with 8GB RAM, quad-core 3.4GHz Intel Xeon (Sandy Bridge) CPUs and 100GB SSDs for the MongoDB data files, connected to a 2Gbps internal network.

Nodes

Removing Memcached as a component simplifies our system even further, so our core technology stack will consist only of Apache, PHP, Python, MongoDB and Ubuntu. This eliminates the need for Memcached itself running on a separate cluster, the Moxi proxy to handle failover, additional monitoring for another component and a different scaling profile. Getting memcached libraries for PHP and Python is also a pain if you want to use officially supported packages (through Ubuntu LTS), especially when you want to use later releases. And we can get rid of that extra 24ms of response time.
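
The failover piece in particular moves into the database driver – a minimal sketch assuming pymongo, where the host names, replica set name and database name are illustrative:

    from pymongo import MongoClient

    # The client is given the replica set members and discovers the primary itself;
    # if the primary fails, it reconnects to the newly elected primary, so there is
    # no equivalent of the Moxi proxy layer to run, monitor and scale.
    client = MongoClient(
        "mongodb://node1.example.com:27017,node2.example.com:27017",
        replicaSet="app-data",
    )
    db = client["sdensity"]
    print(db.servers.count_documents({}))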

  • Evan

    It sounds like the real issue was either Moxi or your dynamic language client libraries. Memcached 99th percentile latency should be in the single digit milliseconds.

    • http://www.serverdensity.com David Mytton

      Regardless of whether the problem is within Moxi or the client library, that latency doesn’t exist when using MongoDB, so it provides another reason to remove Memcached rather than try to diagnose where the issue is in some poor third-party code.

      • Evan

        That’s a sloppy way to look at it. People (including those in your own organization) would be better informed if you had actually found the source of the performance fault, rather than blaming it on the most visible part of the stack.

        Of course if you don’t absolutely need a component you should remove it. But the lesson you derived wasn’t “simplify”, it was “memcached is slow”, which is false.

    • Bill Getas

      It’s so hilarious reading the immediate doubters, when anyone who’s ever futzed around with memcached on PHP knows what a total POS its drivers are, what a friggin messed up state they’ve been in for years, and what a (literal) waste of time it is to have another network-accessible stack of anything in the way. I say ‘good riddance’ to memcached… and hope and watch for mongo to develop a shim or flag to allow some portion of mongo memory to be used as a ‘keep in memory’ store for exactly the purpose of a local, high-speed cache.

      • Bill Getas

        Based on this article, apparently the only straight-ahead one on the net, we’ve decided to finally set memcached out to pasture. Maybe it will come back, tho since it was never actually fully present or correctly working anyway, well, what have we really lost?

  • http://twitter.com/grossberg Joe Grossberg (@grossberg)

    Hmm … memcached taking 50x as long as Mongo doesn’t pass the “smell test” … perhaps your metrics tools are giving inaccurate results or there is some weird network latency when hitting memcached?

    • http://www.serverdensity.com David Mytton

      The metrics aren’t inaccurate – they’re collected using New Relic which uses the same method as for the MongoDB metrics. If there is any “weird network latency” it’s Memcached specific because it’s the same network being used for the MongoDB nodes.

  • http://twitter.com/masukomi masukomi (@masukomi)

    >This eliminates the need for Memcached itself running on a separate cluster

    Those last 5 words hold the keys to understanding this article.

    • http://www.serverdensity.com David Mytton

      MongoDB is also running on a separate cluster.

  • http://twitter.com/unixx Brandon Huey (@unixx)

    It’s highly unlikely that memcached is responsible for your 24ms latency. Perhaps you are misusing your client library. My experience with PHP + memcached is that there can be a -very- wide performance fluctuation from one version and distribution (pecl memcache, memcached, etc) to the next.

    • http://www.serverdensity.com David Mytton

      That’s a problem with the client library then. I shouldn’t need to switch libraries to get better performance. The MongoDB library works properly so that just gives me another reason to get rid of Memcached.

      • http://twitter.com/unixx Brandon Huey (@unixx)

        But, you didn’t solve your problem. You just removed something from your architecture that you suspected was broken without proving anything.

        That is atypical Memcached performance and almost assuredly points to an implementation error or poor metrics.

        The first step would be to get higher resolution metrics than New Relic can provide.

  • http://twitter.com/SaltwaterC Ștefan Rusu (@SaltwaterC)

    The metrics tool is New Relic, and it proves to be pretty accurate. In our case, from a cluster of Couchbase servers running memcache buckets, on t1.micro EC2 instances, we get about 5.4 ms of average latency. The architecture is php-fpm + pecl memcache + Moxi (on each frontend web server) + the above described cluster. From six micros, which sometimes have a rather crappy network support, we get about 24k cpm. Most of the cached objects are around 10-15 kB.

    Removing a layer of complexity is great. I wish I could do that some day. But I wouldn’t blame that latency on Memcache alone.

  • David Mitchell

    For comparison, what were the specs of the memcache servers and how were they segmented?

    • http://www.serverdensity.com David Mytton

      We run Memcached on the same spec servers as most of our MongoDB clusters which rules out any difference in hardware spec as a source of the problem.

  • http://melbournedevops.wordpress.com Marcus

    I’m pretty sure the metrics reported by New Relic are the averaged aggregate of time spent in all memcache calls per request and not the average response time for a single memcache command. We have New Relic instrumentation on a small subset of our web pool as well as some detailed internal metrics also. Hopefully I can help clarify what you might be seeing.

    For comparison’s sake, here at Etsy we’re all PHP on the front with memcache sitting between our MySQL shards and the last of our Postgres infrastructure. We functionally partition our memcache pools (we have three).

    Our heavy hitter pool is 8 physical servers with 320GB of memory allocated to cache, 305GB of which is active. We’re doing around 240k gets/sec, and overall around 320k commands/sec in total on this pool.

    The average get latency (per request) is around 0.56ms (that’s half a millisecond). incr, cas, append and set latency hovers around 0.3ms. These measurements are captured on the web app side (statsd wrappers around the memcache commands) so that time also includes network RTT. Both web servers and memcached are 1Gbe connected to our 10Gbe core and it’s all physical hardware.

    Now taking a look at our production New Relic metrics, we see around 28-30ms avg. reported in memcache per request. In aggregate this is about right. We memcache the heck out of EVERYTHING and your average Etsy page runs in the ballpark of 60-70 memcache commands per page. Looking at our individual command latency – say 0.5ms for ease of calculation – then 70 x 0.5 = 35ms, ta da!

    One thing I’d recommend is taking a look at the total number of memcache calls you are making per request. If you actually are making a single request that’s costing 24ms then something most certainly feels broken.

    If you are executing more than one memcache command per request I suspect the per command latencies will tend more towards that of a typical memcache configuration. Reducing your original 24ms would be a case of reducing the number of calls – something we are working on ourselves :)
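
    To make that concrete, here’s a toy version of the kind of wrapper I mean (in Python for brevity – ours is PHP – and not our real instrumentation; metric names are made up), assuming the python-memcached and statsd packages:

        import time
        import memcache
        import statsd

        mc = memcache.Client(["127.0.0.1:11211"])
        stats = statsd.StatsClient("localhost", 8125)

        def timed_get(key):
            start = time.time()
            value = mc.get(key)
            # Measured on the app side, so network RTT is included; timing is in milliseconds.
            stats.timing("memcache.get", (time.time() - start) * 1000.0)
            return value

        # A page doing ~70 gets at ~0.5ms each shows up as ~35ms "in memcache"
        # per request in an aggregate view, even though each individual call is fast.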

    Hope that helps, feel free to ping me if you want any more detail.

    Cheers,
    Marcus

    • http://www.serverdensity.com David Mytton

      The New Relic map view in the screenshots of the post is showing the average full time spent in Memcached calls throughout the request. You can then drill down further to look at individual calls in a single request to see where the time is being spent.

      If we take the example of our most common request – a postback from one of our monitoring agents – we do 1 call to Memcached (add) and 10 calls to MongoDB. 43% of the transaction time (16.5ms average over the last 30 minutes as I write this) is spent in that Memcached add compared to 4.1% (1.59ms) across all 10 MongoDB calls. As you said, this explains the low average query time from MongoDB (1.59ms / 10 calls = 0.16ms).

      Obviously something isn’t right here because Memcached is supposed to be fast – it’s all in memory. I mentioned in the post that to get the same kind of functionality we have with MongoDB in terms of failover and redundancy, we have to run multiple components for Memcached. There is a separate Memcached cluster (i.e. rather than running locally on each app server) and we use Moxi to proxy the connections to handle the failover. I expect that adds to the query time, which doesn’t exist with MongoDB because failover is built into the database rather than as an extra layer you have to query through.

      We could spend time debugging this to find out where the bottleneck is but if it is indeed with Moxi, we’d still have to use it to handle failover or rearchitect how we use Memcached so that failure of a specific node is handled in a different way.

      The point is that in our case it’s better to remove Memcached as a component rather than spend that time investigating when it’s probably pointless. Simplifying the components is a better outcome for us. If we had to keep Memcached for some reason then that would be incentive to investigate the root cause. The title of the post is deliberately extreme – headlines are designed to draw attention ;)

  • http://twitter.com/alexnobert Alex Nobert (@alexnobert)

    Since you deleted my previous comment…

    David, why don’t you explain to the ops people to whom you are marketing this product why we should trust a company that either does not have ops of their own or has one that ignores fundamental principles of ops?

    Also what are the scaling implications of moving your UI cache to your main datastore?

    • http://www.serverdensity.com David Mytton

      I deleted your comment because it was essentially trolling[1]. I’m happy to discuss criticisms of the post but yours was the only snarky comment, which was unnecessary.

      From a purely technical perspective, spending time researching the root cause would be interesting and I would want to determine where the actual problem was if we wanted to fix it. This research might be appropriate if we were a larger team or needed to keep memcached as part of the stack. However, removing a component to simplify the architecture is a much better outcome and better use of time. It is most likely a result of using moxi to provide failover, because this is an additional layer in between the code and memcached itself. If you use memcached purely as a cache then you can set up a memcached instance on each app server to eliminate network traffic and the need for something like moxi. But because we were originally using it to throttle into MongoDB, it made more sense to have a separate cluster.

      We have a dedicated ops engineer but with 10 employees, I decided against assigning 10% of the company manpower to this. There are more important projects which require engineering time so investigating the cause of this wasn’t the best use of time given we wanted to remove it anyway.

      There are always tradeoffs between what you want to do as an engineer and what makes most sense to allocate time to.

      As regards scaling implications for the UI cache, there is no cache. Memcached as a layer in between the UI and the database is unnecessary given the response time we get from the database. Also, our usage patterns mean that we don’t have a heavy load on the UI (a monitoring service tends to be used reactively rather than constantly) – it’s handling the incoming agent postbacks where the technical challenges lie.

      [1] Can be read at http://twitpic.com/9opos6

      • http://twitter.com/alexnobert Alex Nobert (@alexnobert)

        Soooo… you are complaining about a snarky comment to a blog post titled “Removing memcached because it’s too slow” which you admit to be “deliberately extreme” and is completely misleading/blatantly false. Right, okay. Moving on now.

      • Bill Getas

        LOL You’ve spent more time answering bozo comments than pressing on with the removal of memcache. I’ve long suspected something very wrong with memcache + PHP…the very long insanity of the driver situation, with no apparent end in sight, the fractured nature of it all, the lack of central anything (except for the memcached itself, useless without driver), and the weirdness of having to twist around all kinds of local logic to make it work…all to get, as we’ve seen, very dubious benefit and possibly even drawback. Minimizing calls to memcache?!? Huh? I thought the whole point was to use the cache. We’ve been finding that it’s far better to give more memory to mongo and use mongo for all storage outside of what’s running in PHP instances. We’ve ditched practically _everything_ else on many servers, and, as noted above, the rest will be gone soon enough.

  • ssoroka

    Likely source of issues: memcached is running out of RAM and swapping to disk, or you’re running it in a VM, or something equally naive. :D

  • Will Gant

    We’re running into the same thing. However, our architecture is on Windows using the C# client. We’re using memcache as our second level NHibernate cache (which may well be causing the issue, as I suspect it may be chatty). Our latency spikes (as shown by New Relic) are periodic and are so regular that you could set a clock by them. Some of our clients hit RSS feeds coming out of the site that are stored in memcache as well (not using NHibernate). These feeds are cached for fifteen minutes, but we also use ASP.NET output caching. We’re not sure how we’re going to handle it yet.

    One thing that intrigued me though. I noticed that this discussion is fairly recent. Is it possible that New Relic made a UI change that is leading us to believe there is a problem when there isn’t (or, alternatively, they recently pushed a code change that revealed this particular class of problems more clearly)? Usually, when I find a thread like this it is several months stale, so it seems odd to me.

  • http://charsyam.wordpress.com charsyam

    Hi, did you check the network position? I’ve had experience with this: one of my test servers responded in around 15ms, but the real servers responded in less than 1ms. In my case I used Moxi as well, and its response time is usually amazing. Also, did you check memory status? When Memcached starts swapping, it can be slow.

  • http://gravatar.com/nevtech Troels H

    Where are your monitoring graphs etc. taken from? Looks really clean and simple!

    • https://blog.serverdensity.com David Mytton

      That’s from New Relic.