The Reddit problem: Learning from mistakes

Reddit has a very interesting post about what not to do when trying to build a scalable system. While the error is tragic, I think its an excellent design mistakes to learn from.

Though the post lacked detailed technical report, we might be able to recreate what happened. They mentioned they are using MemcacheDB datastore, with 2GB RAM per node, to keep some data which they need very often.

Dont be confused between MemcacheDB and memcached. While memcached is a distributed cache engine, MemcacheDB is actually a persistent datastore. And because they both use the same protocol, applications often use memcached libraries to connect to MemcacheDB datastore. redditheader

Both memcached and MemcacheDB rely on the clients to figure out how the keys are distributed across multiple nodes. Reddit chose MD5 to generate the keys for their key/value pairs. The algorithm Reddit used to identify which node in the cluster a key should go to could have been dependent on the number of nodes in the system. For example one popular way to identify which node a key should be on would be to use the “modulo” function. For example key “k” could be stored on the node “n” where “n=k modulo 3”. [ If k=101, then n=2 ]

image

Though MemcacheDB uses BDB to persist data, it seems like they heavily relied on keeping all the data in RAM. And at some point they might have hit the upper limit on what could be cached in RAM which caused disk i/o which resulted in slower response times. In a scalable architecture one should have been able to add new nodes and the system should have been able to scale.

Unfortunately though this algorithm works beautifully during regular operation, it fails as soon as you add or remove a node (when you change n). At that point you can’t guarantee that all the data you previously stored on node n would still be on the same node.

And while this algorithm may still be ok for “memcached” cache clusters, its really bad for MemcacheDB which requires “consistent hashing”.

[RedditArchDiagramWhiteBG.png]

Reddit today announced that they have increased RAM on these MemcacheDB servers from 2GB to 6GB, which allows 94% of their DB to be kept in memory. But they have realized their mistake (they probably figured this out long time back) and are thinking about how to fix it. The simplest solution of adding a few nodes requires re-hashing their keys which would take days according to their estimate. And of course just adding nodes without using some kind of “consistent hashing” is still not a scalable solution.

I personally learnt two things

  • Dont mix MemcacheDB and memcached. They are not designed to solve the same problem.
  • Don’t just simply replace memcached with MemcacheDB without thinking twice

There are many different products out there today which do a better job at scaling, so I won’t be surprised if they abandon MemcacheDB completely as well.

Share:
  • Digg
  • del.icio.us
  • Facebook
  • Google Bookmarks
  • DZone
  • HackerNews
  • Reddit
  • RSS
  • StumbleUpon
  • Suggest to Techmeme via Twitter
  • Twitter
  • FriendFeed
  • Slashdot
  • email

Related posts:

  1. Scaling updates for Feb 10, 2010
  2. Product: Northscale memcached and membase server
  3. You don’t have to be Google to use NoSQL
  4. Brewers CAP Theorem on distributed systems

You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

12 Tweets 2 Comments

6 Comments »

 
  • Igor says:

    I did not quite get Reddit mistake, what stoped them from using libketama as a consistent hashing for MemcacheDB?
    It’s part of almost every memcached client now, so undoubtedly they must have had it.

  • Royans says:

    Based on what I see, I think Reddit datastore predates libketama, and that they knew about this issue long time ago. BTW Reddit blog has entries from 2005 and looks like libketama was announced in 2007. But thats just a guess and would love for someone to confirm it.

  • Hey there. I’m the guy that wrote the post. Your assessment is spot on.

    We didn’t use libketama because it didn’t exist at the time (a few years ago) and also because it just wasn’t something people thought about. We could switch to it now, but as you said, we are going to drop memcacheDB, so there is no need to “fix” it. Also, the version of the memcached protocol that is the frontend for memcacheDB is old, and doesn’t support the binary protocol. Most of the memcached libraries that support ketama require the binary protocol (not all of them).

    So yeah, it looks like we’ll probably go to Cassandra in the very near feature.

  • Royans says:

    Jereny, Thanks for the comments. I’m not surprised by the plan to move to Cassandra. Best of luck !

  • antirez says:

    I’ve mixed feelings about this mistake.Even if your error was not using consistent hashing, what’s preventing you to modify the program in order to use two caching systems for a limited period (the time needed to rebuild the cache)?

    So you can use a better distribution strategy, and finally shut down the old cache and use the new.

    This comment was originally posted on Hacker News

  • rkt says:

    I think Reddit will be doing something like that. Jereny from Reddit responded to the thread and mentioned they might be moving to cassandra. Its however not a trivial task for a site with that kind of load and data to cutover :)

    This comment was originally posted on Hacker News

 

Leave a Reply

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Powered by WP Hashcash

Additional comments powered by BackType