November 30, 2006

Design to fail

Last night I went to an SDForum talk by two eBay architects, Randy Shoup and Dan Pritchett, on how they built, scaled and run their operation. The talk didn't cover anything substantially different from what I've heard before, but it was still impressive because they apply this common thinking to an operation that runs over 15,000 servers at any given time. [ Slides ]
Here are a few interesting phrases I took away from the talk.

  • Scale out not up: Scaling up is not only expensive, it also becomes impossible once you hit the technical limits of a single machine. Scaling out, however, is cheaper and practical.

  • Design to fail: Every QA team I know does a whole batch of tests to make sure all components work as they should. Rarely have I seen a team that also tests whether the servers stay up when certain parts of the application fail.

  • If you can't split it, you can't scale it: eBay realized early on that anything which cannot be split into smaller components can't be scaled. A good example of such an operation is a "join" across multiple tables in a database. Relying on the database to do joins across a large set of tables means that you can never partition those tables into different databases. And if you can't split it, you will have t

  • Virtualize components: If components are virtualized, with an abstraction layer to manage the virtual components, then the rest of the application need not worry about the actual server names, database names, table names, etc. The operations team can move components around to suit scalability needs.
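
To make the last two points a bit more concrete, here is a rough Python sketch of my own. It is not anything the eBay folks showed. It assumes a made-up mapping (SHARD_MAP) from logical table names to physical hosts, uses in-memory dictionaries (FAKE_HOSTS) as stand-ins for real database servers, and does the "join" in application code instead of in the database. All host names, table names and function names are hypothetical.

```python
import zlib

# Hypothetical logical-to-physical map. The application only knows the
# logical names ("users", "items"); the host names are made up.
SHARD_MAP = {
    "users": ["db-host-a", "db-host-b"],
    "items": ["db-host-c", "db-host-d", "db-host-e"],
}

# In-memory stand-ins for real database servers, keyed by physical host.
FAKE_HOSTS = {host: {} for hosts in SHARD_MAP.values() for host in hosts}


def locate(logical_name, key):
    """Return the physical host that holds `key` for a logical table."""
    hosts = SHARD_MAP[logical_name]
    # crc32 is stable across runs, unlike Python's built-in hash() on str.
    return hosts[zlib.crc32(str(key).encode()) % len(hosts)]


def put(logical_name, key, row):
    """Store a row on whichever host the map points to."""
    FAKE_HOSTS[locate(logical_name, key)][(logical_name, key)] = row


def get(logical_name, key):
    """Fetch a row, or None if it does not exist."""
    return FAKE_HOSTS[locate(logical_name, key)].get((logical_name, key))


def items_with_sellers(item_ids):
    """Application-side 'join': fetch each item, then look up its seller.

    Because no database join is involved, "items" and "users" are free
    to live on different hosts.
    """
    result = []
    for item_id in item_ids:
        item = get("items", item_id)
        if item is None:
            continue
        seller = get("users", item["seller_id"])
        result.append({
            "item": item["title"],
            "seller": seller["name"] if seller else None,
        })
    return result


if __name__ == "__main__":
    put("users", 1, {"name": "alice"})
    put("items", 100, {"title": "lamp", "seller_id": 1})
    print(items_with_sellers([100]))
```

Callers never mention a host name, so in principle only SHARD_MAP has to change when a table moves. In a real system the data would of course have to be migrated along with the map, and plain modulo hashing reshuffles most keys whenever the number of hosts changes, so treat this as an illustration of the idea and nothing more.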