Design to fail

Last night I went to an SDForum talk by two eBay architects, Randy Shoup and Dan Pritchett, on how they built, scaled and run their operation. The talk didn't cover anything substantially different from what I've heard before, but it was still impressive because they apply this common thinking to an operation that runs over 15,000 servers at any given time. [ Slides ]
Here are a few interesting phrases I took away from the talk.

  • Scale out not up: Scaling up is not only expensive, it also becomes impossible beyond a certain technical limit. Scaling out, however, is cheaper and practical.

  • Design to fail: Every QA team I know runs a whole batch of tests to make sure all components work as they should. Rarely have I seen a team that also tests whether the servers stay up when certain parts of the application fail.

  • If you can't split it, you can't scale it: eBay realized early on that anything which cannot be split into smaller components can't be scaled. A good example of such an operation is a join across multiple tables in a database. Relying on the database to do joins across a large set of tables means that you can never partition those tables into different databases. And if you can't split it, you will have t

  • Virtualize components: If they can virtualize a component and create an abstraction layer to manage these virtual components, then the rest of the application need not worry about the actual server names, database names, table names etc. The Operations team can move components around to suit scalability needs.
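
Here is a minimal sketch, in Python, of how the last three ideas might fit together. None of it comes from the eBay talk: the `Shard`, `Router` and `item_page` names, the modulo partitioning and the host names are all hypothetical, and the "databases" are just in-memory dictionaries. The application asks for logical names ("users", "items"), does the join itself instead of asking the database to, and keeps serving a degraded page when one shard is down.

```python
from dataclasses import dataclass, field


class ShardDown(Exception):
    """Raised when a (simulated) database host is unavailable."""


@dataclass
class Shard:
    """In-memory stand-in for one physical database host (hypothetical)."""
    host: str
    rows: dict = field(default_factory=dict)
    up: bool = True

    def get(self, key):
        if not self.up:
            raise ShardDown(self.host)
        return self.rows.get(key)


class Router:
    """Maps logical table names to physical shards, so callers never see host names."""

    def __init__(self, mapping):
        self.mapping = mapping  # logical name -> list of Shard objects

    def shard_for(self, logical_name, key):
        shards = self.mapping[logical_name]
        # Simple modulo partitioning; this choice is an assumption for the sketch.
        return shards[key % len(shards)]

    def get(self, logical_name, key):
        return self.shard_for(logical_name, key).get(key)


def item_page(router, item_id):
    """Join item and seller in the application rather than in the database."""
    item = router.get("items", item_id)
    if item is None:
        return None
    try:
        seller = router.get("users", item["seller_id"])
    except ShardDown:
        seller = None  # design to fail: show the item without seller details
    name = seller["name"] if seller else "unavailable"
    return {"title": item["title"], "seller": name}


if __name__ == "__main__":
    users = [Shard("db-u0"), Shard("db-u1")]
    items = [Shard("db-i0"), Shard("db-i1")]
    router = Router({"users": users, "items": items})

    router.shard_for("users", 7).rows[7] = {"name": "alice"}
    router.shard_for("items", 42).rows[42] = {"title": "lamp", "seller_id": 7}

    print(item_page(router, 42))

    # Simulate a failed user shard and check the page still renders.
    router.shard_for("users", 7).up = False
    print(item_page(router, 42))
```

Because only the `Router` knows which host holds which rows, an operations team could repartition or move a shard by changing the mapping, without touching `item_page`. Flipping `up` to `False` is the kind of failure test the "design to fail" point asks for.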

Comments

[...] There is no silver bullet when it comes to solving problems for distributed systems. Scalability is no different. Royans suggests: If you need scalability urgently, vertical scalability, or scaling up, is the easiest choice. Unfortunately, vertical scaling gets more and more expensive as you grow. On the other hand, horizontal scalability, or scaling out, doesn’t require you to buy more and more expensive servers. But it isn’t cheaper either. The application has to deal with problems such as “split brain” and “hardware failure”. [...]
[...] Horizontal scaling, by contrast, does not necessarily require a large number of expensive servers, because it can solve the problem by reaching economies of scale with ordinary machines and storage hardware, as early Yahoo! and Google did. Horizontal scaling is not only cheap; it builds the application on top of one large server made up of many servers, and that brings two common problems: “split brain” (partitioning of functions and data) and “hardware failure” (hardware breaks, since there are so many machines). [...]
