Design to fail

Last night I went to an SDForum talk by two eBay architects Randy Shoup and Dan Pritchett on how they built, scaled and run their operation. The talk didn't have anything substantially different from what I've heard before, but was still impressive because they were applying some of the common thinking to their operations which runs over 15000 servers any given time. [ Slides ]
Here are a few interesting phrases I took away from the talk.

  • Scale out not up: Scaling up is not only expensive, it will also become impossible beyond a certain technical limitation. Scaling out, however is cheaper and practical.

  • Design to fail: Every QA team I know, do a whole batch of tests to make sure all components work as they should. Rarely have I seen a team which also does testing to see whether the servers stay up if certain parts of the application fail.

  • If you can't split it, you can't scale it: Ebay realized early on that anything which cannot be split into smaller components can't be scaled. A good example of such operation are the "joins" on multiple tables in a database. Relying on database to do joins across a large set of tables means that you can never partition those tables into different databases. And if you can't split it, you will have t

  • Virtualize components: If they can virtualize it, and create an abstraction layer to take care of these virtual components, then rest of the application need not worry about the actual server names, database names, table names etc. The Operations team can move components around to suite scalability needs.


[...] There is no silver bullet when it comes to solving problems for distributed system. Scalability is no difference. Royans suggests: If you need scalability urgently, vertical scalability or scaling up is the easiest choice. Unfortunately Vertical scaling, gets more and more expensive as you grow. On the other hand, horizontal scalability, or scaling out, doesn’t require you to buy more and more expensive servers. But it isn’t cheaper either. The application has to deal with problems such as “Split brain” and “hardware failure“. [...]
[...]   横向扩展,则不一定要相当数量的昂贵服务器。因为它可以通过普通的机器和存储硬件以达到规模效应来解决问题,像早期的Yahoo!,Google都是这样。横向扩展不仅仅是便宜,它是把应用程序构建在多个服务器组成的一个大server的基础上,这样就会随之有两个常见的问题: “Split brain“(功能,数据分割) 和 “hardware failure“(硬件故障,毕竟机器太多了)。 [...]

Popular posts from this blog

Latest Global COVID-19 stats

Brewers CAP Theorem on distributed systems

The pain of Load balancing applications