August 12, 2007

ETags and loadbalancers

A few weeks ago the company I work with noticed a weird problem with its CDN (Content Delivery Network) provider: the CDN edge nodes were answering HEAD requests from cached objects that had already expired. What's worse, even after an explicit content expiry notification was sent, the HEAD responses were still wrong. Long story short, the CDN provider had to set up bypass rules so that HEAD requests always bypass the cache. There was a slight performance overhead, but the workaround solved the problem.

Now while this was going on, one of the CDN support guys helping us mentioned ETags and said we should be using them. I didn't understand how ETags would solve the problem if the CDN itself had a bug that ignored expiry information, but I said I'd investigate.

Anyway, the traditional way of telling a cache whether an object has changed is the Last-Modified timestamp. An ETag is another way of doing that, except that it's more accurate.
A little more digging explained that the ETag is not a hash of the contents of the file, but (in a default Apache setup, at least) a combination of the file's inode, file size and last-modified timestamp. This is definitely more accurate, and I could see why it might be better than the last-modified timestamp alone. But what the CDN support guy didn't mention is that if you are serving content from multiple webservers, the ETags for the same file will almost always differ from server to server, even if you rsync the content between them, because rsync and the other standard copy commands have no control over the inode number used.

A little more searching on the net confirmed that this is a known problem and that ETags should probably be turned off (or configured so that they don't use inodes) on servers behind loadbalancers.
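To see the inode problem concretely, here is a minimal Python sketch. It assumes an ETag of the form inode-size-mtime in hex, which mimics the scheme described above rather than reproducing any server's exact format. The function names and file names are made up for illustration, and two files in one temporary directory stand in for the same file on two webservers.

```python
import os
import shutil
import tempfile


def etag_with_inode(path):
    """Build an ETag-like string from inode, size and mtime (illustrative format)."""
    st = os.stat(path)
    return '"%x-%x-%x"' % (st.st_ino, st.st_size, int(st.st_mtime))


def etag_without_inode(path):
    """Build an ETag-like string from size and mtime only."""
    st = os.stat(path)
    return '"%x-%x"' % (st.st_size, int(st.st_mtime))


def demo():
    """Copy one file to a second path and compare both ETag styles."""
    workdir = tempfile.mkdtemp()
    try:
        server_a = os.path.join(workdir, "server_a.html")
        server_b = os.path.join(workdir, "server_b.html")
        with open(server_a, "w") as f:
            f.write("<h1>hello</h1>")
        shutil.copyfile(server_a, server_b)

        # Give both copies the same mtime, as rsync -t would.
        fixed_mtime = 1000000000  # arbitrary timestamp chosen for the demo
        os.utime(server_a, (fixed_mtime, fixed_mtime))
        os.utime(server_b, (fixed_mtime, fixed_mtime))

        return {
            "with_inode_match":
                etag_with_inode(server_a) == etag_with_inode(server_b),
            "without_inode_match":
                etag_without_inode(server_a) == etag_without_inode(server_b),
        }
    finally:
        shutil.rmtree(workdir)


if __name__ == "__main__":
    print(demo())
```

The two copies have identical content, size and timestamp, yet the inode-based tags do not match, while the size-and-mtime tags do. On Apache, the `FileETag MTime Size` directive should give the same effect of leaving the inode out.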

Scaling PlentyOfFish.com

There is a very interesting interview with Markus Frind, the one-man army behind the website PlentyOfFish.com. The site boasts traffic higher than match.com, about 30 million page views a day, and runs on a single webserver with a couple of database servers. Markus has found interesting ways of surviving the different kinds of problems he ran into. Here is the direct link to the interview in WMV format.

Facebook internals

The code leaked during a Facebook bug was posted online by an anonymous user. Though the source itself didn't look very damaging, the leak did hurt the Facebook brand. I won't go into that in this post; instead I would like to discuss the Facebook internals that alex.moskalyuk touched upon.

Alex pointed out that this is not the only Facebook code we have seen. In fact, we already know a lot more about how Facebook works internally than most of us would learn from the source of the index.php published yesterday.

  1. PHP - This is no surprise. Though PHP is not developed at Facebook, Alex points out that Facebook developers are involved, at least at some level, in the development of PHP.

  2. Apache - Neither should this be.

  3. MySQL - Same here.

  4. Valgrind - This is a suite of tools for debugging and profiling Linux programs. With the tools that come with Valgrind, you can automatically detect many memory management and threading bugs, avoiding hours of frustrating bug-hunting and making your programs more stable. You can also perform detailed profiling to speed up your programs and reduce their memory use. Other related tools they use are callgrind/Calltree, KCachegrind and OProfile.

  5. APC - Facebook developers have talked about using Alternative PHP Cache in some presentations they have given in the past.

  6. Facebook Thrift - Thrift is a software framework for scalable cross-language services development. It combines a powerful software stack with a code generation engine to build services that work efficiently and seamlessly between C++, Java, Python, PHP, and Ruby. Thrift was developed at Facebook, and it has been released as open source. More information can be found in this whitepaper.

  7. Memcached - This is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load. Its use shouldn't come as a surprise, since most of the new Web 2.0 companies, especially the ones using PHP and Python, have experimented with it or implemented it at some level.

  8. phpsh - This is another interesting tool Facebook developers use internally. It is an interactive shell for PHP that features readline history, tab completion, and quick access to documentation. Ironically, it is written mostly in Python.

  9. Facebook has released a lot of code to support the Facebook platform and to get users to develop for it.

  10. Facebook Firefox plugin - This is the last one I'd like to mention here. Again, its source is open (you can see the code once you open up the plugin yourself).
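As a rough illustration of what "alleviating database load" in item 7 means, here is a minimal Python sketch of the usual look-in-the-cache-first pattern. It is not Facebook's code and does not talk to a real memcached server: `FakeCache` and `FakeDatabase` are hypothetical in-memory stand-ins, and `get_profile` is a made-up helper, so the example runs on its own.

```python
class FakeCache:
    """In-memory stand-in for a memcached client (hypothetical)."""

    def __init__(self):
        self._store = {}

    def get(self, key):
        # Returns None on a cache miss.
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value


class FakeDatabase:
    """Stand-in for a database that counts how often it is queried."""

    def __init__(self, rows):
        self._rows = rows
        self.query_count = 0

    def load_profile(self, user_id):
        self.query_count += 1
        return self._rows[user_id]


def get_profile(cache, db, user_id):
    """Return a profile, hitting the database only on a cache miss."""
    key = "profile:%d" % user_id
    profile = cache.get(key)
    if profile is None:
        profile = db.load_profile(user_id)
        cache.set(key, profile)  # later requests are served from memory
    return profile


if __name__ == "__main__":
    cache = FakeCache()
    db = FakeDatabase({1: {"name": "alice"}})
    get_profile(cache, db, 1)
    get_profile(cache, db, 1)
    print("database queries:", db.query_count)
```

Only the first request reaches the database; the second is answered from the cache. A real deployment would also need expiry and invalidation when the underlying row changes, which this sketch leaves out.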


References