ETags and load balancers

A few weeks ago the company I work with noticed a weird problem with its CDN (Content Delivery Network) provider: the CDN edge nodes were answering HEAD requests from cached objects that had already expired. What’s worse, even after an explicit content expiry notification was sent, the HEAD responses were still wrong. Long story short, the CDN provider had to set up rules so that HEAD requests always bypass the cache. This added a slight performance overhead, but the workaround solved the problem.

Now while this was going on, one of the guys at CDN support who was helping us mentioned ETags and said we should be using them. I didn’t understand how ETags would solve the problem if the CDN itself had a bug that ignored expiry information, but I said I’d investigate.

Anyway, the traditional way of letting a cache check whether its copy of an object is still current is the Last-Modified timestamp. An ETag is another way of doing that, except that it can be more precise, since Last-Modified only has one-second resolution.
A little more digging showed that the ETag is not necessarily a hash of the file’s contents. In a common default web server setup (Apache’s older defaults, for example) it is a combination of the file’s inode number, file size and last-modified timestamp. This is definitely more precise, and I could see why it might be better than the last-modified timestamp alone. But what the CDN support guy didn’t mention is that if you serve content from multiple web servers, even if you rsync the content between them, the ETags for the same file will almost always differ from server to server, because rsync and other standard copy commands have no control over the inode number the destination filesystem assigns.
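To make the inode problem concrete, here is a minimal Python sketch. It is not how any particular web server builds its ETags; the function names and the hex `inode-size-mtime` layout are assumptions made for illustration. Two files in a temporary directory stand in for the same page on two servers.

```python
import os
import shutil
import tempfile


def inode_style_etag(path):
    """Build an ETag-like string from inode, size and mtime (hypothetical format)."""
    st = os.stat(path)
    mtime_us = st.st_mtime_ns // 1000  # modification time in microseconds
    return '"%x-%x-%x"' % (st.st_ino, st.st_size, mtime_us)


def demo_copy_etags():
    """Copy a file, keeping its mtime, and return the ETag of each copy."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "server_a.html")
        replica = os.path.join(tmp, "server_b.html")
        with open(original, "wb") as f:
            f.write(b"<h1>hello</h1>\n")
        # copy2 preserves content and modification time, but the copy
        # is a new file and therefore gets its own inode number.
        shutil.copy2(original, replica)
        return inode_style_etag(original), inode_style_etag(replica)


if __name__ == "__main__":
    etag_a, etag_b = demo_copy_etags()
    print(etag_a)
    print(etag_b)
```

The two printed values share the size field but differ in the inode field, so a client or CDN that validates against one copy gets a mismatch from the other, even though the content is byte-for-byte identical.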

A little more searching on the net confirmed that this is a known problem, and that ETags should probably be turned off (or configured so that they don’t use inodes) on servers behind load balancers.
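As a rough illustration of the "don’t use inodes" option, the Python sketch below builds an ETag-like string from size and modification time only, which is similar in spirit to Apache’s `FileETag MTime Size` setting. The function names and the exact format are hypothetical, and the sketch assumes the copy preserves modification times (as `rsync -a` does); if it does not, the values will still differ.

```python
import os
import shutil
import tempfile


def portable_etag(path):
    """ETag-like string from size and mtime only, no inode (hypothetical format)."""
    st = os.stat(path)
    mtime_us = st.st_mtime_ns // 1000  # modification time in microseconds
    return '"%x-%x"' % (st.st_size, mtime_us)


def etags_for_copy(content):
    """Write content, copy it with its mtime preserved, return both ETags."""
    with tempfile.TemporaryDirectory() as tmp:
        original = os.path.join(tmp, "server_a.html")
        replica = os.path.join(tmp, "server_b.html")
        with open(original, "wb") as f:
            f.write(content)
        # copy2 keeps the modification time, standing in for a sync
        # between servers that preserves timestamps.
        shutil.copy2(original, replica)
        return portable_etag(original), portable_etag(replica)


if __name__ == "__main__":
    etag_a, etag_b = etags_for_copy(b"<h1>hello</h1>\n")
    print(etag_a == etag_b)
```

The trade-off is that size plus mtime is a weaker fingerprint than one that includes the inode: two different versions with the same size and timestamp would look identical. For content that is synced between servers, that is usually an acceptable price for getting the same ETag everywhere.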