November 22, 1998

When cache means cash

When Gordon Moore first coined the famous ``Moore's Law'', stating that processor power will double every eighteen months, only a few believed him. Interestingly, the law still holds in 1998. However, if I now tell you that the Internet is doubling every four months, you would again disbelieve me. But this is a fact that everyone in the Internet industry has to live with. And if you plan to jump into this wild ISP industry, you had better get your dirty work of planning done before you start.

The nineties have been the decade of HTTP and the WWW. Starting with the first browser at CERN in 1991, the protocol has been particularly instrumental in the success of the Internet around the world. And as this phenomenon grows at over 10 times a year, the only thing that limits it is the infrastructure on which it runs. VSNL itself, which has been on the Indian ISP scene for around three years, has grown dramatically from a few 64 Kbps links to more than 125 Mbps today, and plans to buy a couple of hundred Mbps more in the coming months.

VSNL's realization that users in India make heavy use of the microsoft.com site resulted in its setting up a Microsoft mirror site right here in India. But of course VSNL can't afford to set up a mirror site for every popular web-site on the Internet.

A study on the Internet shows that about 30 to 70 percent of the Internet traffic generated by a homogeneous client population could be redundant. For example, all of my friends, and I'm sure most of you too, would have search engines like altavista.digital.com or yahoo.com as the ``Default Home Page'' on your browser. And every time your browser opens up, the site pops up on your screen. Wouldn't it be great to have an ISP intelligent enough to remember the often-downloaded sites on the Internet, so that it needn't retrieve them every time from that server in the US? This is what ``Internet Caching'', or the ``Caching Technology'', is all about.
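To put the 30 to 70 percent figure in perspective, here is a rough back-of-the-envelope sketch. It assumes, optimistically, that every redundant request can be answered from a local cache; the function name and the 2048 Kbps demand figure are made up for illustration.

```python
def upstream_demand_kbps(client_demand_kbps, redundant_fraction):
    """Upstream bandwidth still needed if all redundant traffic is served locally.

    Assumption: redundant_fraction is the share of client traffic that a
    perfect cache could answer without going upstream.
    """
    if not 0.0 <= redundant_fraction <= 1.0:
        raise ValueError("redundant_fraction must be between 0 and 1")
    return client_demand_kbps * (1.0 - redundant_fraction)


# Hypothetical ISP whose users generate 2048 Kbps of web traffic.
for fraction in (0.30, 0.70):
    remaining = upstream_demand_kbps(2048, fraction)
    print(f"redundancy {fraction:.0%}: about {remaining:.0f} Kbps still goes upstream")
```

A real cache will not catch every redundant request, so actual savings would be lower than this upper bound.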

The first form of caching started with the Internet browsers, which began keeping a copy of each site in a cache directory on the client itself for faster loading of pages that are accessed repeatedly. This was an interesting achievement in itself, and it improved a bit more with the second generation of cache software.

The Internet ``Proxy'' servers as we know them today were initially designed to act as a proxy for a group of clients hiding behind them, so that a single Internet connection could be shared by the whole group for browsing. But slow Internet connectivity forced the creators of these proxy servers to maintain a second-level cache for all the browsers behind them. Unlike the simpler client-end caching, these cache engines were much more advanced in their ability to predict user preferences, and they periodically flush out unused pages to optimize cache performance.
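How a cache decides which unused pages to flush varies from product to product. As one common illustration, and not a description of any particular proxy, here is a minimal sketch of least-recently-used (LRU) eviction. The class name `LRUPageCache` and its methods are hypothetical.

```python
from collections import OrderedDict


class LRUPageCache:
    """Keeps at most max_pages pages and evicts the least recently used one."""

    def __init__(self, max_pages):
        self.max_pages = max_pages
        self._pages = OrderedDict()  # url -> page body, oldest first

    def __contains__(self, url):
        return url in self._pages

    def get(self, url):
        """Return the cached body, or None on a miss."""
        if url not in self._pages:
            return None
        self._pages.move_to_end(url)  # mark as most recently used
        return self._pages[url]

    def put(self, url, body):
        """Store a page, flushing the stalest pages if the cache is full."""
        self._pages[url] = body
        self._pages.move_to_end(url)
        while len(self._pages) > self.max_pages:
            self._pages.popitem(last=False)  # drop the least recently used


cache = LRUPageCache(max_pages=2)
cache.put("http://www.yahoo.com/", "yahoo page")
cache.put("http://altavista.digital.com/", "altavista page")
cache.get("http://www.yahoo.com/")             # yahoo is now the freshest
cache.put("http://www.microsoft.com/", "ms page")
print("http://altavista.digital.com/" in cache)  # altavista was flushed
```

Real cache engines typically also weigh object size and expiry time; this sketch looks only at how recently a page was used.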

However, it was programmers and not scientists who designed these proxy cache engines. The work on the ``cache'' module for these products was too early to be called an achievement. But as people around the world started analyzing httpd log files from web servers and routers, predicting the traffic from a particular segment of a homogeneous population was soon taken up as a mathematical challenge by adventurous scientists all over the world. Starting with the CERN httpd cache and later with the ``HARVEST project'', the mathematical concepts started taking practical shape. Today there are more than 30 caching products available, many of which are customized versions of SQUID, which was itself derived from the HARVEST project.

While Netscape and Microsoft sell their own versions of software cache engines, companies like Network Appliance, CISCO and Skycache have taken this a step further with hardware products based on the same idea: ready-to-use, out-of-the-box appliances that do heavy-duty traffic caching.

However, when we talk about ISPs, we have to separate the men from the boys. Most of the caching products are designed with small-scale networks, or at most large enterprises, in mind. Rules and mechanisms for optimal use of the cache can be enforced in these environments. But ISPs providing services to consumers may not be in a position to make each user adhere to standards or policies that might be essential for implementing the cache services.

Both the Microsoft and Netscape cache software, which are designed with the enterprise in mind, share one big drawback: they force the user to manually set a proxy server address in the browser. To some this is not a problem, as it is a one-time setting on the client; however, new ISP users in India already have enough to learn. On the other hand, products from CISCO, Inktomi and Cobaltnet work so transparently for the end user that in most cases the clients may never know of their existence at all.

Products from CISCO, Cacheflow, Packetstorm and Cobaltnet are specialized hardware boxes, or appliances, optimized for caching. With a high-speed storage device and heavy-duty networking performance, these appliances can actually be placed in front of the router to filter all the traffic passing through it. For even better performance you may prefer CISCO or Cacheflow over the others, because they do not depend on an ``OS'' to run on. When an application talks directly to the hardware, the performance increase is dramatic.

If you already have surplus hardware in your organization, and if you are not willing to spend big money on hardware you are not sure of, then you may look at using squid or other products from Inktomi, Digital or Sun. Though you may not get performance as good as that of a cache appliance, one big benefit of a software-based cache engine is that you can easily upgrade your hardware resources without being forced to buy the whole box all over again.

To select a cache product, ISPs need to make some basic technical decisions, which would not only have a long-lasting impact on how they later administer and maintain the network but may also affect their business practice a bit. But the bottom line for all these decisions is the financial implication they would have.

The default mechanism for implementing a cache box is the ``http proxy'' protocol, the same one used in proxy software. However, as I said before, not all ISPs would like to ask their users to manually configure a proxy server address in their browsers. But if this option is left to the users, those who do use this mechanism may find that network access performance improves dramatically. The other mechanism is ``Transparent caching'', which does caching without the user making any configuration on the client side. This of course means that the user does not have the option of switching over to a non-caching mode. Most of the hardware-based cache boxes support this, and so do some of the software caches. The cache automatically detects http requests in the traffic, transparently checks for the page in its storage, and passes back the page if found; otherwise it retrieves the page for the client. A major drawback of this mechanism is that failure of this box can result in a total halt of all WWW-related services.
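The check-then-fetch behaviour described above can be sketched in a few lines. This is only an illustration of the request path, not how any of the products named here is built. `CachingFetcher` and `fake_origin` are hypothetical names, and the origin server is simulated so that the example runs without a network.

```python
class CachingFetcher:
    """Answers from local storage when possible, else fetches and stores."""

    def __init__(self, fetch_from_origin):
        self._fetch = fetch_from_origin  # callable: url -> page body
        self._store = {}                 # url -> cached page body
        self.hits = 0
        self.misses = 0

    def get(self, url):
        if url in self._store:
            self.hits += 1               # page found: pass it straight back
            return self._store[url]
        self.misses += 1                 # not found: retrieve it for the client
        body = self._fetch(url)
        self._store[url] = body
        return body


origin_calls = []

def fake_origin(url):
    """Stand-in for the distant web server; records every real fetch."""
    origin_calls.append(url)
    return f"<html>content of {url}</html>"


fetcher = CachingFetcher(fake_origin)
fetcher.get("http://www.yahoo.com/")
fetcher.get("http://www.yahoo.com/")     # second request never leaves the ISP
print(fetcher.hits, fetcher.misses, len(origin_calls))
```

Whether the request reaches this logic through a configured proxy address or through transparent interception, the lookup itself is the same; only the way traffic is steered to the cache differs.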

For bigger ISPs, the lingering question would be how to go about implementing multiple cache engines on multiple gateways working in tandem. It would be a simple waste of resources if all the caches in your organization kept their own copies of the same documents. Depending on which cache engine you are using, most of them have protocols that allow multiple caches to interact and share information and cached content in some form. Going a step further, one could also look for protocols that allow cached information to be shared between different cache boxes, in case one wants to share with the cache boxes installed by the upstream ISP. This concept of cache sharing is a phenomenon fast spreading among ISPs on the Internet, an example of which is IRCACHE (http://ircache.nlanr.net). But the prerequisite for this kind of caching is the ability to speak the popular tongue of the other caches. ICP (Internet Cache Protocol) v2 is the latest and probably the most widely supported protocol around. Most SQUID-based software relies on this protocol to request pages (aka objects) from other caches. CARP, which is now supported by Microsoft Proxy (v2), is another protocol that might catch on soon. CISCO primarily uses WCCP, which is proprietary as of today. However, CISCO can also talk to squid and other caches using ICP. If an ISP is looking for such an inter-ISP agreement, then it's best for it to check the protocol support in the cache boxes.
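One way to stop several caches from each holding the same document is to decide, by hashing, which cache ``owns'' each URL. CARP works along these lines. The sketch below shows the general idea only: it uses SHA-256 for convenience, which is not the hash function CARP specifies, and the cache names are invented.

```python
import hashlib


def pick_cache(url, caches):
    """Return the cache responsible for url: the one with the highest score.

    Each (cache, url) pair gets a pseudo-random score from a hash, so every
    gateway computes the same owner without talking to the others.
    """
    def score(cache):
        digest = hashlib.sha256(f"{cache}|{url}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    return max(caches, key=score)


caches = ["cache-a.example.net", "cache-b.example.net", "cache-c.example.net"]
for url in ("http://www.yahoo.com/", "http://altavista.digital.com/"):
    print(url, "->", pick_cache(url, caches))
```

A useful property of this scheme is that when one cache is removed, only the URLs it owned move elsewhere; everything else stays where it was. ICP takes a different approach: a cache asks its neighbours whether they hold an object before going to the origin server.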

Now that I have mentioned CISCO, I would like to impress upon you that though the CISCO Cache engine is a bit new to the industry, it has one neat feature which others don't have yet. CISCO's WCCP protocol is built right into the 7000 series of CISCO routers, and will soon be available on the 3600s and 4500s. The presence of this protocol in the routers helps them transparently re-route traffic to CISCO Cache engines, unlike other boxes, which must be placed between the client and the router. Apart from this rerouting feature, the CISCO routers also allow you to transparently shut down the caching mechanism, or reroute traffic to other caches, in case a cache fails. The WCCP protocol has recently been released for commercial products, but it might take a little while before CISCO WCCP is supported by other cache vendors.

The final pieces of information you may like to research are the user interface and the algorithm a cache engine uses to detect redundant data (hit rate). A small public domain SQUID cache running on Linux on a Pentium may be good enough for a small ISP with 50 to 100 simultaneous users. But if technical manpower is costly and R&D is not something you would like to invest in, you may look at the commercial products I mentioned earlier in the article.
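When comparing products on hit rate, it helps to separate the share of requests served from the cache from the share of bytes served from it, since the latter is what saves bandwidth. Here is a minimal sketch, assuming each log record has already been reduced to a (was_hit, size_in_bytes) pair; that record format is an assumption and not the log format of any particular cache.

```python
def hit_rates(records):
    """Return (request hit rate, byte hit rate) for (was_hit, size) records."""
    requests = hits = total_bytes = hit_bytes = 0
    for was_hit, size in records:
        requests += 1
        total_bytes += size
        if was_hit:
            hits += 1
            hit_bytes += size
    if requests == 0:
        return 0.0, 0.0
    byte_rate = hit_bytes / total_bytes if total_bytes else 0.0
    return hits / requests, byte_rate


# Made-up sample: small pages hit the cache, large downloads miss it.
sample = [(True, 1000), (False, 3000), (True, 1000), (False, 5000)]
request_rate, byte_rate = hit_rates(sample)
print(f"request hit rate {request_rate:.0%}, byte hit rate {byte_rate:.0%}")
```

In this sample half of the requests are hits, yet only a fifth of the bytes are saved, which is why a vendor's headline hit rate deserves a second look.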

Caching technology also has its set of problems. The most prominent is the possibility that the data retrieved by the browser might be old. Adding to this problem is the fact that web-sites don't properly use expiry stamps on their Web pages. Another problem concerns copyright laws, which prohibit keeping copies of information in print or digital form. There have also been cases where organizations have sued ISPs for accidentally blocking them out from the Internet, which resulted in loss of revenue. The final and probably the most frightening problem of caching is the ability of the caching software to keep copies of confidential information (including information passing through SSL) sent between clients and servers, which can easily be compromised by people who have access to it.
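The stale-data problem comes down to a freshness check like the one sketched below, which reads the HTTP ``Expires'' header of a cached page. The function name `is_fresh` is hypothetical, and treating a missing or unreadable header as ``not fresh'' is a conservative assumption made for this example; real caches usually fall back on heuristics instead.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def is_fresh(expires_header, now=None):
    """True if the cached page's Expires date is still in the future."""
    if not expires_header:
        return False                      # no expiry stamp: assume stale
    try:
        expires_at = parsedate_to_datetime(expires_header)
    except (TypeError, ValueError):
        return False                      # unreadable expiry stamp: assume stale
    if expires_at.tzinfo is None:
        expires_at = expires_at.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return now < expires_at


checked_at = datetime(1998, 11, 22, 11, 0, tzinfo=timezone.utc)
print(is_fresh("Sun, 22 Nov 1998 12:00:00 GMT", now=checked_at))
print(is_fresh(None, now=checked_at))
```

When a site omits the header, the cache has to guess how long the page stays valid, and a wrong guess is exactly how users end up seeing old data.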

But in the end, all that matters is how financially attractive these caches are compared to normal connections. Looking at the infrastructure in India, I see these products becoming very popular very soon.

