September 28, 2007

Scalable products: KFS released

Kosmix, a search startup, has released the source to KFS, a C++ implementation of what looks like a clustered file system. It looks very similar to Hadoop's HDFS, but being written in C++ could give it a performance edge.

From the Skrenta blog:



    • Incremental scalability - New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.

    • Availability - Replication is used to provide availability due to chunk server failures.

    • Re-balancing - Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is done to help with balancing disk space utilization amongst nodes.

    • Data integrity - To handle disk corruptions to data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.

    • Client side fail-over - During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail-over to another chunkserver and continue the read. This fail-over is transparent to the application.

    • Language support - KFS client library can be accessed from C++, Java, and Python.

    • FUSE support on Linux - By mounting KFS via FUSE, this support allows existing Linux utilities (such as, ls) to interface with KFS.

    • Leases - KFS client library uses caching to improve performance. Leases are used to support cache consistency.




If anyone has experience with KFS, or has more information, please leave a comment here.

September 22, 2007

What is scalability?

When asked what they mean by scalability, a lot of people talk about improving performance, about implementing HA, or even talk about a particular technology or protocol. Unfortunately, scalability is none of that. Don't get me wrong. You still need to know all about speed, performance, HA technology, application platform, network, etc. But that is not the definition of scalability.

Scalability, simply, is about doing what you do in a bigger way. Scaling a web application is all about allowing more people to use your application. If you can't figure out how to improve performance while scaling out, that's okay. And as long as you can scale to handle a larger number of users, it's okay to have multiple single points of failure as well.

There are two primary ways of scaling web applications in practice today.

  • "Vertical Scalability" - Adding resource within the same logical unit to increase capacity. An example of this would be to add CPUs to an existing server, or expanding storage by adding hard drive on an existing RAID/SAN storage.

  • "Horizontal Scalability" - Adding multiple logical units of resources and making them work as a single unit. Most clustering solutions, distributed file systems, load-balancers help you with horizontal scalability.


Every component, whether it's a processor, a server, a storage drive or a load-balancer, has some kind of management/operational overhead. When you try to scale it, it's important to understand what percentage of the resource is actually usable. This measurement is called the "scalability factor". If you lose 5% of a processor's power every time you add a CPU to your system, your "scalability factor" is 0.95. A scalability factor of 0.9 means you will only be able to use 90% of the resource.
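
As a rough back-of-the-envelope illustration (my own sketch, not from any particular system), here is a small Python snippet that applies a per-unit scalability factor to estimate usable capacity; the simple multiplicative model is an assumption:

    # Hypothetical model of the "scalability factor" idea: every unit you add
    # contributes only `factor` of its nominal capacity to the total.
    def usable_capacity(units, factor, unit_capacity=1.0):
        return units * unit_capacity * factor

    for factor in (1.0, 0.95, 0.9):
        print(f"factor {factor}: {usable_capacity(8, factor):.1f} of 8 units usable")
    # factor 1.0: 8.0 of 8 units usable   (linear scalability)
    # factor 0.95: 7.6 of 8 units usable  (sub-linear)
    # factor 0.9: 7.2 of 8 units usable   (sub-linear)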

Scalability can be further sub-classified based on the "scalability factor".

  • If the scalability factor stays constant as you scale, it is called "linear scalability".

  • But chances are that some components may not scale as well as others. A scalability factor below 1.0 is called "sub-linear scalability".

  • Though rare, it's possible to get a better scalability factor just by adding more components (I/O across multiple disk spindles in a RAID gets better with more spindles). This is called "supra-linear scalability".

  • If the application is not designed for scalability, it's possible that things can actually get worse as it scales. This is called "negative scalability".


If you need scalability urgently, going vertical is probably going to be the easiest (provided you have the bank balance to go with it). In most cases, without changing a line of code, you might be able to drop your application onto a super-expensive 64-CPU server from Sun or HP with storage from EMC, Hitachi or Netapp, and everything will be fine. For a while at least. Unfortunately, vertical scaling gets more and more expensive as you grow.

Horizontal scalability, on the other hand, doesn't require you to buy more and more expensive servers; it's meant to be achieved with commodity storage and server solutions. But horizontal scalability isn't cheap either. The application has to be built from the ground up to run on multiple servers as a single application. Two interesting problems most applications in a horizontally scalable world have to worry about are "split brain" and "hardware failure".

While infinite horizontal linear scalability is difficult to achieve, infinite vertical scalability is impossible. If you are building capacity for a pre-determined number of users, it might be wise to investigate vertical scalability. But if you are building a web application which could be used by millions, going vertical could be an expensive mistake.

But scalability is not just about CPU (processing power). For a successful scalable web application, all layers have to scale equally: the storage layer (clustered file systems, S3, etc.), the database layer (partitioning, federation), the application layer (memcached, ScaleOut, Terracotta, Tomcat clustering, etc.), the web layer, the load balancer, the firewall, and so on. For example, if you don't have a way to add multiple load balancers to handle your future web traffic, it doesn't really matter how much money and effort you put into horizontal scalability of the web layer; your traffic will be limited to whatever your load balancer can push.

Choosing the right kind of scalability depends on how much you want to scale and spend. In fact, if someone says there is a "one size fits all" solution, don't believe them. And if someone starts a "scalability" discussion at the next party you attend, do ask them what they mean by scalability first.

References

  1. Cost and Scalability

  2. My linear scalability is bigger than yours

September 18, 2007

Scaling Smugmug from startup to profitability

Smugmug.com, a 5-year-old company with just 23 employees, has 315,000 paying customers and 195 million photographs. CEO & "Chief Geek" Don MacAskill has a nice set of slides where he talks about its 5-year journey from small startup to profitable business. The talk was given at Amazon's "Startup Project", so it focuses mostly on how Smugmug uses AWS (Amazon Web Services).

Besides his wonderful, fun-loving employees, who are also "superheroes" in his eyes, he talks about how their storage requirements are doubling every year. They already have about 300 TB in use, and as of today all of it is on Amazon's S3. Don estimates that, for the storage they are using, they are saving about 500K per year, which is pretty big for a small operation like theirs.

The Smugmug architecture has evolved over time. Internally it can serve images in 3 different ways.

  1. It could either "proxy" the request, where the app server would get the object from storage and serve it to the client.

  2. It could do a "redirect" where the app server could redirect the user to the right resource.

  3. And finally it could just serve the images directly from the storage using REST based APIs.


This flexibility allowed Smugmug to try different ways of using internal and external/S3 storage. Interestingly, even though they wanted to get away from self-managed internal storage, they noticed that S3 wasn't very cheap once you take bandwidth utilization into account. After various permutations and combinations, they settled on a system where S3 remains the primary storage, but roughly 10% of the hottest objects are kept locally on Smugmug's own servers to minimize bandwidth usage against S3. This allowed them to get away with not buying 95% of the storage drives they had originally planned to buy. 5% of 300 TB is still 15 TB, which is smaller, but unfortunately not small enough.
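
A minimal sketch of the general "keep the hottest objects locally, fall back to S3 for the rest" idea; this is my own illustration, not Smugmug's code, and `fetch_from_s3` is a placeholder for whatever S3 client you use:

    # Hypothetical LRU-style hot-object cache in front of S3.
    from collections import OrderedDict

    class HotObjectCache:
        def __init__(self, capacity, fetch_from_s3):
            self.capacity = capacity          # max number of locally kept objects
            self.fetch_from_s3 = fetch_from_s3
            self.local = OrderedDict()        # LRU order: oldest first

        def get(self, key):
            if key in self.local:             # hot object: serve from local storage
                self.local.move_to_end(key)
                return self.local[key]
            data = self.fetch_from_s3(key)    # cold object: pay the S3 bandwidth cost
            self.local[key] = data
            if len(self.local) > self.capacity:
                self.local.popitem(last=False)  # evict the least recently used object
            return data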

Though storage management is easy with S3, it can at times make things difficult. Permission management was one of those things: they had to sacrifice speed/performance in favour of the "proxy" mechanism, which is a more robust/reliable way of serving protected objects.

On reliability, Don mentioned that having multiple single points of failure is not really helpful if you want to provide near-100% availability. With S3 in the picture, not only do they have to worry about connectivity from the customer to Smugmug and Smugmug's own servers, they also have to worry about connectivity to Amazon and round-the-clock availability of Amazon's services. Hence they had to design their app from the ground up to handle failures gracefully. For example, a write failure to S3 is handled by recording the change locally and syncing it asynchronously later. At other times, when things break, the app is designed to either try again or alert the right folks so that action can be taken.
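
A rough sketch of the "record the change locally, sync asynchronously later" pattern described above; the function names (`s3_put`, `local_put`, `local_get`) are placeholders, and this is not Smugmug's actual implementation:

    # Hypothetical write-behind queue: write to S3, fall back to local storage
    # on failure, and let a background thread retry the sync later.
    import queue, threading, time

    pending = queue.Queue()

    def write_photo(key, data, s3_put, local_put):
        try:
            s3_put(key, data)                 # normal path: write straight to S3
        except Exception:
            local_put(key, data)              # S3 unreachable: keep a local copy
            pending.put(key)                  # remember to sync it later

    def background_sync(s3_put, local_get):
        while True:
            key = pending.get()               # blocks until there is work
            try:
                s3_put(key, local_get(key))   # retry the write against S3
            except Exception:
                pending.put(key)              # still failing: try again later
                time.sleep(30)

    # threading.Thread(target=background_sync, args=(s3_put, local_get), daemon=True).start()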

While talking about nice-to-have features, he mentioned that S3 shouldn't be confused with a CDN. It's not a distributed caching service, and it doesn't have global locations the way a CDN does. Regardless, he said, S3 probably should have limited caching or streaming ability, which would boost performance and add value to an already invaluable service.

On the topic of Smugmug's future, Don mentioned they are flirting with the possibility of using EC2 in the near future, probably as image-processing computation nodes. Since the EC2 servers are located in Amazon's facilities, it would save them the bandwidth of transferring data out of S3. And since they can turn server instances off on demand (and not pay for them while they are offline), it would probably cut down the operating cost of maintaining image-processing servers as well.

At the end he mentioned a few services he would like to see Amazon offer in the future. Two important ones (and ones I think are critical) are some kind of database service and load balancer APIs.



PDF Slides of the same presentation here (34 slides in this one)

September 15, 2007

Custom search engine to search your OPML and Delicious bookmarks

Zoppr is a custom search engine which allows you to create a custom Google search engine on the fly by appending the URL of your bookmark page, wiki page, or any other page with lots of interesting bookmarks/links on it. Once set up, Google will search only across your bookmarks/links. For example, this URL will help you search across an OPML file published somewhere on the internet: http://www.zoppr.com/cse/http://share.opml.org/

Scalable web architectures

If you haven't noticed already, there is a second blog I maintain which is currently busier than this one. "Scalable web architectures" is a collection of posts about web architectures that scale and the technologies that make it happen.

Here are some of the posts on that blog



    Adventures of scaling eins.de

    The eins.de site serves about 1.2 million dynamic pages a day. Patrick Lenz wrote a series of articles describing how they redesigned the site to scale for growth. I found these articles very informative, with an extremely mature discussion of the colorful world of scalability.

    Session, state and scalability




    If I could only give one recommendation to anyone building a brand new web application, I'd say "go stateless". But going stateless is not the same as going session-less. One could implement a perfectly stateless web architecture which still uses sessions to authenticate, authorize and track user activity. And to complicate matters further, when I say stateless, I really mean that the server should be stateless, not the client.

    Loadbalancer for horizontal web scaling: What questions to ask before implementing one.




    Loadbalancers, by definition, are supposed to solve performance bottlenecks by distributing or balancing load between the different components they manage. Though you would normally find loadbalancers in front of a webserver, a lot of people have found other interesting ways of using them.

Scalability Stories (15th Sept): Mysql Proxy, Cluster File Systems, Facebook apps and Twitter

There have been a lot of interesting stories from last week for me to share. If you have interesting links you want to add to this post please forward them to me or post a comment to this post.

  • Sun is planning to acquire a majority stake in "Cluster File Systems, Inc". [ Talk on Lustre File System ]

    • Sun intends to add support for Solaris Operating System (Solaris OS) on Lustre and plans to continue enhancing Lustre on Linux and Solaris OS across multi vendor hardware platforms. As previously announced in July 2007, Sun also plans to deliver Lustre servers on top of Sun's industry-leading open source Solaris ZFS solution, which is one of the fastest growing storage virtualization technology in the marketplace.



  • Making Facebook Apps scale on the cheap: an interesting writeup by Surj Patel about the scalability issues both Facebook itself and the 3rd-party apps on it have. It also discusses EC2 and S3 as an alternative way to scale cost-effectively.

    • Welcome to Amazon and S3 and EC2 — processing power (EC2) and storage (S3) on demand. These services let you access computational power and storage only when you need it and, better yet, pay only for what you use. The last time I checked, it was 10 cents an hour for the server, 10 cents for every gigabyte of data written and 18 cents per gigabyte read out – all for a virtual box with 1.7Ghz x86 processor/1.75Gbytes of RAM/250Mbs of bandwidth. Nor are you limited to one usage; use as many as you need or want and can afford.



  • Interesting blog post on Google Reader Numbers. They have made significant progress lately and thanks to the scalable architecture they now store 10 terabytes of raw feed data from 8 million feeds in their index.

  • Todd Hoff has an interesting writeup on Scaling Twitter: Making Twitter 10000 Percent Faster. And an interview with Biz Stone (Co-Founder of Twitter) here.

  • If you use Mysql and your app is not yet designed for a federated database architecture, you should take a look at a new product in development called "Mysql Proxy":
      The most powerful feature is Read/Write Splitting which allows you to scale an application which is unaware of replication automatically across several slaves without changes to your application. Instance Scale Out we say. The Proxy also became a 1st class citizen in the MySQL world with full docs, win32 support and easy to install.
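
    MySQL Proxy does this splitting transparently at the protocol level. As a rough illustration of the same idea done in application code (my own hypothetical sketch, using generic DB-API connections, not MySQL Proxy itself):

        # Hypothetical application-side read/write splitting.
        # Caveat: slaves may lag behind the master, so reads can be slightly stale.
        import random

        def run_query(sql, master, slaves):
            """Send writes to the master, spread reads across replication slaves."""
            verb = sql.lstrip().split(None, 1)[0].upper()
            is_write = verb in ("INSERT", "UPDATE", "DELETE", "REPLACE")
            conn = master if is_write else random.choice(slaves)
            cursor = conn.cursor()
            cursor.execute(sql)
            return cursor.rowcount if is_write else cursor.fetchall()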


September 12, 2007

Scaling Powerset using Amazon's EC2 and S3

The first thing most dot-com companies do before opening their service to the public is set up an infrastructure to provide the service. And though it might sound straightforward to most of you, it can be a very expensive affair. To come up with the right kind of infrastructure for a new service, a few key architectural decisions have to be made.

  1. Design the infrastructure and architecture to handle the traffic peaks (not average)

  2. Design with long term scaling in mind

  3. Design power and cooling infrastructure to support the servers

  4. Hire hardware, systems and network support staff (more if 24/7 operations are required)

  5. Add buffers to support failures and short-term growth requirements

  6. Take into account lead times for ordering and procuring hardware (which can be weeks, if not months)

  7. And a few others, which I won't bore you with here.


The point is that the initial capital investment can run into millions even before the first customer starts using the service. And once that investment is made, it is very difficult to scale down the operation if plans change.

Powerset Inc is a secretive search startup with ambitions of out-smarting Google on its own turf. Based in San Francisco, the company is working on building a better search engine that uses natural language processing to understand the user's question a little better before answering it. And just like any other search company, its technology is a CPU-hungry beast just waiting to be unleashed. Powerset could have gone the way most dot-com companies have gone, but instead they decided to try out Amazon's EC2 (Elastic Compute Cloud) and S3 (Simple Storage Service) to augment their computational needs.

Powerset is reported to be testing a 400-instance EC2 cluster with Hadoop running Map/Reduce and HDFS (Hadoop Distributed File System). This gets a little tricky on EC2 because of the absence of persistent storage (the OS is re-initialized after every reboot), so they use a combination of HDFS and a remote-copy process to sync the data back to their local network. Since there is no charge to move data between EC2 and S3, they have been thinking about a native HDFS-to-S3 interface to move data around within Amazon's network itself.

EC2 is charged on a per-instance, per-hour usage basis, which means Powerset can bring new nodes online during heavy demand and shut off unused nodes at the flick of a switch at night. The Powerset team also built their own EC2 image, configured to automatically join the HDFS cluster after every boot. In the event of a node failure, Hadoop takes care of data replication, and EC2 takes care of replacing the failed node with a new one.

Amazon EC2 costs 10 cents an hour per instance. If you have to run a 400-node cluster for a month, that's only about 30,000 dollars. Based on performance benchmarks, it looks like the actual CPU throughput of each EC2 instance is roughly equivalent to a 1 GHz PIII. 72 dollars a month for that kind of server is not too cheap, but just like leasing a car, at least you don't have to pay upfront or manage the hardware.

So let's do the math. A regular 64-bit AMD server with two dual-core CPUs and about 8 GB of RAM costs about 10,000 USD, which excludes the cost of hosting, power, cooling and maintenance. Based on some comments on the Amazon forum, this is about 2 to 3 times faster than EC2's infrastructure. If you had to replace the CPU power of this hardware with 8 to 12 server instances on EC2, you would spend about 700 to 800 dollars a month. It would take a company close to 12 months before it had paid 10,000 dollars to EC2 for the same amount of computation power. And remember that the 10,000 figure didn't take into account colocation, power, cooling and general server administration, which can be significant as well. Also remember that those 12 months are 12 months of actual computational usage, which could be spread over 2 to 4 years depending on how often the instances are used.
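
The arithmetic above, written out using the 2007 prices quoted in this post (the break-even point is approximate and depends on how many instances you assume):

    # Back-of-the-envelope EC2 cost math from the paragraphs above.
    instance_per_hour = 0.10                      # dollars per EC2 instance-hour
    hours_per_month = 24 * 30

    cluster_monthly = 400 * hours_per_month * instance_per_hour
    print(cluster_monthly)                        # 28800.0 -> "about 30,000" a month

    per_instance_monthly = hours_per_month * instance_per_hour
    print(per_instance_monthly)                   # 72.0 dollars per instance per month

    server_cost = 10000                           # the dual-core, 2-CPU, 8 GB AMD box
    for n in (8, 12):                             # the post estimates 8-12 instances
        monthly = n * per_instance_monthly        # ~576 to ~864 dollars a month
        print(n, round(server_cost / monthly, 1)) # ~17.4 and ~11.6 months to break even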

However, I also have to point out a few things to look out for. The maximum physical memory available is about 1.7 GB, which is relatively tiny if you are used to 8 to 16 GB of RAM on 64-bit hardware. And though CPU/memory might scale horizontally for some applications, cross-server communication can be extremely expensive for others. Unless your application is designed to scale horizontally with under 1.7 GB of RAM, I would seriously advise against using EC2 until you figure out how to change that.

I've blogged about both S3 and EC2 before, and they continue to fascinate me. The success of companies like Slideshare.net, and the decision of companies like Powerset to use AWS, is something I'll watch closely over time.

References
Links about Hadoop, and how to use it on Amazon EC2/S3

Other References about Powerset and Amazon

September 10, 2007

P2P network scalability

Youtube is said to be pushing about 25 petabytes per month, which works out to a sustained data rate of roughly 77 Gbps on average. Bandwidth usage at the peaks would be even higher. Thanks to Limelight Networks, Youtube doesn't really need to scale or provision for that kind of bandwidth itself, and based on some reports from 2006, it cost them close to 4 million a month back then. Services like Youtube have to invest a lot in their infrastructure before they can really launch, and though using a shared content delivery network is not ideal, it's probably not a bad deal. In Youtube's case, it helped them survive until Google bought them out.
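
For the curious, the 25 PB/month to ~77 Gbps conversion works out like this (assuming decimal petabytes and a 30-day month):

    # Sanity check of the 25 petabytes/month ~= 77 Gbps figure.
    petabytes_per_month = 25
    bits = petabytes_per_month * (10 ** 15) * 8   # decimal petabytes, in bits
    seconds = 30 * 24 * 3600                      # a 30-day month
    print(round(bits / seconds / 1e9, 1))         # 77.2 Gbps sustained on average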

Newer Internet television providers, however, need not build their services around the traditional CDN model. The Joost network architecture presentation from Colm MacCarthaigh is an interesting example that proves my point. Joost was founded by the same guys who founded Kazaa and Skype. Kazaa was one of the notorious P2P file-sharing applications (it used the FastTrack protocol) and died after the RIAA backlash. Skype, as it happens, also has its roots in P2P networking [ Skype protocol, Skype scalability problems ] and has been doing pretty well over the years. So it's no surprise that Joost chose a P2P model again to distribute part of the content to its users. Joost has a cluster of servers which act as "original seeders" of all content, and it relies on the P2P network to distribute the popular content. The number of Joost servers, however, is not small, because they still have to handle the "long tail" of requests for content that isn't popular.

Two of the most important network optimization ground rules I noticed from the talk were that they decided against using firewalls and against using loadbalancers in their network. That's good, because the firewalls and loadbalancers wouldn't have kept up with the bandwidth anyway. Even more impressive was that they designed the entire P2P application and network algorithm to intelligently find and peer with the nodes and supernodes closest to them. Joost tries to do this in two different ways. The first is using IP addresses (prefix awareness) as proximity sensors: two IPs which start with a similar set of octets will probably be in the same network. The second way to detect proximity is using network AS numbers, which works irrespective of what the IP addresses start with. [ Colm also mentions AS proximity detection below ]
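
A toy illustration of the "prefer peers whose IP shares a prefix with yours" heuristic; this is my own sketch with made-up documentation addresses, not Joost's actual algorithm:

    # Toy prefix-aware peer selection: prefer peers whose IPv4 address shares
    # the longest leading bit-prefix with ours.
    import socket, struct

    def ip_to_int(ip):
        return struct.unpack("!I", socket.inet_aton(ip))[0]

    def common_prefix_bits(ip_a, ip_b):
        """Number of leading bits two IPv4 addresses share."""
        diff = ip_to_int(ip_a) ^ ip_to_int(ip_b)
        return 32 - diff.bit_length()

    def closest_peers(my_ip, candidate_ips, n=3):
        return sorted(candidate_ips,
                      key=lambda ip: common_prefix_bits(my_ip, ip),
                      reverse=True)[:n]

    print(closest_peers("192.0.2.10", ["192.0.2.99", "198.51.100.7", "203.0.113.5"]))
    # ['192.0.2.99', '198.51.100.7', '203.0.113.5'] -- same-subnet peer ranked first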

A comment on the ipdev.net blog by Colm himself:
We have many gigs of transit, and are adding more. I'm not sure who claimed it's near HD quality, I like to think it's about NTSC, sometimes better, never quite PAL. We have some efforts in the code to save transit costs, there is very very basic prefix awareness, and we're adding AS-level awareness using live BGP data. I have looked at adding AS adjacency information, ie prefer AS-adjacent peers, but it's a lot of work and the US internet is relatively poorly mapped, so I don't think this will come soon.


It's possible that Joost might still require CDNs to serve the long-tail content, but the work they have done to build the P2P infrastructure will not only save them a lot of moolah in the long run, it will also allow them to scale to be larger than any of the current CDNs if they do get that big.

Interestingly, companies like Microsoft are not sitting idle watching the world go by. Microsoft has been working on something called Avalanche, and I think they already have a prototype client out which you can download and try yourself.

Microsoft Secure Content Downloader


Some MSCD clients may be connected to each other via peer connections, forming a ‘cloud’ of clients. Pieces of the file you are downloading are sent through these peer connections between clients, as well as through connections with the file server. As a member of the cloud, your computer both serves as a client and server to other members of the cloud. Data destined for the cloud may be routed through your computer and sent to other cloud members. The other cloud members connected to you will be able to access only pieces of the file you are downloading via MSCD – they have no access to any other data on your computer.

You are only connected to other clients while you are downloading a file via MSCD. When the file has finished downloading – or when you pause or cancel the download, or exit the application – you disconnect from the cloud. Once you disconnect from the cloud, you will no longer have any connections to any other members in the cloud and no data will be routed through your computer. The Microsoft Secure Content Downloader (MSCD) is a peer-assisted download manager capable of securely downloading specific files. MSCD is intended for consumers who are downloading from a home PC, or business users whose computers are not behind a corporate firewall. If you use MSCD from behind a corporate firewall, you may be unable to download content, and may adversely affect other clients' ability to download content.

Of course, there are also rumors that Apple is trying this out... but you know how these things go.


Anyway, the point is that in spite of occasional glitches, P2P is probably the way to go if you want to cut long-term CDN costs. Personally, I believe Skype had no other way out. I mean, can you imagine all the phone calls in the world going through the first phone exchange in New Haven, Connecticut, where it all started? P2P models are still evolving, and it's hard to imagine there will be a one-solution-fits-all. But if you know of one, please let me know.

September 09, 2007

Sharding: Different from Partitioning and Federation ?

I've been hearing the word "sharding" more and more often, and it's spreading like wildfire. Theo Schlossnagle, the author of "Scalable Internet Architectures", argues that federation is a form of partitioning, and that sharding is nothing but a form of partitioning and federation. In fact, according to him, sharding has already been in use for a long time.

I'm not a dba, and I don't pretend to be one in my free time either, so to understand the differences I did some research and found some interesting posts.

The first time I heard about "sharding" was on Been Admininig's blog, in the posts about an "Unorthodox approach to database design" (Part I and Part II). Here is the exact reference:
Splitting up the user data so that User A exists on one server while User B exists on another server, each server now holds a shard of the data in this federated model.

A couple of months ago Highscalability.com picked it up and made it sound (probably unintentionally) like sharding is actually different from federation and partitioning. Todd's post also points at Flickr using sharding. The search for the Flickr architecture led me to Colin Charles' post "Federation at Flickr: A tour of the Flickr architecture", where he mentions shards as one of the key components of federation. Again, no mention of sharding being anything new.
Federation Key Components:

  • Shards: When making a comment on someone else's blog, my data gets stored on my shard, but the record of performing the action on your comment is on your shard.

  • Global Ring: It's like DNS, you need to know where to go and who controls where you go. On every page view, calculate where your data is at that moment in time.

  • PHP logic to connect to the shards and keep the data consistent (10 lines of code with comments!)



Based on the discussions on these and other blogs, "shards" sounds more like a term used to describe the fragments of data which are federated across multiple databases than an architecture by itself. I think Theo Schlossnagle has a valid argument. If any of you disagree, I'm interested to hear what you have to say. A clearer distinction between sharding and federation would be very helpful as well.
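
To make the terminology concrete, here is a minimal, purely hypothetical sketch of routing a user to a shard; note that Flickr's "global ring" does this lookup dynamically rather than with a fixed modulo, and the hostnames below are made up:

    # Hypothetical shard lookup: the user id decides which database holds
    # that user's data.
    SHARDS = {
        0: "db-shard-0.example.com",
        1: "db-shard-1.example.com",
        2: "db-shard-2.example.com",
    }

    def shard_for_user(user_id):
        return SHARDS[user_id % len(SHARDS)]

    print(shard_for_user(42))   # db-shard-0.example.com: all of user 42's rows live there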

Here are more references to Shard/Sharding.



      September 07, 2007

      Adventures of scaling eins.de

      Patrick Lenz, founder and lead developer of freshmeat.net, was also responsible for the relaunch of another website, eins.de, which recently moved from PHP to Ruby. The eins.de site serves about 1.2 million dynamic pages a day. He wrote a series of articles describing how they redesigned the site to scale for growth. I found these articles very informative, with an extremely mature discussion of the colorful world of scalability. Here is the final summary of what they ended up with after 4 months of optimizations:

      Systems optimization

      Code optimization

      September 06, 2007

      Scalability stories from Sept 6th 2007


      1. A 55-minute talk by Stewart Smith from MySQL AB about MySQL Cluster. He talks about the NDB storage engine and synchronous replication between storage nodes, as well as new features in 5.1 including cluster-to-cluster replication, disk-based data and a bunch of other things. There is also another MySQL talk at Google about performance tuning best practices for MySQL.

      2. An interesting talk with Leah Culver about how Pownce was created. They use a LAMP (Python) stack with Perlbal, memcached, Django and AIR, with Amazon S3 as the backend storage.

      3. A discussion about High-Availability Mongrel Packs using Seesaw.

      4. A blog post about the "Future of Data Center Computing" mentions Terracotta Sessions. If you read my previous post about "Session, state and scalability" and understood the problem, do look at this as a possible solution as well.

      5. EC2 and S3 are being used more than ever. Unfortunately, because storage on EC2 doesn't persist across reboots, creative ways of keeping the data alive have to be found. Here is a talk about redundant MySQL replication using EC2 and S3.

      6. Found an interesting post by IBM engineers on how to set up a web cluster in 5 easy steps.



      September 01, 2007

      Session, state and scalability

      In my other life I work with a medium scale web application which has had many different kinds of growing problems over time. One of the most painful one is the issue about "statelessness". If I could only give one recommendation to anyone building a brand new web application, I'd say "go stateless". But going stateless is not the same as going session-less. One could implement a perfectly stateless web architecture which still uses sessions to authenticate, authorize and track user activity. And to complicate matters further, when I say stateless, I really mean that the server should be stateless, not the client.

      Basic authentication
      Most interactive web applications today which allow you to manipulate data in some form require authentication and authorization mechanisms to control access. In the good old days, "Basic Authentication" was the most commonly used mechanism. Once authenticated, the browser would send the credentials to the server with every subsequent HTTP request. Over time, cookie- and URL-based session tracking mechanisms took over, which is what we are using in most of our web-based applications today. In a one-server farm, you probably don't need to pay attention to how these authentication, authorization or tracking mechanisms work. But once you have multiple servers in the farm, you do need to verify that a client authenticated to one of the servers is considered authenticated by all of them. Since, with "Basic Authentication", the browser sends the credentials with every request, it doesn't matter which server a session's requests go to. Each request is treated as a new session, and credentials are verified every single time.
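
      A bare-bones sketch of why Basic Authentication is farm-friendly: the credentials arrive in the Authorization header of every single request, so any server can verify them independently. This is my own illustration of the standard header format; `verify_password` is a placeholder for however you check credentials.

          # With Basic Authentication, each request carries "Basic <base64(user:pass)>"
          # in the Authorization header, so no shared session state is needed.
          import base64

          def check_basic_auth(authorization_header, verify_password):
              if not authorization_header.startswith("Basic "):
                  return None
              encoded = authorization_header[len("Basic "):]
              username, _, password = base64.b64decode(encoded).decode().partition(":")
              return username if verify_password(username, password) else None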

      Transition to sessions
      As traffic and security concerns grew, some of the smarter web applications switched to using "sessionids" in cookies and URLs. In this design, the server issues a sessionid (like a train/plane ticket) for every successful authentication and asks the client to send that sessionid, instead of the authentication credentials, with every subsequent request. This not only reduces the chances of someone sniffing your credentials, it also reduces the work done on the server, because all it has to do now is check whether it issued the sessionid in the last few minutes/hours (and whether it is still active), instead of validating the user against all the users in the database for every single request.

      Session replication
      To understand which HTTP requests are part of which session, the server has to issue a unique session identifier for each user session. As it happens, most web application servers already maintain an internal "sessionid" for each session, so inserting this sessionid into the cookie/URL seemed like the most natural thing to do. Unfortunately, if you have done any web programming, you will notice that unless you do something special, these default sessionids are local to each application server. Without special magic, like setting up the application server in clustering mode or persisting session information to a shared resource (database/NFS), it is difficult for one server to know which sessions are active on the other application servers. This is where the scalability problem comes in. Replicating a lot of session information between 2 nodes might be simple, and 10 nodes might be possible, but this architecture is not very scalable without a little extra work.

      Anatomy of a session
      "Sessions", have three primary purposes.

      1. It helps group together HTTP requests coming from the same browser.

      2. It can cache authentication credentials and key user info after the user authenticates.

      3. It allows caching of other information in the session which might be temporary in nature.


      If you have ever tried to create a new user account on a website, chances are that you went through a couple of pages before the registration was completed. Most web applications keep the information from intermediate pages in temporary session variables within the application server. Developers love to use temporary session variables for such information, and this approach can grow into a beast which cannot be replicated or shared via a shared resource. To design a scalable website, not only does one have to persist the sessionid and authentication credentials to some kind of shared resource, one also has to figure out what to do with these temporary session variables.

      Since a session is created only once, and since authentication happens only once too, it should be possible to keep that kind of information in a central database or a shared network cache like memcached. The temporary session variables, on the other hand, can be very noisy (they change frequently) and may not be critical enough to need to be available to the other servers. One needs to find a balance between good performance and reliability. I know organizations which have decided to accept some level of data loss in temporary session variables to improve overall performance.

      But the key thing to remember here is that every time a session is created or authenticated, it should be copied over to some shared, scalable resource, such as a cluster of memcached servers or a cluster of replicated database servers. This guarantees that if the user does switch application servers, the requests can be authenticated by checking this shared resource before continuing with what they were doing.
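
      A minimal sketch of the "copy the session to a shared cache" idea using the python-memcached client; the hostnames, key layout and field names here are made up, and a real setup would also need expiry and invalidation policies.

          # Keep authenticated-session records in a shared memcached cluster so
          # any app server in the farm can validate a sessionid.
          import memcache, uuid

          mc = memcache.Client(["cache1:11211", "cache2:11211"])

          def create_session(user_id, ttl=3600):
              sessionid = uuid.uuid4().hex
              mc.set("session:" + sessionid, {"user_id": user_id}, time=ttl)
              return sessionid                 # handed back to the browser in a cookie

          def validate_session(sessionid):
              return mc.get("session:" + sessionid)   # None means expired or unknown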

      Session vs State
      The information about where the user is or was within an application at any given moment is called "state". If that information is kept only on the server and is not replicated, you will probably see errors of some kind when a server fails or the user is switched to another server. If the application server does persist the sessionid and user credentials in a central database, then, as far as the session is concerned, it doesn't matter which server the user goes to. Most new web applications today try to maintain the state information within the browser itself, which frees the server from the responsibility of storing, maintaining and replicating state information across all the servers.

      Final thoughts

      Loadbalancers: If you are doing what I did, you'd be calculating the probability of a user switching servers mid-session. In our environment we see about one in 50 servers fail every month. Such failures force all sessions created on that server over to another server. Occasionally we also see issues with the loadbalancer, which can reset all sticky sessions across all servers. That isn't too bad. But if you are running a CPU-intensive web application, session stickiness can at times make some servers more loaded than others, and if your traffic grows enough to need multiple loadbalancers, or you have multiple global sites running in active-active mode, you will eventually need the ability to handle server switching/failovers.

      REST (Representational State Transfer): The REST architecture does recommend some of the ideas I addressed above, but in its strictest interpretation one is not allowed to use cookies. In my personal opinion, a cookie-less world, though ideal, is not pragmatic.