November 10, 2007

Scaling Early: Feedjit

Mark Maunder from Feedjit gave an interesting presentation about scaling early. He focuses on some of the key operational issues around web servers and server-side caching, which I found very interesting.


November 04, 2007

MySQL on HDFS

A short, thought-provoking post by Mark Callaghan about running MySQL over HDFS. It's probably not ideal, but it's an interesting thought regardless.

October 24, 2007

Scaling Technorati - 100 million blogs indexed every day

Indexing 100 million blogs with over 10 billion objects, and with a user base that is doubling every six months, Technorati has an edge over most blog search engines. But they are much more than search, as any Technorati user can tell you. I recommend you read John Newton's interview with David Sifry, which I found fascinating. Here are the highlights from the interview if you don't have time to read the whole thing.

  • Current status of technorati

    • 1 terabyte a day added to its content storage

    • 100 million blogs

    • 10 billion objects

    • 0.5 billion photos and videos

    • Data doubling every six months

    • Users doubling every six months

  • The first version was supposed to track temporal information on a low budget.

    • That version put everything in a relational database, which was fine since the index sizes were smaller than physical memory

    • It worked fine until about 20 million blogs

  • The next generation took advantage of parallelism.

    • Data was broken up into shards

    • Synced up frequently between servers

    • The database reached the size of the largest known OLTP systems.

      • Writing as much data as reading

      • Maintaining data integrity was important

        • This put a lot of pressure on the system

  • The third generation

    • Shards evolved

      • The shards were based on time instead of urls

      • They moved content to special purpose databases instead of relational database

    • Don't delete anything

    • Just move shards around and use a new shard for latest stuff

  • Tools used

    • Greenplum - enables enterprises to quickly access massive volumes of critical data for in-depth analysis. Purpose-built for high-performance, large-scale BI, Greenplum's family of database products comprises solutions suited to installations ranging from departmental data marts to multi-terabyte data warehouses.

  • Should have done sooner

    • Should have invested in click stream analysis software to analyze what clicks with the users

      • Can tell how much time users spend on a feature
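The time-based sharding idea from the interview notes above (route each post to a shard by its timestamp bucket, open a fresh shard for the newest content, and never delete) can be sketched roughly like this. This is a toy model of my own; Technorati's actual implementation is not public.

```python
import bisect

class TimeSharder:
    def __init__(self):
        self.boundaries = []  # sorted epoch seconds where each shard begins
        self.shards = []      # shard i holds posts with ts >= boundaries[i]

    def open_shard(self, start_ts):
        """Start a fresh shard for everything from start_ts onward."""
        self.boundaries.append(start_ts)
        self.shards.append([])

    def insert(self, ts, post):
        """Route a post to the shard owning its timestamp; never delete."""
        i = bisect.bisect_right(self.boundaries, ts) - 1
        if i < 0:
            raise ValueError("timestamp predates the first shard")
        self.shards[i].append(post)
        return i
```

Old shards become read-only archives that can be moved between servers wholesale, which is what makes "don't delete anything, just move shards around" workable.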

October 22, 2007

Scalability stories for Oct 22, 2007

  • Why are most large-scale sites that scale not written in Java? - (What nine of the world's largest websites are running on) - A couple of very interesting blogs to read.
  • Slashdot's setup Part 1 - Just in time for the 10 year anniversary.
  • Flexiscale - Looks like an amazon competitor in the hosting business.
  • Pownce - Lessons learned  - Lessons learned while developing Pownce, a social messaging web application
  • Amazon's Dynamo - Dynamo is internal technology developed at Amazon to address the need for an incrementally scalable, highly-available key-value storage system. The technology is designed to give its users the ability to trade-off cost, consistency, durability and performance, while maintaining high-availability. [ PDF ]. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.  To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
  • Client-side load balancing - Not a good idea if you don't know what you are getting into, but there are certain kinds of applications which can benefit from it. BTW, DNS, SMTP, and NTP already have some kind of client-side load balancing implemented.

October 21, 2007

Crawling sucks.

I wrote my first crawler, a few lines of Perl code to spider a website recursively, about 10 years ago. Two years ago I wrote another crawler, a few thousand lines using Java + PHP and MySQL. This time I wasn't really interested in competing with Google; instead I crawled feeds (RSS/Atom). Google hadn't released its blog search engine at that time. Using the little Java knowledge I had and 3rd-party packages like Rome and some HTML parsers, I hacked up my first feed crawler in a matter of days. The original website allowed users to train the Bayesian-based engine to teach it what kind of news items they liked to read, and it automatically tracked millions of feeds to find the best content for them. After a few weeks of running that site, I had a renewed appreciation for the search engineers who breathe this stuff day in and day out. That site eventually went down... mostly due to design flaws I made early on, which I'm listing here for those who love learning.

  • Seeding problem - DMOZ might have a good list of seed URLs for a traditional crawler, but there wasn't a DMOZ-like public data source for feeds which I could use. I ended up crawling sites providing OPMLs, and sites like for new feeds.

  • Spamming - It dawned on me pretty fast that sites like were probably not the best place to look for quality feeds. Every Tom, Dick, and Harry was pinging, and so were all the spammers. The spam statistics blew me away: 60% of blogs were spam, according to a few analysts in 2005. I'm sure this number has gone up since.

  • The size - I thought just crawling feeds would be easy to manage, since that doesn't require images/CSS/etc. to be archived. But I was so wrong. I crossed 40GB of storage within a week or two of crawling. I could always add new hard disks, but without the ability to detect spam reliably, and without a scalable search platform, this blog search engine was DOA.

  • I also underestimated the number of posts I was collecting. I had to increase the byte size for some of the IDs in Java and MySQL.

  • Feed processing/searching - MySQL is fast, but in the hands of a totally untrained professional like me, it can be a ticking time bomb. Though I made a good start, I struggled to get a grip on how indexing and inner/outer joins work. I had underestimated the complexities of databases.

  • Threading - I over-designed the threaded application without investing enough time to understand how threading works in Java. That, combined with a few caching features I created, became the memory leak I so much wanted to avoid. It was a mess :)

  • I liked PHP for its simplicity, and Java for its speed. Unfortunately, my attempt to design the UI in PHP and leave the backend in Java didn't work very well because of my lack of experience with PHP-Java interoperability.

  • Feed update frequency - Some feeds update faster than others. Calculating when you need to crawl next is an interesting problem in itself, especially because some feeds update more frequently at certain times of the day. Apparently Google Reader's backend checks for feed updates about every hour. So if you have 10 million feeds to crawl, that's about 2777 feed requests per second. There is no way I could do that from a single machine in my basement.

  • The most annoying problem I had was not really my fault :). I own a Belkin wireless router which became extremely unstable when my crawlers ran. I had to resort to daily reboots of the device to solve the problem. And on busy days it required two.
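The feed-update arithmetic above, plus a toy adaptive polling schedule, can be sketched like this. The scheduler heuristic (poll at twice a feed's historical update rate, clamped to sane bounds) is my own illustration, not any particular engine's algorithm.

```python
def requests_per_second(num_feeds, poll_interval_hours=1):
    """Fetches/sec needed to poll every feed once per interval."""
    return num_feeds / (poll_interval_hours * 3600)

def next_poll_interval(update_timestamps, min_s=600, max_s=86400):
    """Poll a feed roughly as often as it historically updates.
    Expects a sorted list of epoch seconds of past updates."""
    if len(update_timestamps) < 2:
        return max_s  # no history yet: poll lazily
    gaps = [b - a for a, b in zip(update_timestamps, update_timestamps[1:])]
    avg_gap = sum(gaps) / len(gaps)
    # sample at twice the observed update rate, clamped to [min_s, max_s]
    return max(min_s, min(max_s, avg_gap / 2))

print(round(requests_per_second(10_000_000)))  # ~2778 fetches/sec for 10M feeds
```

That sustained fetch rate, before you even parse anything, is why single-machine feed crawling hits a wall so quickly.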

The reason I'm not embarrassed to blog about my mistakes is that I'm not a developer to begin with. The second reason is that I'm about to take a second shot at it, to see if I can do it better this time. The objective is not to build another search engine, but to understand and learn from my mistakes and do it better.

The first phase of this learning experience resulted in what you see currently on The initial prototype displayed here does limited crawling to gather feed updates. The feed update algorithm is already a little smarter (it updates some feeds more often than others). The quality of the feeds should also be better because of the human behind the engine who manually adds feeds to the database. I also realized soon after my first attempt that I should have investigated JSON to make Java and PHP talk to each other. In the current version of Blogofy the core engine is in PHP, except for the indexing/storage engine, which will soon move to Solr (which is Java-based). PHP has JSON-based hooks to talk to Solr, which seems to work very well. Solr, in case you didn't know, is a very fast Lucene-based search engine which does much better than MySQL for the kind of search operations I'll be doing. And yes, I replaced the Belkin router with an Apple wireless one... so let's see how that works out.
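For what it's worth, the PHP-to-Solr conversation works because Solr answers plain HTTP queries and can return JSON (`wt=json`), so no PHP-Java bridge is needed. A sketch of building such a query URL; the host and port are just Solr's usual defaults, not Blogofy's actual setup:

```python
from urllib.parse import urlencode

def solr_query_url(base, q, rows=10):
    """Build a Solr search URL that asks for JSON output (wt=json)."""
    params = urlencode({"q": q, "rows": rows, "wt": "json"})
    return f"{base}/select?{params}"

print(solr_query_url("http://localhost:8983/solr", "scalability"))
# http://localhost:8983/solr/select?q=scalability&rows=10&wt=json
```

Any language with an HTTP client and a JSON parser can consume the response, which is exactly what makes the PHP front end / Java back end split workable.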

Will send more updates if I make progress. If any of you have ideas on what else I should be doing please let me know.

October 16, 2007

EC2 for everyone. And now with 64-bit instances and 15GB of RAM too.


Finally it happened. EC2 is available to everybody. More than that, they now provide servers with 7.5GB and 15GB of RAM per instance. Sweet.

For a lot of companies EC2 was not viable due to the high memory requirements of some applications. Splitting up such tasks to use less memory on multiple servers was possible, but not really cost- or time-efficient. The release of the new instance types removes that roadblock and will probably spark significant curiosity among memory-hungry application developers.

$0.10 - Small Instance (Default)

    1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform

$0.40 - Large Instance

    7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform

$0.80 - Extra Large Instance

    15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform

Data Transfer

    $0.10 per GB - all data transfer in

    $0.18 per GB - first 10 TB / month data transfer out
    $0.16 per GB - next 40 TB / month data transfer out
    $0.13 per GB - data transfer out / month over 50 TB

      October 15, 2007

      Web Scalability dashboard

      [Blogofy: bringing feeds together ]

      I took a week's break from blogging to work on one of my long-overdue personal projects. Even though I use Google Reader as my feed aggregator, I've noticed a lot of folks still prefer a visual UI to track news and feeds. My experiments with designing such a visual UI to track feeds led me to create Blogofy.

      If you have an interesting blog on web scalability, availability or performance which you want included here, please let me know. The list of blogs on the page is in flux at the moment, and I might move the feeds around a little depending on user feedback and blog activity.

      September 28, 2007

      Scalable products: KFS released

      Kosmix, a search startup, has released the source to a C++ implementation of what looks like a clustered file system. It looks very similar to Hadoop/HDFS, but the C++ factor could be a big performance boost.

      From Skrenta blog

        • Incremental scalability - New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.

        • Availability - Replication is used to provide availability due to chunk server failures.

        • Re-balancing - Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is done to help with balancing disk space utilization amongst nodes.

        • Data integrity - To handle disk corruptions to data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.

        • Client side fail-over - During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail-over to another chunkserver and continue the read. This fail-over is transparent to the application.

        • Language support - KFS client library can be accessed from C++, Java, and Python.

        • FUSE support on Linux - By mounting KFS via FUSE, this support allows existing Linux utilities (such as, ls) to interface with KFS.

        • Leases - KFS client library uses caching to improve performance. Leases are used to support cache consistency.

      If anyone has experience with KFS, or has more information please leave a comment here.

      September 22, 2007

      What is scalability?

      When asked what they mean by scalability, a lot of people talk about improving performance, about implementing HA, or even about a particular technology or protocol. Unfortunately, scalability is none of those. Don't get me wrong. You still need to know all about speed, performance, HA technology, application platforms, networks, etc. But that is not the definition of scalability.

      Scalability, simply, is about doing what you do in a bigger way. Scaling a web application is all about allowing more people to use it. If you can't figure out how to improve performance while scaling out, that's okay. And as long as you can scale to handle a larger number of users, it's okay to have multiple single points of failure as well.

      There are two primary ways of scaling web applications in practice today.

      • "Vertical scalability" - Adding resources within the same logical unit to increase capacity. Examples: adding CPUs to an existing server, or expanding storage by adding hard drives to an existing RAID/SAN.

      • "Horizontal scalability" - Adding multiple logical units of resources and making them work as a single unit. Most clustering solutions, distributed file systems, and load balancers help with horizontal scalability.

      Every component, whether it's processors, servers, storage drives or load balancers, has some kind of management/operational overhead. When you try to scale, it's important to understand what percentage of the resource is actually usable. This measurement is called the "scalability factor". If you lose 5% of processor power every time you add a CPU to your system, then your scalability factor is 0.95. A scalability factor of 0.9 means you will only be able to use 90% of the resource.
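The scalability-factor arithmetic is simple enough to sketch, under the simplifying assumption that the factor applies uniformly to every unit you add:

```python
def usable_capacity(units, per_unit_capacity, scalability_factor):
    """Effective capacity when each unit only contributes
    `scalability_factor` of its raw capacity."""
    return units * per_unit_capacity * scalability_factor

# 8 CPUs at a scalability factor of 0.95: only ~7.6 CPUs' worth is usable.
print(usable_capacity(8, 1.0, 0.95))  # 7.6
```

The interesting question when capacity planning is then how the factor itself behaves as you add units, which is what the sub-classification below captures.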

      Scalability can be further sub-classified based on the "scalability factor".

      • If the scalability factor stays constant as you scale, you have "linear scalability".

      • But chances are that some components won't scale as well as others. A scalability factor below 1.0 is called "sub-linear scalability".

      • Though rare, it's possible to get better performance (a higher scalability factor) just by adding more components (I/O across multiple disk spindles in a RAID array gets better with more spindles). This is called "supra-linear scalability".

      • If the application is not designed for scalability, it's possible that things actually get worse as it scales. This is called "negative scalability".

      If you need scalability urgently, going vertical is probably the easiest route (provided you have the bank balance to go with it). In most cases, without a single line of code change, you can drop your application onto a super-expensive 64-CPU server from Sun or HP, with storage from EMC, Hitachi or NetApp, and everything will be fine. For a while at least. Unfortunately, vertical scaling gets more and more expensive as you grow.

      Horizontal scalability, on the other hand, doesn't require you to buy ever more expensive servers. It's meant to be achieved with commodity storage and server solutions. But horizontal scalability isn't cheap either. The application has to be built from the ground up to run on multiple servers as a single application. Two interesting problems which most applications in a horizontally scalable world have to worry about are "split brain" and "hardware failure".

      While infinite horizontal linear scalability is difficult to achieve, infinite vertical scalability is impossible. If you are building capacity for a pre-determined number of users, it might be wise to investigate vertical scalability. But if you are building a web application which could be used by millions, going vertical could be an expensive mistake.

      But scalability is not just about CPU (processing power). For a successful scalable web application, all layers have to scale equally well. That includes the storage layer (clustered file systems, S3, etc.), the database layer (partitioning, federation), the application layer (memcached, ScaleOut, Terracotta, Tomcat clustering, etc.), the web layer, the load balancer, the firewall, and so on. For example, if you don't have a way to deploy multiple load balancers to handle your future web traffic, it doesn't really matter how much money and effort you put into horizontal scalability of the web layer. Your traffic will be limited to whatever your load balancer can push.

      Choosing the right kind of scalability depends on how much you want to scale and how much you want to spend. In fact, if someone says there is a "one size fits all" solution, don't believe them. And if someone starts a "scalability" discussion at the next party you attend, please do ask them what they mean by scalability first.


      1. Cost and Scalability

      2. My linear scalability is bigger than yours

      September 18, 2007

      Scaling SmugMug from startup to profitability

      SmugMug, a 5-year-old company with just 23 employees, has 315,000 paying customers and 195 million photographs. CEO & "Chief Geek" Don MacAskill has a nice set of slides in which he talks about the 5-year journey from small startup to profitable business. The talk was given during Amazon's "Startup Project", so it mostly covers how SmugMug uses AWS (Amazon Web Services).

      Besides his wonderful, fun-loving employees, who are also "super heroes" in his eyes, he talks about how they are doubling their storage requirements on a yearly basis. They already have about 300 TB in use, and as of today all of it is on Amazon's S3. Don estimates that, for the storage they are using, they are saving about $500K per year, which is pretty big for a small operation like theirs.

      The SmugMug architecture has evolved over time. Internally it can serve images in 3 different ways.

      1. It can "proxy" the request, where the app server gets the object from storage and serves it to the client.

      2. It can "redirect", where the app server redirects the user to the right resource.

      3. Or it can serve images directly from storage using REST-based APIs.
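The three serving modes can be sketched as one dispatch function. The storage API and response helper here are hypothetical stand-ins, not SmugMug's code:

```python
def serve_image(mode, key, storage, make_response):
    if mode == "proxy":
        # App server fetches the object from storage and streams it itself.
        return make_response(body=storage.get(key), status=200)
    elif mode == "redirect":
        # App server answers with a redirect to the storage URL.
        return make_response(status=302,
                             headers={"Location": storage.url_for(key)})
    elif mode == "direct":
        # Client is handed the storage (e.g. S3 REST) URL up front;
        # the app server never touches the bytes at all.
        return storage.url_for(key)
    raise ValueError(f"unknown mode: {mode}")
```

Proxying costs the most app-server bandwidth but gives the most control (e.g. for permission checks); direct serving is the cheapest but cedes control to the storage layer.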

      This flexibility allowed SmugMug to try different ways of using internal and external/S3 storage. Interestingly, even though they wanted to get away from self-managed internal storage, they noticed that S3 storage wasn't very cheap once you take bandwidth utilization into account. After various permutations and combinations, they settled on a system where S3 remains the primary storage, but 10% of the hottest objects are kept locally on SmugMug's own servers to minimize bandwidth to the S3 service. This let them avoid buying 95% of the storage drives they were originally planning to buy. 5% of 300TB is still 15TB, which is smaller but unfortunately not small enough.

      Though storage management is easy with S3, it can at times make things difficult. Permission management was one area where they had to sacrifice speed/performance in favour of the "proxy" mechanism, which is the more robust/reliable way of serving protected objects.

      On reliability, Don mentioned that having multiple single points of failure is not really helpful if you want to provide near-100% availability. With S3 in the picture, not only do they have to worry about connectivity from the customer to SmugMug and SmugMug's own servers, they also have to worry about connectivity to Amazon and round-the-clock availability of Amazon's services. Hence they had to design their app from the ground up to handle failures gracefully. For example, write failures to S3 are handled by recording the change locally and syncing it asynchronously. At other times, when things break, the app is designed to either try again or alert the right folks so that action can be taken.
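The "record locally, sync asynchronously" pattern for S3 write failures might look roughly like this. The S3 client and the local journal are illustrative stand-ins, not SmugMug's actual code:

```python
import queue

def write_with_fallback(s3_put, journal: queue.Queue, key, data):
    """Try S3 first; on failure, journal the write for an async retry."""
    try:
        s3_put(key, data)
        return "stored"
    except OSError:
        journal.put((key, data))  # a background worker drains this later
        return "journaled"

def drain_journal(s3_put, journal: queue.Queue):
    """Replay journaled writes once S3 is reachable again."""
    replayed = 0
    while not journal.empty():
        key, data = journal.get()
        s3_put(key, data)
        replayed += 1
    return replayed
```

The point is that the user-facing write succeeds either way; durability in S3 is achieved eventually rather than synchronously.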

      While talking about nice-to-have features, he mentioned that S3 shouldn't be confused with a CDN. It's not a distributed caching service, and it doesn't have global locations like a CDN does. Regardless, he said, S3 probably should have limited cache or streaming ability, which would boost performance and add value to an already invaluable service.

      On the topic of SmugMug's future, Don mentioned they are flirting with the possibility of using EC2 in the near future. EC2 would probably be used for image-processing computation nodes. Since the EC2 servers are located in Amazon's facilities, this would save them the bandwidth of transferring data from S3. And since they can turn off server instances on demand (and not pay for them while they're offline), it would probably cut the operating costs of maintaining image-processing servers as well.

      At the end he mentioned a few services he would like to see Amazon offer in the future. Two important ones (which I also think are critical) are some kind of database service, and load balancer APIs.

      PDF Slides of the same presentation here (34 slides in this one)

      September 15, 2007

      Custom search engine to search your OPML and Delicious bookmarks

      Zoppr is a custom search engine which allows you to create a custom Google search engine on the fly, by pointing it at your bookmark page, wiki page, or any other page with lots of interesting bookmarks/links on it. Once set up, Google will search only across your bookmarks/links. For example, this URL will help you search across an OPML file published somewhere on the internet.

      Scalable web architectures

      If you haven't noticed already, there is a second blog I maintain which is currently busier than this one. "Scalable web architectures" is a collection of posts about web architectures which scale, and the technologies which make that happen.

      Here are some of the posts on that blog

 site serves about 1.2 million dynamic pages a day. He wrote a series of articles describing how they redesigned the site to scale for growth. I found these articles very informative, with an extremely mature discussion of the colorful world of scalability.

        Session, state and scalability

        If I could only give one recommendation to anyone building a brand new web application, I’d say “go stateless“. But going stateless is not the same as going session-less. One could implement a perfectly stateless web architecture which still uses sessions to authenticate, authorize and track user activity. And to complicate matters further, when I say stateless, I really mean that the server should be stateless, not the client.
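One common way to get the stateless-server-with-sessions combination described above is to push the session into a signed token or cookie, so that any server in the pool can verify it without shared session storage. A minimal sketch; a real deployment would also add expiry, and possibly encryption:

```python
import hmac, hashlib, base64, json

SECRET = b"rotate-me"  # shared by every web server in the pool

def issue_session(payload: dict) -> str:
    """Serialize and sign a session so any server can validate it."""
    body = base64.urlsafe_b64encode(json.dumps(payload).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return body.decode() + "." + sig

def read_session(token: str):
    """Return the session payload, or None if the token was tampered with."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return None
    return json.loads(base64.urlsafe_b64decode(body))
```

Because the state lives with the client and validation needs only the shared secret, any server can handle any request, which is exactly what "the server should be stateless, not the client" is after.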

        Loadbalancer for horizontal web scaling: What questions to ask before implementing one.

        Load balancers, by definition, are supposed to solve performance bottlenecks by distributing or balancing load between the components they manage. Though you would normally find load balancers in front of web servers, a lot of people have found other interesting ways of using them.

      Scalability Stories (15th Sept): MySQL Proxy, Cluster File Systems, Facebook apps and Twitter

      There have been a lot of interesting stories from the last week for me to share. If you have interesting links you want to add to this post, please forward them to me or post a comment.

      • Sun is planning to acquire majority stake of "Cluster File Systems, Inc". [ Talk on Lustre File System ]

        • Sun intends to add support for Solaris Operating System (Solaris OS) on Lustre and plans to continue enhancing Lustre on Linux and Solaris OS across multi vendor hardware platforms. As previously announced in July 2007, Sun also plans to deliver Lustre servers on top of Sun's industry-leading open source Solaris ZFS solution, which is one of the fastest growing storage virtualization technology in the marketplace.

      • Making Facebook apps scale on the cheap: an interesting writeup by Surj Patel about the scalability issues Facebook itself and the 3rd-party apps on it have. Also discusses EC2 and S3 as an alternative way to scale cost-effectively.

        • Welcome to Amazon and S3 and EC2 — processing power (EC2) and storage (S3) on demand. These services let you access computational power and storage only when you need it and, better yet, pay only for what you use. The last time I checked, it was 10 cents an hour for the server, 10 cents for every gigabyte of data written and 18 cents per gigabyte read out – all for a virtual box with 1.7Ghz x86 processor/1.75Gbytes of RAM/250Mbs of bandwidth. Nor are you limited to one usage; use as many as you need or want and can afford.

      • Interesting blog post on Google Reader Numbers. They have made significant progress lately and thanks to the scalable architecture they now store 10 terabytes of raw feed data from 8 million feeds in their index.

      • Todd Hoff has an interesting writeup on Scaling Twitter: Making Twitter 10000 Percent Faster. And an interview with Biz Stone (Co-Founder of Twitter) here.

      • If you use MySQL and your app is not yet designed for a federated database architecture, you should take a look at a new product in development called "MySQL Proxy":
          The most powerful feature is Read/Write Splitting, which allows you to automatically scale an application that is unaware of replication across several slaves, without changes to your application. Instance Scale Out, we say. The Proxy also became a 1st-class citizen in the MySQL world, with full docs, win32 support and an easy install.
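The read/write-splitting idea is easy to sketch: route each statement to the master or a replica based on the SQL verb, so the application stays replication-unaware. The connection objects here are hypothetical stand-ins, and this is only an illustration of the concept, not MySQL Proxy's actual implementation:

```python
import itertools

def make_router(master, slaves):
    """Return a function that maps a SQL statement to a connection."""
    rr = itertools.cycle(slaves)  # round-robin over the read slaves
    def route(sql: str):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb == "SELECT":
            return next(rr)   # reads can be spread across slaves
        return master         # writes (and everything else) hit the master
    return route
```

The usual caveat applies: a read issued right after a write may hit a slave that hasn't replicated the write yet, so latency-sensitive read-after-write paths often have to be pinned to the master.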

      September 12, 2007

      Scaling Powerset using Amazon's EC2 and S3

      The first thing most dot-com companies do before going live is set up the infrastructure to provide the service. And though it might sound straightforward to most of you, it can be a very expensive affair. To come up with the right kind of infrastructure for any new service, a few key architectural decisions have to be made.

      1. Design the infrastructure and architecture to handle the traffic peaks (not average)

      2. Design with long term scaling in mind

      3. Design power and cooling infrastructure to support the servers

      4. Hire Hardware+Systems+network support staff ( more if 24/7 operations are required)

      5. Add buffers to support failures and short-term growth requirements

      6. Take into account the lead times for ordering and procuring hardware (which can be weeks, if not months)

      7. And a few others, which I won't bore you with here.

      The point is that the initial capital investment can run into millions even before the first customer starts using the service. And once the capital investment is made, it is very difficult to scale down the operation if plans change.
      Powerset Inc is a secretive search startup with ambitions of out-smarting Google on its own turf. Based in San Francisco, this startup is working on building a better search engine, using natural language processing to understand the user's question a little better before answering it. And just like any other search company's, its technology is a CPU-hungry beast just waiting to be unleashed. Powerset could have gone the way most dot-com companies have gone, but instead they decided to try Amazon's EC2 (Elastic Compute Cloud) and S3 (Simple Storage Service) to augment their computational needs.

      Powerset is reported to be testing a 400-instance EC2 cluster running Hadoop Map/Reduce and HDFS (Hadoop Distributed File System). This gets a little tricky on EC2 because of the absence of persistent storage (the OS is re-initialized on every reboot). So they use a combination of HDFS and a remote copying process to sync the data to their local network. Since there is no charge to move data from EC2 to S3, they have been thinking about implementing a native HDFS-S3 interface to move data around within Amazon's network itself.

      EC2 is charged per instance per hour, which means Powerset can bring new nodes online during heavy demand and shut off unused nodes at the flick of a switch at night. The Powerset guys also built their own EC2 image, configured to automatically join the HDFS cluster after every boot. In the event of a node failure, Hadoop takes care of data replication, and EC2 takes care of replacing the failed node with a new one.

      Amazon EC2 costs 10 cents an hour per instance. If you have to run a 400-node cluster for a month, that's only about $30,000. Based on performance benchmarks, it looks like the actual CPU throughput of each EC2 instance is roughly equivalent to a 1GHz PIII. $72 a month for that kind of server is not too cheap, but just like leasing a car, at least you don't have to pay upfront and manage it.

      So let's do the math. A regular AMD 64-bit, dual-core, 2-CPU server with about 8GB of RAM costs about $10,000, which excludes the cost of hosting, power, cooling and maintenance. Based on some comments on the Amazon forum, this is about 2 to 3 times faster than EC2's infrastructure. If you had to replace the CPU power of this hardware with 8 to 12 server instances on EC2, you would spend about $700 to $800 a month. It would take a company using EC2 close to 12 months before it had paid $10,000 towards EC2 computational services for the same amount of computational power. And remember that the $10,000 didn't take into account colocation, power, cooling and general server administration, which can be significant as well. Also remember that those 12 months are 12 months of actual computational usage... which could stretch over a period of 2 to 4 years depending on how often the instances are used.
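The break-even arithmetic from the paragraph above, using its own numbers (a $10,000 server vs. 8-12 small instances at $0.10/hour):

```python
def ec2_monthly_cost(instances, hourly_rate=0.10, hours=720):
    """Monthly EC2 bill for a fleet of always-on instances (720h ~ 1 month)."""
    return instances * hourly_rate * hours

def breakeven_months(hardware_cost, instances, hourly_rate=0.10):
    """Months of full-time EC2 usage before spend matches the hardware price."""
    return hardware_cost / ec2_monthly_cost(instances, hourly_rate)

print(ec2_monthly_cost(10))                 # 720.0 -> ~$720/month
print(round(breakeven_months(10000, 12)))   # ~12 months at 12 instances
```

And since those are months of actual usage, instances you switch off stretch the break-even point out even further.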

      However, I also have to point out that there are a few things to look out for. The maximum physical memory available is about 1.7GB, which is relatively tiny if you are used to 8 to 16GB of RAM on 64-bit hardware. And though CPU/memory might scale horizontally for some applications, cross-server communication can be extremely expensive for others. Unless your application is designed to scale horizontally within 1.7GB of RAM per node, I would seriously advise against using EC2 until you figure out how to change that.

      I've blogged about both S3 and EC2 before, and they continue to fascinate me. The success of companies like, and the decision of companies like Powerset to use AWS, is something I'll be watching closely over time.

      Links about Hadoop, and how to use it on Amazon EC2/S3

      Other References about Powerset and Amazon

      September 10, 2007

      P2P network scalability

      YouTube is said to be pushing about 25 petabytes per month, which is about 77 Gbps sustained on average. Bandwidth usage at the peaks would be even higher. Thanks to Limelight Networks, YouTube doesn't really need to scale or provision for that kind of bandwidth, and based on some reports from 2006, it cost them close to $4 million a month back then. YouTube and services like it have to invest a lot in their infrastructure before they can really launch, and though using a shared content delivery network is not ideal, it's probably not a bad deal. In YouTube's case, it helped them survive until Google bought them out.
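A quick sanity check of the 25 PB/month figure (using decimal petabytes and a 30-day month):

```python
def pb_per_month_to_gbps(petabytes, days=30):
    """Convert a monthly transfer volume to a sustained bit rate."""
    bits = petabytes * 1e15 * 8          # decimal petabytes -> bits
    seconds = days * 24 * 3600
    return bits / seconds / 1e9          # -> gigabits per second

print(round(pb_per_month_to_gbps(25)))   # ~77 Gbps sustained
```

Peaks would sit well above that average, which is why provisioning for this in-house is such a big upfront commitment.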

      Newer Internet television service providers, however, need not build their services around the traditional CDN model. The Joost network architecture presentation from Colm MacCarthaigh is an interesting example. Joost was founded by the same guys who founded Kazaa and Skype. Kazaa was one of the notorious P2P file sharing applications (it used the FastTrack protocol) which died after the RIAA revolt. Skype, as it happens, also has its roots in P2P networking [ Skype protocol , Skype scalability problems ] and has been doing pretty well over the years. So it's no surprise that Joost chose the P2P model again to distribute part of the content to its users. Joost has a cluster of servers which serve as "original seeders" of all content, and it relies on the P2P network to distribute the popular content. The number of Joost servers, however, is not small, because Joost still has to address the "long tail" of requests for content that isn't popular.

      Two of the most important network optimization ground rules I noticed from the talks were that they decided against using firewalls or loadbalancers in their network. That's good, because firewalls and loadbalancers wouldn't have kept up with the bandwidth anyway. But even more impressive was that they designed the entire P2P application/network algorithm to intelligently find and peer with the nodes and supernodes closest to them. Joost tries to do this in two different ways. The first is using IP addresses (prefix awareness) as proximity sensors: two IPs which start with a similar set of numbers/octets will probably be in the same network. The second way to detect proximity is using network AS numbers, which works irrespective of what the IP addresses start with. [ Colm also mentions AS proximity detection below ]
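A toy illustration of the prefix-aware idea, assuming nothing about Joost's actual implementation: prefer the candidate peer whose IPv4 address shares the longest leading bit-prefix with ours.

```python
import ipaddress

def common_prefix_bits(ip_a, ip_b):
    """Number of leading bits two IPv4 addresses share."""
    diff = int(ipaddress.IPv4Address(ip_a)) ^ int(ipaddress.IPv4Address(ip_b))
    return 32 - diff.bit_length()

def closest_peer(my_ip, candidates):
    """Pick the candidate that is (probably) topologically nearest."""
    return max(candidates, key=lambda peer: common_prefix_bits(my_ip, peer))

peers = ["192.168.10.77", "10.0.0.1", "192.169.0.1"]
print(closest_peer("192.168.10.5", peers))  # picks 192.168.10.77
```

The AS-number approach needs live BGP data rather than simple bit arithmetic, which is why it works even when addresses share no common prefix.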

      A comment on a blog post by Colm himself:
      We have many gigs of transit, and are adding more. I'm not sure who claimed it's near HD quality, I like to think it's about NTSC, sometimes better, never quite PAL. We have some efforts in the code to save transit costs, there is very very basic prefix awareness, and we're adding AS-level awareness using live BGP data. I have looked at adding AS adjacency information, ie prefer AS-adjacent peers, but it's a lot of work and the US internet is relatively poorly mapped, so I don't think this will come soon.

      It's possible that Joost might still require CDNs to serve the long-tail content, but the work they have done to build the P2P infrastructure would not only save them a lot of money in the long run but would also allow them to easily scale to be larger than any of the current CDNs if they do get that big.

      Interestingly, companies like Microsoft are not sitting idle watching the world go by. Microsoft has been working on something called Avalanche, and I think they already have a prototype client out which you can download and try out yourself.

      Microsoft Secure Content Downloader

      Some MSCD clients may be connected to each other via peer connections, forming a ‘cloud’ of clients. Pieces of the file you are downloading are sent through these peer connections between clients, as well as through connections with the file server. As a member of the cloud, your computer both serves as a client and server to other members of the cloud. Data destined for the cloud may be routed through your computer and sent to other cloud members. The other cloud members connected to you will be able to access only pieces of the file you are downloading via MSCD – they have no access to any other data on your computer.

      You are only connected to other clients while you are downloading a file via MSCD. When the file has finished downloading – or when you pause or cancel the download, or exit the application – you disconnect from the cloud. Once you disconnect from the cloud, you will no longer have any connections to any other members in the cloud and no data will be routed through your computer. The Microsoft Secure Content Downloader (MSCD) is a peer-assisted download manager capable of securely downloading specific files. MSCD is intended for consumers who are downloading from a home PC, or business users whose computers are not behind a corporate firewall. If you use MSCD from behind a corporate firewall, you may be unable to download content, and may adversely affect other clients' ability to download content.

      Of course, there are also rumors that Apple is trying this out... but you know how these things go.

      Anyway, the point is that in spite of occasional glitches, P2P is probably the way to go if you want to cut the long-term costs of a CDN. Personally, I believe that Skype had no other way out. I mean, can you imagine all the phone calls in the world going through the same first phone exchange in New Haven, Connecticut, where it all started? P2P models are still evolving and it's hard to imagine there will be a one-solution-fits-all. But if you know of one, please let me know.

      September 09, 2007

      Sharding: Different from Partitioning and Federation ?

      I've been hearing the word "sharding" more and more often, and it's spreading like fire. Theo Schlossnagle, the author of "Scalable Internet Architectures", argues that federation is a form of partitioning, and that sharding is nothing but a form of partitioning and federation. In fact, according to him, sharding has already been in use for a long time.

      I'm not a dba, and I don't pretend to be one in my free time either, so to understand the differences I did some research and found some interesting posts.

      The first time I heard about "Sharding" was on Been Admininig's blog about Unorthodox approach to database design (Part I and Part II). Here is the exact reference...
      Splitting up the user data so that User A exists on one server while User B exists on another server, each server now holds a shard of the data in this federated model.

      A couple of months ago Todd picked it up and made it sound (probably unintentionally) like sharding is actually different from federation and partitioning. Todd's post also points at Flickr using sharding. The search for Flickr's architecture led me to Colin Charles' post about Federation at Flickr: A tour of the Flickr architecture, where he does mention shards as a component of the federation key. Again, no mention of sharding being anything new.
      Federation Key Components:

      • Shards: My data gets stored on my shard, but when making a comment on someone else's blog, the record of performing that action is stored on your shard.

      • Global Ring: It's like DNS; you need to know where to go and who controls where you go. On every page view, calculate where your data is at that moment in time.

      • PHP logic to connect to the shards and keep the data consistent (10 lines of code with comments!)

      Based on the discussions on these and other blogs, "shards" sounds more like a term used to describe the fragments of data which are federated across multiple databases, rather than an architecture by itself. I think Theo Schlossnagle has a valid argument. If any of you disagree, I'm interested to hear what you have to say. A clearer distinction between sharding and federation would be very helpful as well.
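To make the terminology concrete, here is a minimal sketch of data federated into shards. The modulo scheme is purely illustrative: Flickr's "global ring" is a real lookup service that can rebalance users, not a fixed formula.

```python
NUM_SHARDS = 4  # each shard would be its own database server in practice

def shard_for(user_id):
    """Map a user to the shard holding their data (toy modulo scheme)."""
    return user_id % NUM_SHARDS

# All of user 101's data lives on shard 1; user 310's lives on shard 2.
placement = {user: shard_for(user) for user in (101, 205, 310)}
print(placement)
```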

      Here are more references to Shard/Sharding.

          September 07, 2007

          Adventures of scaling

          Patrick Lenz, founder and lead developer of , was also responsible for the relaunch of another website which recently moved from PHP to Ruby. The site serves about 1.2 million dynamic pages a day. He wrote a series of articles describing how they redesigned the site to scale for growth. I found these articles very informative, with an extremely mature discussion of the colorful world of scalability. Here is the final summary of what they ended up with after 4 months of optimizations:

          Systems optimization

          Code optimization

          September 06, 2007

          Scalability stories from Sept 6th 2007

          1. A 55-minute talk by Stewart Smith from MySQL AB about MySQL Cluster. He talks about the NDB storage engine and synchronous replication between storage nodes, and also covers new features in 5.1, including cluster-to-cluster replication, disk-based data and a bunch of other things. There is also another MySQL talk at Google on performance tuning best practices for MySQL.

          2. An interesting talk with Leah Culver about how Pownce was created. They use a LAMP (Python) stack with Perlbal, Memcached, Django and AIR, with Amazon S3 as the backend storage.

          3. Discussion about  High-Availability Mongrel Packs using Seesaw

          4. A blog about the "Future of Data Center Computing" mentions Terracotta Sessions. If you read my previous post about "Session, state and scalability" and understood the problem, do look at this as a solution as well.

          5. EC2 and S3 are being used more than ever. Unfortunately, because storage on EC2 doesn't persist across reboots, creative ways of keeping the data alive have to be found. Here is a talk about redundant MySQL replication using EC2 and S3.

          6. Found an interesting post by IBM engineers on how to setup a web cluster in 5 easy steps

          Blogged with Flock

          September 01, 2007

          Session, state and scalability

          In my other life I work on a medium-scale web application which has had many different kinds of growing pains over time. One of the most painful is the issue of "statelessness". If I could give only one recommendation to anyone building a brand new web application, I'd say "go stateless". But going stateless is not the same as going session-less. One could implement a perfectly stateless web architecture which still uses sessions to authenticate, authorize and track user activity. And to complicate matters further, when I say stateless, I really mean that the server should be stateless, not the client.

          Basic authentication
          Most interactive web applications today which allow you to manipulate data in some form require authentication and authorization mechanisms to control access. In the good old days, "Basic Authentication" was the most commonly used authentication mechanism. Once authenticated, the browser would send the credentials to the server with every subsequent HTTP request. Over time, cookie and URL based session tracking mechanisms took over, which is what most of our web based applications use today. In a one-server farm, you probably don't need to pay attention to how these authentication, authorization or tracking mechanisms work. But once you have multiple servers in the farm, you do need to verify that a client authenticated to one of the servers is considered authenticated by all of them. Since, in "Basic Authentication", the browser sends the authentication credentials with every request, it doesn't matter which server your session's requests go to. Each request is treated as a new session and credentials are verified every single time.

          Transition to sessions
          As traffic and security concerns grew, some of the smarter web applications switched to using "sessionids" in cookies and URLs. In this design, the server issues a sessionid (like a train/plane ticket) for every successful authentication and asks the client to send the sessionid, instead of the authentication credentials, with every subsequent request. This not only reduces the chances of someone sniffing your credentials, it also reduces the work done on the server, because all the server has to do now is check whether it issued the sessionid in the last few minutes/hours (and whether the session is still active), instead of validating the user against all the users in the database for every single request.
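A minimal sketch of that sessionid scheme; the in-memory store, names and TTL are all illustrative:

```python
import secrets
import time

SESSION_TTL = 1800   # seconds a sessionid stays valid
sessions = {}        # sessionid -> (username, issued_at); per-server store

def login(username, password):
    """Verify credentials once (DB check elided) and issue a sessionid."""
    sid = secrets.token_hex(16)
    sessions[sid] = (username, time.time())
    return sid

def authenticate(sid):
    """Validate a request by its sessionid: a cheap lookup, no DB hit."""
    entry = sessions.get(sid)
    if entry and time.time() - entry[1] < SESSION_TTL:
        return entry[0]
    return None

sid = login("alice", "s3cret")
assert authenticate(sid) == "alice"   # subsequent requests send only sid
assert authenticate("forged-id") is None
```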

          Session replication
          To understand which HTTP requests are part of which session, the server has to issue a unique HTTP session identifier for each user session. As it happens, most web application servers also maintain an internal "sessionid" for each session, so inserting this sessionid in the cookie/URL seemed the most natural thing to do. Unfortunately, if you have done any web programming, you will notice that unless you do something special, these default sessionids are local to each application server. Without some special magic, like setting up the application servers in clustering mode or persisting session information on a shared resource (database/NFS), it is difficult for one server to know which sessions are active on the other application servers. This is where the scalability problem comes in. Replicating session information between 2 nodes might be simple, and 10 nodes might be possible... but this architecture is not very scalable without a little extra work.

          Anatomy of a session
          "Sessions", have three primary purposes.

          1. It helps group together HTTP requests coming from same browser.

          2. It can cache authentication credentials and key user info after user authenticates.

          3. It allows caching of other information in session which might be temporary in nature.

          If you have ever tried to create a new user account on a website, chances are you went through a couple of pages before the registration was completed. Most web applications keep the information from the intermediate pages in temporary session variables within the application server. Developers love to use temporary session variables for such information, and this approach of storing temporary information can grow into a beast which cannot be replicated or shared using a shared resource. To design a scalable website, not only would one have to persist the sessionid and authentication credentials onto some kind of shared resource, one would also have to figure out what to do with these temporary session variables.

          Since a session is created only once, and since authentication happens only once too, it should be possible to keep that kind of information in a central database or a shared network cache like memcached. The temporary session variables, on the other hand, can be very noisy (they change frequently) and may not be critical enough to be made available to the other servers. One needs to find a balance between good performance and reliability. I know of organizations which have decided to accept some level of data loss in temporary session variables to improve overall performance.

          But the key thing to remember here is that every time a session is created or authenticated, it should be copied over to some shared, scalable resource. It could be something like a cluster of memcached servers, or a cluster of replicated database servers. This guarantees that if the user does switch application servers, the requests can be authenticated by checking this shared resource before continuing with what they were doing.
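Sketched in code, with a plain dict standing in for the memcached/database cluster (all names here are illustrative):

```python
shared_store = {}   # stands in for memcached / replicated DB: visible farm-wide
local_temp = {}     # per-server temporary session variables: allowed to be lossy

def create_session(sid, user):
    shared_store[sid] = user   # critical: must survive a server switch
    local_temp[sid] = {}       # noisy scratch space, kept on this server only

def authenticate_anywhere(sid):
    """Any server in the farm can validate the session from the shared store."""
    return shared_store.get(sid)

create_session("abc123", "alice")
local_temp["abc123"]["signup_step"] = 2    # lost on failover, by design
assert authenticate_anywhere("abc123") == "alice"
```

The split mirrors the trade-off above: the cheap-to-replicate identity data goes to the shared resource, while the noisy temporary variables stay local and are accepted as lossy.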

          Session vs State
          The information about where the user is or was within an application at any given moment is called the "state". If that information is kept only on the server and is not replicated, you will probably see errors of some kind when a server fails or a user is switched to another server. If the application server does persist the sessionid and user credentials in a central database, then, as far as the session is concerned, it doesn't matter which server the user goes to. Most new web applications today try to maintain the state information within the browser itself, which frees the server from the responsibility of storing, maintaining and replicating state information across all the servers.

          Final thoughts

          Loadbalancers: If you are doing what I did, you'd be calculating the probability of a user switching servers mid-session. In our environment we noticed about one in 50 servers fail every month. Such failures force all sessions created on that server over to another server. Occasionally we also see issues with the loadbalancer, which can reset all sticky sessions across all servers. This isn't that bad. But if you are running a CPU-intensive web application, session stickiness can at times make some servers more loaded than others, and if your traffic grows enough to need multiple loadbalancers, or you have multiple global sites running in active-active mode, you will eventually need the ability to handle server switching/failovers.

          REST (Representational State Transfer): The REST architecture does recommend some of the ideas I addressed above, but in its strictest interpretation one is not allowed to create cookies. In my personal opinion, a cookieless world, though ideal, is not pragmatic.

          August 29, 2007

          Scalability Stories for Aug 30

          1. I found a very interesting story on how memcached was created. It's an old story titled "Distributed caching with memcached". I also found an interesting FAQ on memcached which some of you might like.

          2. Inside Myspace is another old story which follows Myspace's growth over time. It's a very long and interesting read which shouldn't be ignored.

          3. Measuring Scalability tries to put numbers to the problem of scalability. If you have to justify the cost of scalability to anyone in your organization, you should at least skim through this page.

          4. I found a wonderful story on the humble architecture of Mailinator and how it grew over time on just one webserver. It receives approximately 5 million emails a day and runs the whole operation pretty much in memory, with no logs or database to leave traces behind. And here is another page from the creator of Mailinator about its stats from February.

          5. Finally, another very interesting presentation/slide on the topic of "Scalable web architecture", which focuses primarily on the LAMP architecture.

          August 28, 2007

          Thoughts on scalability

          Here is an interesting contribution on the topic from Preston Elder.

          I've worked in multiple extremely super-scaled applications (including ones sustaining 70,000 connections at any one time, 10,000 new connections each minute, and 15,000 concurrent throttled file transfers at any one time - all in one application instance on one machine).

          The biggest problem I have seen is people don't know how to properly define their thread's purpose and requirements, and don't know how to decouple tasks that have in-built latency or avoid thread blocking (and locking).

          For example, often in a high-performance network app, you will have some kind of multiplexor (or more than one) for your connections, so you don't have a thread per connection. But people often make the mistake of doing too much in the multiplexor's thread. The multiplexor should ideally only exist to be able to pull data off the socket, chop it up into packets that make sense, and hand it off to some kind of thread pool to do actual processing. Anything more and your multiplexor can't get back to retrieving the next bit of data fast enough.

          Similarly, when moving data from a multiplexor to a thread pool, you should be a) moving in bulk (lock the queue once, not once per message), AND you should be using the Least Loaded pattern - where each thread in the pool has its OWN queue, and you move the entire batch of messages to the thread that is least loaded, and next time the multiplexor has another batch, it will move it to a different thread because IT is least loaded. Assuming your processing takes longer than the data takes to be split into packets (IT SHOULD!), then all your threads will still be busy, but there will be no lock contention between them, and occasional lock contention ONCE when they get a new batch of messages to process.

          Finally, decouple your I/O-bound processes. Make your I/O bound things (eg. reporting via socket back to some kind of stats/reporting system) happen in their own thread if they are allowed to block. And make sure your worker threads aren't waiting to give the I/O bound thread data - in this case, a similar pattern to the above in reverse works well - where each thread PUSHING to the I/O bound thread has its own queue, and your I/O bound thread has its own queue, and when it is empty, it just collects the swaps from all the worker queues (or just the next one in a round-robin fashion), so the workers can put data onto those queues at their leisure again, without lock contention with each other.

          Never underestimate the value of your memory - if you are doing something like reporting to a stats/reporting server via socket, you should implement some kind of Store and Forward system. This is both for integrity (if your app crashes, you still have the data to send), and so you don't blow your memory. This is also true if you are doing SQL inserts to an off-system database server - spool it out to local disk (local solid-state is even better!) and then just have a thread continually reading from disk and doing the inserts - in a thread not touched by anything else. And make sure your SAF uses *CYCLING FILES* that cycle on max size AND time - you don't want to keep appending to a file that can never be erased - and preferably, make that file a memory mapped file. Similarly, when sending data to your end-users, make sure you can overflow the data to disk so you don't have 3mb data sitting in memory for a single client, who happens to be too slow to take it fast enough.

          And last thing, make sure you have architected things in a way that you can simply start up a new instance on another machine, and both machines can work IN TANDEM, allowing you to just throw hardware at the problem once you reach your hardware's limit. I've personally scaled up an app from about 20 machines to over 650 by ensuring the collector could handle multiple collections - and even making sure I could run multiple collectors side-by-side for when the data is too much for one collector to crunch.

          I don't know of any papers on this, but this is my experience writing extremely high performance network apps


          [ This post was originally written by Preston on slashdot. Reproduced here with his permission. If you have an interesting comment or a writeup you want to share on this blog, please forward it to me or submit a comment to this post ]
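A minimal single-threaded sketch of the least-loaded hand-off Preston describes: each worker owns its own queue, and the multiplexor moves whole batches to whichever worker has the least work queued. Names are illustrative, and a real implementation would have worker threads draining these queues.

```python
import queue

class WorkerPool:
    def __init__(self, size):
        # One queue per worker, so workers never contend with each other.
        self.queues = [queue.Queue() for _ in range(size)]

    def dispatch_batch(self, batch):
        """Move a whole batch (one lock hit, not one per message) to the
        least-loaded worker; returns the chosen worker's index."""
        idx = min(range(len(self.queues)), key=lambda i: self.queues[i].qsize())
        self.queues[idx].put(batch)
        return idx

pool = WorkerPool(3)
pool.queues[0].put(["earlier batch"])       # worker 0 is already busy
print(pool.dispatch_batch(["m1", "m2"]))    # goes to an idle worker instead
```

Because the batch is enqueued as a single item, the only lock contention is the one `put` per batch, matching the "lock the queue once, not once per message" advice above.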


          August 27, 2007

          Loadbalancer for horizontal web scaling: What questions to ask before implementing one.

          A single server, today, can handle an amazing amount of traffic. But sooner or later most organizations figure out that they need more and talk about choosing between horizontal and vertical scaling. If you work for such an organization and also happen to manage networking devices, you might find a couple of loadbalancers on your desk one day along with a yellow sticky note with a deadline on it.

          Loadbalancers, by definition, are supposed to solve performance bottlenecks by distributing or balancing load between the different components they manage. Though you would normally find loadbalancers in front of a webserver, a lot of people have found other interesting uses for them. For example, I know some organizations have so much network traffic that they can't use just one sniffer or firewall to do the job. They ended up using some high-end loadbalancers to intelligently loadbalance network traffic through multiple sniffers/firewalls attached to them. These setups are rare, so don't worry about them.

          But in general, if you are having a web scalability issue, you would probably start looking at hardware solutions first. And if you do, here are my recommendations on the questions you need to ask before you investigate how to set one up.

          • Is the loadbalancer really needed ?

            • Identify the bottlenecks in the system. Adding webservers/appservers will not solve the problem if the database is the bottleneck.

              • BTW, if you can replicate a read-only database to multiple servers, you might be able to use a loadbalancer to balance that traffic...

            • If most of the traffic is due to static content, investigate apache's ability to set better cacheable headers. Or use a CDN (content distribution network) like Akamai/Limelight to offload cacheable objects.

            • DNS and SMTP are examples of protocols which are designed to be automatically loadbalanced. If an SMTP server fails to respond, the DNS protocol helps an SMTP client find alternative SMTP servers for the destination. If your organization has control over the protocols it uses, it should investigate the possibility of using DNS as the loadbalancing mechanism.

          • Is it a software or a hardware loadbalancer?

            • Most people can't think of a loadbalancer as anything but dedicated hardware. Interestingly, chances are you are already using software loadbalancers in your network without knowing it (like the DNS loadbalancing I mentioned above). I've seen a decline in commercial software web loadbalancers in recent years; they have been replaced by open source components like the mod_jk loadbalancer, Perlbal, etc.

            • In this particular writeup, I'm going to focus on hardware loadbalancer to keep it short.

          • Do you need failover ?

            • If you need loadbalancer failover, you should be buying in pairs

            • Some loadbalancers work in active-active mode, where both loadbalancers are functional, while others allow only active-passive mode.

            • The term "failover" can mean many things. Make you you ask the right questions to understand what a loadbalancer really offers.

              • It could mean TCP session failover where every single TCP connection will be failed over.

              • It could also mean HTTP session failover (where one session is defined by one unique cookie). If a loadbalancer supports only this mode, every user connected to the loadbalancer will notice a blip when the primary loadbalancer dies. More often than not, this is what a loadbalancer vendor actually provides. And unfortunately, not everyone in pre-sales is aware of this "minor" detail.

          • How do you want to do configuration management ?

            • Not all brands are made equal. Some are easy to manage and others aren't. I personally prefer a CLI over a GUI any day.

            • But more than the CLI/GUI, I want the ability to revision-control and compare different versions of configuration files. The brand we currently use at work, unfortunately, doesn't provide an easy way to do this. If you are supporting a big operation with multiple loadbalancers and have to replicate the same or a similar setup in multiple places, then this is something you shouldn't compromise on.

          • How many servers are we talking about ?

            • If your loadbalancer device has ports on it to attach the servers to it physically, make sure you don't have more servers than ports.

            • In most cases, however, you can add a 100BaseT switch and attach more servers to that. But regardless, having an approximate number of servers will help you answer some of the other questions later.

          • Will all of these server be part of the same farm/cluster ?

            • Some organizations set up different websites on the same loadbalancer, configured so that the loadbalancer distributes load to different farms/clusters of servers depending on which domain name the client requested.

            • Some loadbalancers can also inspect the URL and send requests with different paths (in the same domain) to different server farms. For example, "/images" could go to one farm and "/signup" could go to another.

            • There are still others who might keep a set of servers in standby mode (not active), waiting to be brought up automatically in the event of problems with the primary clusters.

          • Will all of the servers have the same weights when loadbalancing ?

            • For example are there some servers which should get more traffic than others because they are faster ?

          • Is session stickiness important ?

            • If a user ends up on a particular webserver, is there any reason why that user should continue to stay on that webserver ? There are several different kinds of session stickiness I can think of.

              • The most popular kind is probably IP-based stickiness, where traffic from the same IP always goes to the same webserver. This was a great way of loadbalancing until companies like AOL decided to loadbalance outgoing traffic across different proxy servers, effectively changing the source IP address of traffic coming from the same client.

              • My favorite session stickiness mechanism is cookies. Cookies can be used as session tracking IDs to associate a user session with a particular webserver. There are many ways of implementing this, of which these are a few interesting ones I've used:

                • Allow the loadbalancer to set its own tracking cookie for each session which doesn't already have one.

                • Most web application platforms like PHP and Java use cookie names like "PHPSESSID" or "JSESSIONID", which make excellent session identifiers for a loadbalancer to track.

                • There are a few other interesting cookie options... but I'd rather not discuss it here at this moment.

            • If you really need session stickiness, you should investigate further whether your application is really horizontally scalable. Most often, the sticky session feature is used as a bandaid to temporarily give the impression of a scalable web app design, which could, in the future, prove disastrous.

            • Session stickiness also comes with some other configuration baggage.

              • You need to decide how long an idle session should be considered active before it is shut down. If you have a lot of traffic and you set this session timeout too high, it can quickly fill up your memory.

              • If possible, set the timeout as close as possible to the application timeout on your app server.

          • Does the traffic have to be encrypted using SSL ?

            • Some loadbalancers have built in SSL engines.

            • Others have capabilities to offload them to SSL accelerators.

            • One could also set it up so that traffic is decrypted on the apache server after the loadbalancing is done. Please be cautious here: if you decide to do SSL decryption on the webserver, you are effectively preventing the loadbalancer from inspecting the HTTP traffic, which it could otherwise use to make intelligent routing decisions.

          • Do you need compression ?

            • Some loadbalancers which come with SSL engines support on-the-fly compression, which can significantly speed up the user experience if you serve a lot of compressible objects.

          • Debugging

            • Whether you like it or not, one day you will have a problem with your loadbalancer, and its support team will ask you to capture some sniffs for them. This is usually a painful process, especially if the equipment is in a production network. One feature which can simplify this is allowing the device to sniff on itself. This is not a must-have, but probably a like-to-have feature.

          • Other services

            • Checklist for other services... if you want

              • NTP - Have its time always be in sync with the rest of your network

              • Syslog - Ability to send syslog messages.

              • Mail - Ability to send mails when problems happen

              • SNMP - Monitoring and traps

          • External factors

            • A standard-sized company investing in loadbalancers usually doesn't buy a single loadbalancer. It usually buys a pair for the production network, and another pair for QA or staging networks. By the time you add the support costs, it gets very expensive. If you make an investment like that, make sure the company selling it to you will still be alive a couple of years down the line to support you.

            • Loadbalancers are not as reliable as phones. Finding bugs in a loadbalancer is much easier than you think. If you don't have a reliable support team that is willing to help you patch the code fast enough, you may see some downtime or performance issues.

            • At the same time, if the company releases a lot of patches on a weekly or monthly basis, you should find out how stable its code really is. I usually ask how long it has been since the last stable release.
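The checklist above recommends matching the health check timeout to the app server's own timeout. A minimal sketch of such a check, assuming a plain HTTP health URL and a 5-second application timeout (both values hypothetical):

```python
import socket
import urllib.error
import urllib.request

# Hypothetical value: match this to your app server's own request timeout,
# so the loadbalancer gives up on a backend at roughly the same moment
# real clients would.
APP_TIMEOUT_SECONDS = 5

def backend_is_healthy(url: str) -> bool:
    """Return True if the backend answers its health URL in time with a 200."""
    try:
        with urllib.request.urlopen(url, timeout=APP_TIMEOUT_SECONDS) as resp:
            return resp.status == 200
    except (urllib.error.URLError, socket.timeout, OSError):
        return False
```

A real loadbalancer would run this sort of probe on an interval and mark the backend down after a few consecutive failures.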

          August 25, 2007

          Feature or a bug?

          Dratz asks: Feature or a bug?

          TypePad architecture: Problems and solutions

          TypePad was, and probably still is, one of the first and largest paid blogging services in the world. In a presentation at OSCON 2007, Lisa Phillips and Garth Webb spoke about TypePad's problems in 2005. Since this is a common problem for any successful company, I found it interesting enough to research a little more.

          TypePad was, like any other service, initially designed in the traditional way: Linux, Postgres, Apache, mod_perl and Perl on the front end, and NFS storage for images on a filer. At that time they were pushing close to 250 Mbps (4 TB per day) through multiple pipes, and with a growing user base, activity and data, they were growing at a rate of 10 to 20% per month.

          Just before the planned move to a newer, better data center, sometime in October 2005, TypePad started experiencing all kinds of problems due to its unexpected growth. The unprecedented stress on the system caused multiple failures over the next two months, ranging from hardware, software and storage to networking issues. At times these made the reading or publishing services completely unavailable; they also caused sporadic performance issues with statistics calculations.

          One of the most visible failures came in December 2005 when, during routine maintenance, in the middle of adding redundant storage, something caused the complete storage cluster to go offline, which took down the entire bank of webservers serving the pages. Because the backend database lived on a separate storage cluster, it wasn't directly affected by the outage.

          It's at times like these that most companies fail to communicate with their users. Six Apart, fortunately, understood this early and did its job well.

          Today TypePad's architecture is similar to that of LiveJournal, with users distributed over multiple master-master MySQL replication pairs. They have partitioned the database by UserID and maintain a global database to map UserIDs to partitions. They use MySQL 5.0 with InnoDB, and Linux Heartbeat for HA.
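The UserID-to-partition mapping can be pictured with a tiny sketch. The table contents and shard names below are made up, and TypePad's real global map is of course a MySQL table rather than a dict, but the lookup pattern is the same:

```python
# Hypothetical "global database": maps each UserID to the partition
# (master-master MySQL pair) that holds all of that user's data.
GLOBAL_MAP = {101: "db-shard-1", 102: "db-shard-2", 103: "db-shard-1"}

def shard_for_user(user_id: int) -> str:
    """Look up which partition owns this user's rows; every query starts here."""
    try:
        return GLOBAL_MAP[user_id]
    except KeyError:
        raise LookupError(f"user {user_id} is not assigned to a shard")
```

Keeping the map in a separate global database means users can be moved between shards by updating one row, at the cost of an extra lookup (which is itself an obvious candidate for caching).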

          For images, though, they decided to switch from NFS storage to Perlbal (a Perl-based reverse proxy load balancer and web server) plus MogileFS (an open source distributed file system), which scales much better with lower overhead on commodity hardware. The image on the right shows how TypePad served images during the transition from NFS to MogileFS. Follow the numbered arrows to see how requests flow through the network. For an image stored on MogileFS (mogstored), the app server first talks to the MogileFS database through mod_perl2 (steps 3, 4). It then sends a Perlbal internal redirect (steps 5, 6, 7) to the actual image resource, which is located on mogstored (steps 8, 9).
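The internal-redirect step can be sketched as follows. Perlbal really does use an X-REPROXY-URL response header for this, but the function, dictionary and URLs below are made-up stand-ins for the MogileFS tracker lookup:

```python
# The app server never streams the image itself: it looks up where the
# file lives (steps 3-4) and hands Perlbal the real location via the
# X-REPROXY-URL header (steps 5-7); Perlbal then fetches the bytes from
# mogstored directly (steps 8-9), freeing the heavy app process early.
def app_response_for_image(image_key, tracker):
    storage_urls = tracker[image_key]  # ask the MogileFS tracker/database
    return {"status": 200, "headers": {"X-REPROXY-URL": " ".join(storage_urls)}}

tracker = {"photo-42.jpg": ["http://mogstored-1:7500/dev1/0/000/042.fid"]}
resp = app_response_for_image("photo-42.jpg", tracker)
```

The win is that the expensive mod_perl processes only do the cheap lookup, while the lightweight Perlbal processes handle the slow business of pushing image bytes to clients.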

          Since most of the activity on the blogs is read-only, it made sense to add memcached early in the process to ease the load on a lot of components.

          memcached is a high-performance, distributed memory object caching system, generic in nature, but intended for use in speeding up dynamic web applications by alleviating database load.
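The way memcached alleviates database load is the classic cache-aside pattern. A minimal sketch, with a plain dict standing in for a real memcached client so it runs without the library:

```python
cache = {}  # stand-in for a real memcached client

def expensive_db_read(key):
    # placeholder for a real database query
    return f"row-for-{key}"

def get(key):
    if key in cache:                    # cache hit: no database work at all
        return cache[key]
    value = expensive_db_read(key)      # cache miss: hit the database once...
    cache[key] = value                  # ...then populate the cache for next time
    return value
```

With a mostly read-only workload like blog pages, the hit rate is high, so the vast majority of reads never reach MySQL.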

          In another interesting approach to scalable architecture, they recognized that one of the most write-intensive operations was the commenting system, which led them to experiment with "The Schwartz". It gave them a queuing mechanism that could reliably delay write-intensive operations to the database, effectively allowing it to scale further.

          The Schwartz carries the tagline "a reliable job queue system" and was originally developed as a generic job processing system for Six Apart's hosted services. It is used in production today on TypePad, LiveJournal and Vox for managing tasks that the system can perform without user interaction.
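The delayed-write idea can be sketched like this. This is plain Python, not The Schwartz itself (which is a Perl library backed by a database table), but it shows the pattern: accept the comment immediately, let a worker apply the slow database write later:

```python
from collections import deque

job_queue = deque()  # stand-in for The Schwartz's durable job table

def enqueue_comment(post_id, text):
    """Fast path: record the intent to write, return to the user at once."""
    job_queue.append(("insert_comment", post_id, text))

def worker_drain(db):
    """Slow path: a background worker applies queued writes to the database."""
    while job_queue:
        op, post_id, text = job_queue.popleft()
        db.setdefault(post_id, []).append(text)  # the deferred write
```

Because the queue absorbs bursts, the database sees a smoothed write rate instead of spikes, which is exactly what you want for something as bursty as commenting.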


          August 20, 2007

          Web storage for backups

          I'm contemplating using S3 for backups. Paul Stamatiou has a script ready to go. The thing that convinced me was a chart Paul showed: for 10 GB of space he paid under 3 dollars per month. That's really cheap.
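The arithmetic is easy to check against S3's 2007-era price list ($0.15 per GB-month of storage and $0.10 per GB uploaded; current prices will differ):

```python
# S3 prices as of 2007 (check the current rate card before relying on these)
STORAGE_PER_GB_MONTH = 0.15
UPLOAD_PER_GB = 0.10

def monthly_cost(stored_gb: float, uploaded_gb: float) -> float:
    """Rough monthly S3 bill: storage plus inbound transfer."""
    return stored_gb * STORAGE_PER_GB_MONTH + uploaded_gb * UPLOAD_PER_GB

print(monthly_cost(10, 10))  # -> 2.5, comfortably under $3
```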

          GMail, Microsoft and Yahoo all provide extra storage as well. However, none of them has a stable, company-supported API that allows users to upload data this way.

          How Skype network handles scalability..

          There was a major Skype outage last week, and though there is an "official explanation" and other discussions about it floating around, I found this comment from one of the GigaOm readers more interesting to think about. This particular description may not accurately describe the problem (it might be speculation as well), but it does describe, in a few words, how Skype's p2p network scales out. You should also take a look at the detailed discussion of the Skype protocol here.
          Number of Skype Authentication servers:
          Count == 50; // Clustered
          Number of potential Skype clients:
          Count = 220,000,000 // Mostly decentralized
          Number of SuperNode clients to maintain network connectivity:
          Count = N / 300 at any one time.

          • If there are 3.0 million users online then the ratio is 3,000,000 / 300 = 10,000 == Supernodes available
          • Supernodes are bootstraps into the network for normal first run clients ("and handle routing of children calls").
          • Supernodes maintain the network overlay via a DHT ("Distributed Hash Table") "type" method. // This is normally very slow and done over UDP
          • If a client cannot find a Supernode, then regardless of authentication via the central server it is NOT allowed on the Skype network.

          Lack of Supernodes mean lack of network connectivity regardless of successful login via “central server”.
          You CAN be a Supernode but not have full network connectivity because you have only a portion of the “Distributed Index Data aka DHT”.
          MOST people that become Supernodes will bail out if they cannot keep a clear route ("aka calls bail out, client restarts and aborts Supernode status, thus booting its 300 - 500 Children and putting them into a "Connecting" mode").

          Children that are trying to “Connect” are unable to do anything unless they have a “Supernode” as a parent. // No calls, No IM….

          The overview of this is as follows:

          Skype introduced a flaw into the network that dealt with "routing" and broke the "decentralized data store aka DHT"; this in turn sent clients on a RANDOM search for Supernodes, which at that point had been well and truly booted off the network.

          In the End:
          It is a huge cycle: no matter how many bugs they "fix" in the "central servers", it will take many days for N nodes to become Supernodes again so they can route X data from peer A to peer B. This is NOT minor; without enough Supernodes for the centralized servers to relay data to, you get a very segregated network. Right now there are approximately 10,000 sub Skype networks instead of 1 single "in sync" network. When this data store (see DHT) is in sync globally, the Skype network will be STABLE again.

          I know this is very broad, but unless magically all of said nodes can recreate the "single overlay (DHT)", nothing will be in sync. You will see delayed messages, and delayed or incorrect profiles and presence.

          My take, in the end: give it 48 more hours and it may be semi-stable. But hey, this is what you get when you use end users as your own redundancy…
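          The supernode arithmetic quoted above is easy to verify; a one-liner confirming the 3-million-user example (the 1:300 ratio is the commenter's claim, not an official Skype figure):

```python
# Roughly 1 supernode per 300 online clients, per the comment above
def supernodes_available(clients_online: int, ratio: int = 300) -> int:
    return clients_online // ratio

print(supernodes_available(3_000_000))  # -> 10000
```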