Scaling Early: Feedjit
November 10th, 2007Mark Maunder from Feedjit make an interesting presentation about scaling early. He focuses on some of the key operational issues related to the web server and server caching which I found very interesting.
|
|
Mark Maunder from Feedjit make an interesting presentation about scaling early. He focuses on some of the key operational issues related to the web server and server caching which I found very interesting.
|
|
A short thought provoking post by Mark Callaghan about running Mysql over HDFS. Its probably not ideal, but its an interesting thought regardless.
Indexing 100 million blogs with over 10 billion objects, and with a user base which is doubling every six months, technorati has an edge over most
blog search engines. But they are much more than search, and any technorati user can explain you that. I recommend you read John Newton’s interview with David Sifry which I found fascinating. Here are the highlights from the interview if you don’t have time to read the whole thing
I wrote my first crawler in a few lines of perl code to spider a website recursively about 10 years ago. Two years ago I wrote another crawler in a few thousand lines using java+php and mysql. But this time I wasn’t really interested in competing with google, and instead crawled feeds (rss/atom). Google hadn’t released its blog search engine at that time. Using the little java knowledge I had and 3rd part packages like Rome and some HML parsers I hacked up my first crawler in a matter of days. The original website allowed users to train the bayesian based engine to teach it what kind of news items you like to read and automatically track millions of feeds to find the best content for you. After a few weeks of running that site, I started having renewed appreciation for the search engineers who breath this stuff day in and day out. That site eventually went down… mostly due to design flaws I made early which I’m listing here for those who love learning.
The reason why I’m not embarrassed, blogging about my mistakes, is because I’m not a developer to begin with. And the second reason is that I’m about to take a second shot at it to see if I can do it better this time. The objective is not to build another search engine, but to understand and learn from your mistakes and do it better.
The first phase of this learning experience resulted in what you see currently on blogofy.com. The initial prototype displayed here does limited crawling to gather feed updates. The feed update algorithm is already a little smarter (it updates some more often than others). The quality of the feeds should also be better because of human behind the engine who manually adds the feeds to the database. I also realized soon after my first attempt that I should have investigated JSON to make Java and PHP talk to each other. In the current version of blogofy the core engine is in PHP except the indexing/storage engine which will soon move to Solr(which is java based). PHP has JSON based hooks to talk to Solr which seems to work very well. Solr incase you didn’t know is a very fast lucene based search engine which does much better than Mysql for the kind of search operations I would be doing. And yes, I replaced Belkin router with an apple wireless… so lets see how that works out to be.
Will send more updates if I make progress. If any of you have ideas on what else I should be doing please let me know.
Finally it happened. EC2 is available for everybody. And more than that they now provide servers with 7.5GB and 15GB of RAM per instance. Sweet. 
For a lot of companies EC2 was not viable due to high memory requirements of some of the applications. Splitting up such tasks to use less memory on multiple servers was possible, but not really cost and time efficient. The release of new types of instances removes that road block and would probably invoke significant curiosity from memory crunching application developers.
$0.10 - Small Instance (Default)
1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform
$0.40 - Large Instance
7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform
$0.80 - Extra Large Instance
15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platform
Data Transfer
$0.10 per GB - all data transfer in
$0.18 per GB - first 10 TB / month data transfer out
$0.16 per GB - next 40 TB / month data transfer out
$0.13 per GB - data transfer out / month over 50 TB
[Blogofy: bringing feeds together ]
I took a week’s break from blogging to work on one of my long overdue personal projects. Even though I use Google Reader as my feed aggregator I noticed a lot of folks still prefer a visual UI to track news and feeds. The result of my experimentation of designing such a Visual UI to track feeds lead me to create Blogofy
If you have an interesting blog on Web Scalability, Availability or Performance which you want included here please let me know. The list of blogs on the page is in flux at the moment and I might move the feeds around a little depending on user feedback and blog activity.
Kosmix, a search startup has released source to C++ implementation of something which looks like a clustered file system. This looks very similar to Hadoop/HDFS, but the C++ factor will be a big performance boost.
From Skrenta blog
If anyone has experience with KFS, or has more information please leave a comment here.
When asked what they mean by scalability, a lot of people talk about improving performance, about implementing HA, or even talk about a particular technology or protocol. Unfortunately, scalability is none of that. Don’t get me wrong. You still need to know all about speed, performance, HA technology, application platform, network, etc. But that is not the definition of scalability.
Scalability, simply, is about doing what you do in a bigger way. Scaling a web application is all about allowing more people to use your application. If you can’t figure out how to improve performance while scaling out, its okay. And as long as you can scale to handle larger number of users its ok to have multiple single points of failures as well.
There are two key primary ways of scaling web applications which is in practice today.

Every component, whether its processors, servers, storage drives or load-balancers have some kind of management/operational overhead. When you try to scale that, its important to understand what percentage of the resource is actually usable. This measurement is called “scalability factor“. If you loose 5% of a processor power every time you add a CPU to your system, then your “scalability factor” is 0.95. A scalability factor of 0.9 means you will only be able to use 90% of the resource.
Scalability can be further sub-classified based on the “scalability factor”.
If you need scalability, urgently, going vertical is probably going to be the easiest (provided you have the bank balance to go with it). In most cases, without a line of code change, you might be able to drop in your application on a super-expensive 64 CPU server from Sun or HP and storage from EMC, Hitachi or Netapp and everything will be fine. For a while at least. Unfortunately Vertical scaling, gets more and more expensive as you grow.
Horizontal scalability, on the other hand doesn’t require you to buy more and more expensive servers. Its meant to be scaled using commodity storage and server solutions. But Horizontal scalability isn’t cheap either. The application has to be built ground up to run on multiple servers as a single application. Two interesting problems which most application in a horizontally scalable world have to worry about are “Split brain” and “hardware failure“.
While infinite horizontal linear scalability is difficult to achieve, infinite vertical scalability is impossible. If you are building capacity for a pre-determined number of users, it might be wise to investigate vertical scalability. But if you are building a web application which could be used by millions, going vertical could be an expensive mistake.
But scalability is not just about CPU (processing power). For a successful scalable web application, all layers have to scale in equally. Which includes the storage layer(Clustered file systems, s3,etc), the database layer (partitioning,federation), application layer(memcached,scaleout,terracota,tomcat clustering,etc), the web layer, loadbalancer , firewall, etc. For example if you don’t have a way to implement multiple load balancers to handle your future web traffic load, it doesn’t really matter how much money and effort you put into horizontal scalability of the web layer. Your traffic will be limited to only what your load balancer can push.
Choosing the right kind of scalability depends on how much you want to scale and spend. In fact if someone says there is a “one size fits all” solution, don’t believe them. And if someone starts a “scalability” discussion in the next party you attend, please do ask them what they mean by scalability first.
References
Smugmug.com, a 5 year old company with just 23 employees has 315000 paying customers and 195 million photographs. CEO & “Chief Geek” Don MacAskill has a nice set of slides where he talks about its 5 year journey during which it went from small startup to a profitable business. The talk was given during Amazon’s “Startup project” so it talks mostly about how it uses AWS (Amazon Web services).![]()
Other than wonderful fun loving employees who are also “super heroes” in his eyes, he talks about the how they are doubling the storage requirements on an yearly basis. They already have about 300 TB in use, and as of today all of that is on Amazon’s S3. Don estimates that based on his estimates, for the storage they are using, they are saving about 500K per year which is pretty big for a small operation like theirs.
The Smugmug architecture has evolved over time. Internally it can serve images in 3 different ways.
This flexibility allowed Smugmug try different ways of using internal and external/S3 storage. Interestingly even though they wanted to get away from self-managed internal storage, they noticed that S3 storage wasn’t very cheap when you take into account the bandwidth utilization. After various permutation and combinations, they managed to setup a system where they continued to use S3 for primary storage, but kept 10% of the hottest objects locally on Smugmug’s own servers to minimize bandwidth utilization on S3 service. This allowed them to get away with not buying 95% of the storage drives they were originally supposed to buy. 5% of 300TB is still 15TB, which is smaller but not small enough unfortunately.
Though storage management is easy with S3, it can at times make things difficult. Permission management was one of those things where they had to sacrifice speed/performance in favour of using “Proxy” mechanism which is more robust/reliable way of serving protected objects.
On reliability, Don mentioned that having multiple single points of failures is not really helpful if you want to provide near 100% availability. With S3 in picture, not only do they have to worry about connectivity from customer to Smugmug and Smugmug’s own servers, they also have to worry about connectivity to Amazon and availability of Amazon services round the clock. Hence they had to design their app ground up to handle failures gracefully. For example, write failures to S3 is handled by recording the change locally which is then synced asynchronously. At other times when things break, its designed to either try again, or alert the right folks so that action can be taken.
While talking about nice-to-have-features, he mentioned that S3 shouldn’t be confused with CDN. Its not a distributed caching service and it doesn’t have global locations like a CDN has. Regardless, he said, S3 probably should have limited Cache or Stream ability which will boost performance and add value to an already invaluable service.
On the topic of Smugmug’s future, Don mentioned they are flirting with the possibility of using EC2 in the near future. EC2 would probably be used as image processing computation nodes. Since the EC2 servers are located in the Amazon facility, it will save them bandwidth of transferring data from S3. And since they can turn off server instances on demand (and not pay for them while its offline) it will probably cut down their operating costs to maintain image processing servers as well.
At the end he mentioned a few services which he would like to see Amazon offer in future. Two important ones which he mentioned (and which I think are critical) are the absence of some kind of Database, and Loadbalanacer APIs.
PDF Slides of the same presentation here (34 slides in this one)