Showing posts from October, 2007

Scaling Technorati - 100 million blogs indexed every day

Indexing 100 million blogs with over 10 billion objects, and with a user base that doubles every six months, Technorati has an edge over most blog search engines. But it is much more than search, and any Technorati user can tell you that. I recommend you read John Newton's interview with David Sifry, which I found fascinating. Here are the highlights from the interview if you don't have time to read the whole thing.

Current status of Technorati
- 1 terabyte a day added to its content storage
- 100 million blogs
- 10 billion objects
- 0.5 billion photos and videos
- Data doubling every six months
- Users doubling every six months

- The first version was supposed to be for tracking temporal information on a low budget. That version put everything in a relational database, which was fine since the index sizes were smaller than physical memory.
- It worked fine till about 20 million blogs.
- The next generation took advantage of parallelism. Data was broken up into shard
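The excerpt cuts off before saying how the data was split, so the following is only a generic sketch of hash-based sharding, not Technorati's actual scheme. The shard count, the choice of the blog URL as the shard key, and the names `shard_for` and `partition` are all assumptions made for illustration.

```python
import hashlib

NUM_SHARDS = 8  # hypothetical shard count, not a figure from the interview


def shard_for(blog_url, num_shards=NUM_SHARDS):
    """Map a blog URL to a shard index in [0, num_shards).

    Uses a stable hash (MD5 here, chosen only for determinism across
    processes) so the same URL always lands on the same shard.
    """
    digest = hashlib.md5(blog_url.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(blog_urls, num_shards=NUM_SHARDS):
    """Group URLs by shard so each group can be indexed in parallel."""
    shards = {i: [] for i in range(num_shards)}
    for url in blog_urls:
        shards[shard_for(url, num_shards)].append(url)
    return shards


if __name__ == "__main__":
    # Example URLs are placeholders.
    urls = ["http://example.com/blog/%d" % i for i in range(20)]
    for shard_id, members in partition(urls).items():
        print(shard_id, len(members))
```

The point of a scheme like this is that each shard holds only a fraction of the index, so a single machine's working set can stay smaller than its physical memory again, and the shards can be queried in parallel.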

Scalability stories for Oct 22, 2007

Why most large-scale sites which scale are not written in Java? - (What nine of the world's largest websites are running on) - A couple of very interesting blogs to read.
Slashdot's setup, Part 1 - Just in time for the 10-year anniversary.
Flexiscale - Looks like an Amazon competitor in the hosting business.
Pownce - Lessons learned - Lessons learned while developing Pownce, a social messaging web application.
Amazon's Dynamo - Dynamo is internal technology developed at Amazon to address the need for an incrementally scalable, highly available key-value storage system. The technology is designed to give its users the ability to trade off cost, consistency, durability and performance, while maintaining high availability. [PDF]. This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of avail

Crawling sucks.

I wrote my first crawler in a few lines of Perl to spider a website recursively about 10 years ago. Two years ago I wrote another crawler in a few thousand lines using Java, PHP and MySQL. But this time I wasn't really interested in competing with Google, and instead crawled feeds (RSS/Atom). Google hadn't released its blog search engine at that time. Using the little Java knowledge I had and third-party packages like Rome and some HTML parsers, I hacked up my first crawler in a matter of days. The original website let users train its Bayesian engine on the kinds of news items they liked to read, and it automatically tracked millions of feeds to find the best content for them. After a few weeks of running that site, I had a renewed appreciation for the search engineers who breathe this stuff day in and day out. That site eventually went down, mostly due to design flaws I made early on, which I'm listing here for those who love learning.
Seeding proble

EC2 for everyone. And now includes 64-bit with 15GB RAM too.

Finally it happened. EC2 is available for everybody. And more than that, they now provide servers with 7.5GB and 15GB of RAM per instance. Sweet. For a lot of companies EC2 was not viable due to the high memory requirements of some of their applications. Splitting up such tasks to use less memory on multiple servers was possible, but not really cost- or time-efficient. The release of the new instance types removes that roadblock and will probably spark significant curiosity among developers of memory-crunching applications.
$0.10 - Small Instance (Default): 1.7 GB of memory, 1 EC2 Compute Unit (1 virtual core with 1 EC2 Compute Unit), 160 GB of instance storage, 32-bit platform
$0.40 - Large Instance: 7.5 GB of memory, 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each), 850 GB of instance storage, 64-bit platform
$0.80 - Extra Large Instance: 15 GB of memory, 8 EC2 Compute Units (4 virtual cores with 2 EC2 Compute Units each), 1690 GB of instance storage, 64-bit platfor
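To make the cost argument concrete, here is a small sketch that works out the listed price per GB of memory and per compute unit for the three instance types above. The figures are copied from the list; the helper names and the assumption that a workload splits perfectly across instances are mine, for illustration only.

```python
import math

# Figures copied from the instance list above. "price" is the dollar amount
# shown next to each instance type.
INSTANCES = {
    "small": {"price": 0.10, "memory_gb": 1.7, "compute_units": 1},
    "large": {"price": 0.40, "memory_gb": 7.5, "compute_units": 4},
    "extra_large": {"price": 0.80, "memory_gb": 15.0, "compute_units": 8},
}


def price_per_gb(name):
    """Listed price divided by GB of memory for one instance type."""
    spec = INSTANCES[name]
    return spec["price"] / spec["memory_gb"]


def price_per_compute_unit(name):
    """Listed price divided by EC2 Compute Units for one instance type."""
    spec = INSTANCES[name]
    return spec["price"] / spec["compute_units"]


def cost_to_hold(memory_gb, name):
    """Return (instance count, total listed price) needed to reach memory_gb.

    Hypothetical helper: assumes the workload can be split perfectly
    across instances, which is the optimistic case for small instances.
    """
    spec = INSTANCES[name]
    count = math.ceil(memory_gb / spec["memory_gb"])
    return count, count * spec["price"]


if __name__ == "__main__":
    for name in INSTANCES:
        print(name, round(price_per_gb(name), 4), round(price_per_compute_unit(name), 4))
    print(cost_to_hold(15, "small"))
    print(cost_to_hold(15, "extra_large"))
```

Under these assumptions, reaching 15 GB of total memory takes nine small instances at a combined listed price of about $0.90, against $0.80 for a single extra large instance, while the price per compute unit works out to the same $0.10 for all three types. So the larger instances are mainly a win on memory, even before counting the engineering time needed to split an application.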

Web Scalability dashboard

[Blogofy: bringing feeds together]
I took a week's break from blogging to work on one of my long-overdue personal projects. Even though I use Google Reader as my feed aggregator, I noticed a lot of folks still prefer a visual UI to track news and feeds. My experiments in designing such a visual UI led me to create Blogofy. If you have an interesting blog on web scalability, availability or performance which you want included here, please let me know. The list of blogs on the page is in flux at the moment, and I might move the feeds around a little depending on user feedback and blog activity.