Archive for the ‘software’ Category

Scaling updates for Feb 10, 2010

Lots of interesting updates today.
But I would first like to mention the fantastic work the cloud computing group at UCSB is doing to make the App Engine framework more open. They have done significant work on making AppScale “work” with different kinds of data sources, including HBase, Cassandra, Voldemort, MongoDB, Hypertable, MySQL and MemcacheDB. AppScale is actively [...]

Read the rest of this entry »

Windows Azure

Windows Azure is an application platform provided by Microsoft that lets others run applications on Microsoft’s “cloud” infrastructure. It’s finally open for business (as of Feb 1, 2010). Below are some links about Azure for those who are still catching up.
Wikipedia: Windows Azure has three core components: Compute, Storage and Fabric. As the names [...]

Read the rest of this entry »

Hive @Facebook

Hive is a data warehouse infrastructure built over Hadoop. It provides tools to enable easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL [...]

Read the rest of this entry »

HAProxy : Load balancing

Designing any scalable web architecture would be incomplete without investigating “load balancers”. There used to be a time when selecting and installing load balancers was an art in itself. Not anymore.
A lot of organizations today use Apache web servers as a proxy server (and also as a load balancer) for the backend application clusters. Though [...]

Read the rest of this entry »

ESI: Edge Side Includes

Web page caching gets tricky once personalization is involved. Let’s take Twitter’s public_timeline, for example, which seems to be perfect for caching. Unfortunately, when a user is logged in, it also shows the user’s information. Caching that particular page in its entirety on the web server may not be an option in such scenarios. Another [...]

Read the rest of this entry »

Cassandra for service registry/discovery service

My last post was about my struggle to find a good distributed ESB/Service-discovery solution built over open source tools which was simple to use and maintain. Thanks to reader comments (Dan especially) and some other email exchanges, it seems like building a custom solution is unavoidable if I really want to keep things simple.
Dan suggested [...]

Read the rest of this entry »

Google app engine review (Java edition)

For the last couple of weekends I’ve been playing with Google App Engine (Java edition) and was pleasantly surprised at the direction it has taken. I was also fortunate enough to see some Google engineers talk on this subject at Google I/O, which helped me a lot in compiling all this information.
But before I [...]

Read the rest of this entry »

Scaling Technorati – 100 million blogs indexed every day

Indexing 100 million blogs with over 10 billion objects, and with a user base that is doubling every six months, Technorati has an edge over most blog search engines. But they are much more than search, and any Technorati user can tell you that. I recommend you read John Newton’s interview with David Sifry which [...]

Read the rest of this entry »

Crawling sucks.

I wrote my first crawler in a few lines of Perl code to spider a website recursively about 10 years ago. Two years ago I wrote another crawler in a few thousand lines using Java+PHP and MySQL. But this time I wasn’t really interested in competing with Google, and instead crawled feeds (RSS/Atom). Google hadn’t [...]

Read the rest of this entry »

Scalable products: KFS released

Kosmix, a search startup, has released the source to a C++ implementation of something that looks like a clustered file system. This looks very similar to Hadoop/HDFS, but the C++ factor will be a big performance boost.
From Skrenta’s blog:

Incremental scalability – New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the [...]

Read the rest of this entry »