August 04, 2007

The "me too" phenomenon and Identity theft

A very interesting article from Muhammad Saleem on the "me too" phenomenon. My problem with this phenomenon is that this might make stealing identity easier than before. In this new web 2.0 world, if I need your passwords or mother's maiden name, all I have to do is build an interesting application which you would like to try out at least once. Once I have your password or other key information (which most likely be the same across all your applications), I can shut the side down and do other interesting things. I'm an open advocate of OpenID which attacks some of the issues, but its no silver bullet.
More from Muhammad's blog..
"Everyday a new company announces a 'new' product which is nothing more than the old product with slight modifications or a few small additional features. This mentality is not only bad for users but also for marketers and even the startups.

A prime example of this phenomenon can be witnessed by comparing Dodgeball, Twitter, Jaiku, Tumblr, Pownce and a plethora of other microblogging tools. 90% of the services these different tools offer are the same, and the 10% that differentiates them is not significant enough to make most users switch."

Hadoop and HBase

This may not be a surprise for a lot of people but it was for me. Even though I have been using lucene and nutch for some time, I didn't really know enough about Hadoop and HBase until recently.


  • Scalable: Hadoop can reliably store and process petabytes.

  • Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.

  • Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.

  • Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.

Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below.) MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.

Google's Bigtable, a distributed storage system for structured data, is a very effective mechanism for storing very large amounts of data in a distributed environment.

Just as Bigtable leverages the distributed data storage provided by the Google File System, Hbase will provide Bigtable-like capabilities on top of Hadoop.

Data is organized into tables, rows and columns, but a query language like SQL is not supported. Instead, an Iterator-like interface is available for scanning through a row range (and of course there is an ability to retrieve a column value for a specific key).

Any particular column may have multiple values for the same row key. A secondary key can be provided to select a particular value or an Iterator can be set up to scan through the key-value pairs for that column given a specific row key.