July 17, 2009

Weekend reading material

 

Products/Ideas

  1. redis - http://code.google.com/p/redis/ : Redis is a key-value database. It is similar to memcached but the dataset is not volatile, and values can be strings, exactly like in memcached, but also lists and sets with atomic operations to push/pop elements.
  2. HBase - http://hadoop.apache.org/hbase/ : HBase is the Hadoop database. It's an open-source, distributed, column-oriented store modeled after the Google paper, Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.
  3. Sherpa - http://research.yahoo.com/node/2139
  4. BigTable - http://labs.google.com/papers/bigtable-osdi06.pdf
  5. voldemort - It is basically just a big, distributed, persistent, fault-tolerant hash table. For applications that can use an O/R mapper like active-record or hibernate, this will provide horizontal scalability and much higher availability but at great loss of convenience. For large applications under internet-type scalability pressure, a system will likely consist of a number of functionally partitioned services or APIs, which may manage storage resources across multiple data centers using storage systems which may themselves be horizontally partitioned. For applications in this space, arbitrary in-database joins are already impossible since all the data is not available in any single database. A typical pattern is to introduce a caching layer, which will require hashtable semantics anyway. For these applications Voldemort offers a number of advantages.
  6. Dynamo - A highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience.  To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.
  7. Cassandra - Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.
  8. Hypertable - Hypertable is an open source project based on published best practices and our own experience in solving large-scale data-intensive tasks.
  9. HDFS - The Hadoop Distributed File System (HDFS) is a distributed file system designed to run on commodity hardware. It has many similarities with existing distributed file systems. However, the differences from other distributed file systems are significant. HDFS is highly fault-tolerant and is designed to be deployed on low-cost hardware. HDFS provides high throughput access to application data and is suitable for applications that have large data sets.
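
The Dynamo entry above mentions object versioning and application-assisted conflict resolution. Here is a minimal Python sketch of that idea, assuming vector clocks as the versioning scheme (which is what the Dynamo paper describes). The names `VersionedStore`, `descends`, `merge_clocks` and `read_and_resolve` are made up for illustration and are not part of any real client library; the store is a single in-memory dict, not a distributed system.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Clock = Dict[str, int]  # node name -> number of writes coordinated by that node


def descends(a: Clock, b: Clock) -> bool:
    """Return True if clock a has seen every event recorded in clock b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def merge_clocks(clocks: Iterable[Clock]) -> Clock:
    """Element-wise maximum of several clocks."""
    merged: Clock = {}
    for clock in clocks:
        for node, count in clock.items():
            merged[node] = max(merged.get(node, 0), count)
    return merged


class VersionedStore:
    """Toy store (hypothetical) that keeps conflicting versions side by side."""

    def __init__(self) -> None:
        self._data: Dict[str, List[Tuple[Clock, object]]] = {}

    def get(self, key: str) -> List[Tuple[Clock, object]]:
        """Return every version that has not been superseded."""
        return list(self._data.get(key, []))

    def put(self, key: str, value: object, context: Clock, node: str) -> None:
        """Write a value derived from the versions summarized by `context`."""
        clock = dict(context)
        clock[node] = clock.get(node, 0) + 1
        # Drop versions the new write supersedes; keep concurrent ones.
        survivors = [
            (c, v) for c, v in self._data.get(key, []) if not descends(clock, c)
        ]
        survivors.append((clock, value))
        self._data[key] = survivors


def read_and_resolve(
    store: VersionedStore, key: str, resolve: Callable[[List[object]], object]
) -> Tuple[object, Clock]:
    """Read a key; if versions conflict, let the application merge them."""
    versions = store.get(key)
    if not versions:
        return None, {}
    context = merge_clocks(clock for clock, _ in versions)
    values = [value for _, value in versions]
    value = values[0] if len(values) == 1 else resolve(values)
    return value, context
```

For a shopping cart, the application-supplied `resolve` could simply take the union of the conflicting carts, for example `lambda carts: set().union(*carts)`. Writing the merged value back with the returned context supersedes both conflicting versions.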

Blog/Posts/Links

  1. Eventually Consistent 
  2. Bunch of Links at bytepawn
  3. Fallacies of Distributed Computing

Is Yahoo launching a cloud storage solution : MObStor

While the rest of the world is busy with Microsoft and Google, Yahoo might be preparing to launch MObStor, which they tout as the “Unstructured Storage for the Internet”.

While comparing MObStor to the various cloud computing storage solutions already available, Navneet Joneja, Sr. Product Manager, mentions Facebook’s Haystack to describe MObStor’s architectural design. He also points out that though Facebook’s Haystack was optimized to store photographs, MObStor was optimized for a diverse set of use cases.

It’s a REST-based, browser-accessible API with a simple security model and content-agnostic storage features. The focus of this service seems to be fast, reliable, secure storage with the option of allowing customers to layer additional services on top of the core service. It claims it will be optimized for high performance and high availability (who doesn’t?).

Here is more from the Yahoo Developer Network Blog:

Facebook's Haystack is based on commodity storage. While MObStor does support commodity storage, it doesn't require it. Instead, we have a storage-layer abstraction we call the ObjectStore. The ObjectStore encapsulates the key storage operations we need to perform, and allows us to have many underlying physical object stores. This allows us to mix, for example, filer-based storage with commodity storage. The upper layers have the routing intelligence that determines which ObjectStore a given piece of data is stored in. However, like Haystack, we do support high request rates using our own optimized ObjectStore written to run on commodity hardware - with one important difference. While Haystack identifies every object using a 64-bit photo key, all objects in MObStor are accessible through logical (i.e., client-supplied) URLs, not object IDs.

In MObStor, the storage layer maintains the mapping between logical URLs and physical storage, and can use any means to do so - the implementation is encapsulated within the storage layer. Needless to say, this operation is a potential performance bottleneck, so we've carefully optimized the algorithms used and the hardware that they run on.
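
To make the quoted design a bit more concrete, here is a rough Python sketch of an ObjectStore-style abstraction with a layer that maps client-supplied URLs to physical storage. Only the `ObjectStore` name comes from the quote; everything else (`InMemoryObjectStore`, `StorageLayer`, the prefix-based routing policy, the `obj-N` IDs) is an assumption made for illustration, and the sketch folds routing and URL mapping into one class for brevity. It says nothing about how MObStor is actually implemented.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple


class ObjectStore(ABC):
    """The key storage operations; physical backends implement this."""

    @abstractmethod
    def put(self, object_id: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, object_id: str) -> bytes: ...


class InMemoryObjectStore(ObjectStore):
    """Stand-in for a real backend such as a filer or commodity disks."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._blobs: Dict[str, bytes] = {}

    def put(self, object_id: str, data: bytes) -> None:
        self._blobs[object_id] = data

    def get(self, object_id: str) -> bytes:
        return self._blobs[object_id]


class StorageLayer:
    """Maps logical URLs to (store, object ID); routes new URLs by prefix."""

    def __init__(
        self, routes: List[Tuple[str, ObjectStore]], default: ObjectStore
    ) -> None:
        # Longest prefix first, so the most specific route wins.
        self._routes = sorted(routes, key=lambda r: len(r[0]), reverse=True)
        self._default = default
        self._locations: Dict[str, Tuple[ObjectStore, str]] = {}
        self._next_id = 0

    def _pick_store(self, url: str) -> ObjectStore:
        for prefix, store in self._routes:
            if url.startswith(prefix):
                return store
        return self._default

    def put(self, url: str, data: bytes) -> None:
        if url not in self._locations:
            self._next_id += 1
            self._locations[url] = (self._pick_store(url), f"obj-{self._next_id}")
        store, object_id = self._locations[url]
        store.put(object_id, data)

    def get(self, url: str) -> bytes:
        store, object_id = self._locations[url]
        return store.get(object_id)

    def locate(self, url: str) -> Tuple[str, str]:
        """Return (store name, object ID) for a URL, for inspection only."""
        store, object_id = self._locations[url]
        return store.name, object_id
```

Clients only ever see URLs such as `/photos/cat.jpg`; which backend holds the bytes, and under what internal ID, stays hidden behind the lookup table. That table is the piece the quote calls a potential performance bottleneck.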

Now, with Amazon, Google, Microsoft and Yahoo in the picture, the last shoe might finally drop.