Archive for the ‘hadoop’ Category

Scalable products: KFS released

Friday, September 28th, 2007

Kosmix, a search startup, has released the source of a C++ implementation of what looks like a clustered file system. It appears very similar to Hadoop/HDFS, though being written in C++ may give it a performance advantage.

From the Skrenta blog:

    • Incremental scalability - New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.
    • Availability - Replication is used to provide availability due to chunk server failures.
    • Re-balancing - Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is done to help with balancing disk space utilization amongst nodes.
    • Data integrity - To handle disk corruptions to data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.
    • Client side fail-over - During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail-over to another chunkserver and continue the read. This fail-over is transparent to the application.
    • Language support - KFS client library can be accessed from C++, Java, and Python.
    • FUSE support on Linux - By mounting KFS via FUSE, this support allows existing Linux utilities (such as, ls) to interface with KFS.
    • Leases - KFS client library uses caching to improve performance. Leases are used to support cache consistency.
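
To make the data integrity and client-side fail-over points a bit more concrete, here is a minimal sketch in Python. It is not the KFS client API: `ChunkServer`, `read_chunk`, the use of CRC32 as the checksum, and the server names are all assumptions made up for illustration. It only shows the general idea of verifying a checksum on each read and moving on to another replica when a server is unreachable or a copy is corrupted.

```python
import zlib


class ChunkServer:
    """Hypothetical stand-in for a chunkserver holding one replica per chunk."""

    def __init__(self, name, reachable=True):
        self.name = name
        self.reachable = reachable
        self.chunks = {}  # chunk_id -> (data, checksum)

    def store(self, chunk_id, data):
        # Checksum choice (CRC32) is an assumption, not taken from KFS.
        self.chunks[chunk_id] = (data, zlib.crc32(data))

    def read(self, chunk_id):
        if not self.reachable:
            raise ConnectionError(f"{self.name} is unreachable")
        return self.chunks[chunk_id]


def read_chunk(replicas, chunk_id):
    """Try each replica in turn, skipping unreachable servers and bad copies."""
    for server in replicas:
        try:
            data, stored_checksum = server.read(chunk_id)
        except ConnectionError:
            continue  # client-side fail-over to the next replica
        if zlib.crc32(data) == stored_checksum:
            return data, server.name
        # Checksum mismatch: a real system would trigger re-replication here.
    raise IOError(f"no healthy replica for chunk {chunk_id}")


servers = [ChunkServer("cs1", reachable=False), ChunkServer("cs2"), ChunkServer("cs3")]
for s in servers:
    s.store("chunk-0", b"hello kfs")

# Simulate disk corruption on cs2: the data changes, the stored checksum does not.
servers[1].chunks["chunk-0"] = (b"hellO kfs", zlib.crc32(b"hello kfs"))

data, source = read_chunk(servers, "chunk-0")
print(data, source)  # b'hello kfs' cs3
```

The caller never sees the unreachable server or the corrupted copy, which is what "transparent to the application" amounts to in this toy version.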

If anyone has experience with KFS or has more information, please leave a comment here.


Hadoop and HBase

Saturday, August 4th, 2007

This may not be a surprise to a lot of people, but it was to me. Even though I have been using Lucene and Nutch for some time, I didn’t really know much about Hadoop and HBase until recently.

Hadoop

  • Scalable: Hadoop can reliably store and process petabytes.
  • Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
  • Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.
  • Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.


Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS) (see figure below). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
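
As a rough illustration of how MapReduce splits work into small pieces, here is a single-process word count in plain Python. It does not use Hadoop at all: the function names `map_phase`, `shuffle`, and `reduce_phase` and the sample input blocks are my own assumptions. In a real cluster each block would be mapped on a node that holds a replica of it, and the grouping step would happen across the network.

```python
from collections import defaultdict


def map_phase(block):
    """Map: emit a (word, 1) pair for every word in one block of input."""
    return [(word.lower(), 1) for word in block.split()]


def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


# Each string stands in for one data block stored somewhere in the cluster.
blocks = ["hadoop stores data", "hadoop processes data", "data is replicated"]

mapped = [pair for block in blocks for pair in map_phase(block)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts["data"], counts["hadoop"])  # 3 2
```

Because each call to `map_phase` only looks at its own block, the map step can run independently wherever the block lives, which is the property the description above relies on.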

HBase

Google’s Bigtable, a distributed storage system for structured data, is a very effective mechanism for storing very large amounts of data in a distributed environment.

Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase will provide Bigtable-like capabilities on top of Hadoop.

Data is organized into tables, rows and columns, but a query language like SQL is not supported. Instead, an Iterator-like interface is available for scanning through a row range (and of course there is an ability to retrieve a column value for a specific key).

Any particular column may have multiple values for the same row key. A secondary key can be provided to select a particular value or an Iterator can be set up to scan through the key-value pairs for that column given a specific row key.
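
The data model described above can be sketched as a small in-memory table in Python. This is not the HBase API: `ToyTable` and its methods are hypothetical names, and I am assuming the secondary key is a version timestamp, as it is in Bigtable. The sketch covers the three access patterns mentioned: fetching a column value by row key, scanning a row range, and iterating over the multiple values stored for one column of one row.

```python
import bisect


class ToyTable:
    """Hypothetical in-memory model: row key -> column -> {timestamp: value}."""

    def __init__(self):
        self.rows = {}
        self.sorted_keys = []  # row keys kept sorted so range scans are simple

    def put(self, row, column, timestamp, value):
        if row not in self.rows:
            bisect.insort(self.sorted_keys, row)
            self.rows[row] = {}
        self.rows[row].setdefault(column, {})[timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return one value; without a timestamp, return the newest version."""
        versions = self.rows[row][column]
        if timestamp is None:
            timestamp = max(versions)
        return versions[timestamp]

    def versions(self, row, column):
        """Iterate over (timestamp, value) pairs for one cell, newest first."""
        yield from sorted(self.rows[row][column].items(), reverse=True)

    def scan(self, start_row, stop_row):
        """Iterate over rows with start_row <= key < stop_row, in key order."""
        lo = bisect.bisect_left(self.sorted_keys, start_row)
        hi = bisect.bisect_left(self.sorted_keys, stop_row)
        for key in self.sorted_keys[lo:hi]:
            yield key, self.rows[key]


table = ToyTable()
table.put("row1", "contents:", 1, "v1")
table.put("row1", "contents:", 2, "v2")
table.put("row2", "contents:", 1, "b1")
table.put("row9", "contents:", 1, "x1")

latest = table.get("row1", "contents:")
older = table.get("row1", "contents:", timestamp=1)
history = list(table.versions("row1", "contents:"))
scanned = [key for key, _ in table.scan("row1", "row3")]
print(latest, older, history, scanned)
```

There is no query language here, only lookups by key and ordered iteration, which matches the description of what is and is not supported.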
