Kosmix, a search startup, has released the source to a C++ implementation of what looks like a clustered file system. It looks very similar to Hadoop/HDFS, though being written in C++ could give it a performance edge.
From Skrenta's blog:
- Incremental scalability – New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.
- Availability – Replication is used to provide availability in the face of chunkserver failures.
- Re-balancing – Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is done to help with balancing disk space utilization amongst nodes.
- Data integrity – To handle disk corruptions to data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.
- Client side fail-over – During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail-over to another chunkserver and continue the read. This fail-over is transparent to the application.
- Language support – KFS client library can be accessed from C++, Java, and Python.
- FUSE support on Linux – Mounting KFS via FUSE allows existing Linux utilities (such as ls) to interface with KFS.
- Leases – KFS client library uses caching to improve performance. Leases are used to support cache consistency.
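To make the data integrity and client-side fail-over items above a bit more concrete, here is a rough toy sketch in Python. It is not the KFS client API: `ToyReplica`, `ReplicaUnreachable`, and `read_chunk` are hypothetical names made up for illustration, the replicas live in memory, and CRC32 is an assumption, since the list does not say which checksum KFS uses. The idea is just that a read verifies the checksum, skips unreachable servers, and notes corrupted copies so they could be re-replicated.

```python
import zlib


class ReplicaUnreachable(Exception):
    """Raised by the toy replica when it is marked as down."""


class ToyReplica:
    """Hypothetical in-memory stand-in for one chunkserver holding one chunk."""

    def __init__(self, name, data, reachable=True):
        self.name = name
        self.data = data
        self.checksum = zlib.crc32(data)  # recorded at write time (CRC32 is an assumption)
        self.reachable = reachable

    def read(self):
        if not self.reachable:
            raise ReplicaUnreachable(self.name)
        return self.data, self.checksum


def read_chunk(replicas):
    """Return (data, replica_name, corrupted_names) from the first replica
    that is reachable and whose data matches its stored checksum."""
    corrupted = []
    for replica in replicas:
        try:
            data, expected = replica.read()
        except ReplicaUnreachable:
            continue  # fail over to the next replica
        if zlib.crc32(data) != expected:
            corrupted.append(replica.name)  # candidate for re-replication
            continue
        return data, replica.name, corrupted
    raise OSError("no healthy replica available")


replicas = [
    ToyReplica("cs1", b"hello kfs", reachable=False),  # server is down
    ToyReplica("cs2", b"hello kfs"),
    ToyReplica("cs3", b"hello kfs"),
]
replicas[1].data = b"hellO kfs"  # simulate on-disk corruption after the write

data, served_by, corrupted = read_chunk(replicas)
print(data, served_by, corrupted)
```

In this toy run the caller only sees the data come back; the unreachable server and the corrupted copy are handled inside `read_chunk`, which is roughly what "transparent to the application" means in the list.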
If anyone has experience with KFS or has more information, please leave a comment here.