Hive @Facebook

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism to impose structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, the language allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis than the built-in capabilities of the language support.

At a user group meeting, Ashish Thusoo from the Facebook data team spoke about how Facebook uses Hive for its data processing needs.

Problem

Facebook is a free service that has experienced rapid growth over the last few years. The amount of data it collects, which was around 200 GB per day in March 2008, has grown to 15 TB per day today. Facebook realized early on that insights derived from simple algorithms on more data beat insights from complex algorithms on smaller data sets.

But the traditional approach to ETL on proprietary storage systems was not only getting expensive to maintain, it was also limited in how far it could scale. This is when they started experimenting with Hadoop.

How Hadoop gave birth to Hive

Hadoop turned out to be superior in availability, scalability and manageability. Its efficiency wasn’t great, but one could get more throughput by throwing more cheap hardware at it. Ashish pointed out that, although at that point partial availability, resilience and scale mattered more than ACID guarantees, they had a hard time finding Hadoop programmers within Facebook to make use of the cluster.

It was this that eventually pushed Facebook to build a new way of querying data in Hadoop that doesn’t require writing map-reduce jobs in Java. That quickly led to the development of Hive, which does exactly what it set out to do. Let’s look at a couple of examples of Hive queries; the two forms below are equivalent: each counts, for every value of a.bar, the rows in invites with a.foo > 0 and writes the result into the events table.

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Hive’s long-term goal was to develop a system for managing and querying structured data built on top of Hadoop. To do that, it uses MapReduce for execution and HDFS for storage. They modeled the language on SQL and designed it to be extensible, interoperable and able to outperform traditional processing mechanisms.

How it is used

Facebook has a production Hive cluster which is primarily used for log summarization, including aggregation of impressions, click counts and statistics around user engagement. They have a separate cluster for ad hoc analysis which most Facebook employees are free to use. Over time they have also figured out how to use it for spam detection, ad optimization and a host of other, less publicized, applications.

Facebook Hive/Hadoop statistics

The Scribe/Hadoop cluster at Facebook has about 50 nodes today and processes about 25 TB of raw data. About 99% of its data is available for use within 20 seconds. The Hive/Hadoop cluster, where most of the data processing happens, has about 8,400 cores with roughly 12.5 PB of raw storage, which translates to 4 PB of usable storage after replication. Each node in the cluster is an 8-core server with 12 TB of storage.
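
Those per-node figures line up with the cluster totals, and the raw-to-usable ratio is what you would expect with HDFS’s default replication factor of 3 (the factor itself is an assumption on my part; the talk only gives the before and after storage numbers):

  \[
  \frac{8400\ \text{cores}}{8\ \text{cores/node}} = 1050\ \text{nodes}, \qquad
  1050 \times 12\ \text{TB} \approx 12.6\ \text{PB raw}, \qquad
  \frac{12.5\ \text{PB}}{3} \approx 4.2\ \text{PB usable}
  \]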

All in all, Facebook takes in 12 TB of compressed new data and scans about 135 TB of compressed data per day. More than 7,500 Hive jobs run each day, consuming about 80,000 compute hours.

Scalable products: KFS released

Kosmix, a search startup, has released the source to a C++ implementation of what looks like a clustered file system. It looks very similar to Hadoop/HDFS, but the C++ factor could be a big performance boost.

From the Skrenta blog:

    • Incremental scalability – New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.
    • Availability – Replication is used to provide availability in the face of chunkserver failures.
    • Re-balancing – Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is done to help with balancing disk space utilization amongst nodes.
    • Data integrity – To handle disk corruption of data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk (a simplified sketch of this idea follows the list).
    • Client side fail-over – During reads, if the client library determines that the chunkserver it is communicating with is unreachable, it fails over to another chunkserver and continues the read. This fail-over is transparent to the application.
    • Language support – KFS client library can be accessed from C++, Java, and Python.
    • FUSE support on Linux – Mounting KFS via FUSE allows existing Linux utilities (such as ls) to interface with KFS.
    • Leases – KFS client library uses caching to improve performance. Leases are used to support cache consistency.
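
To make the data integrity and client-side fail-over points a bit more concrete, here is a simplified sketch in Java of a chunk read with checksum verification and fail-over to another replica. Every name in it is invented for illustration; this is not the actual KFS client library, which (as noted above) is accessible from C++, Java, and Python.

  import java.io.IOException;
  import java.util.List;
  import java.util.zip.CRC32;

  // Hypothetical illustration of the "data integrity" and "client side fail-over"
  // behaviour described above; none of these names come from the real KFS client.
  public class ChunkReadSketch {

      /** Reads a chunk, trying each replica in turn and verifying its checksum. */
      static byte[] readChunk(List<String> replicaHosts, long chunkId, long expectedChecksum)
              throws IOException {
          for (String host : replicaHosts) {
              try {
                  byte[] data = fetchFromChunkServer(host, chunkId); // may throw if host is down
                  CRC32 crc = new CRC32();
                  crc.update(data);
                  if (crc.getValue() == expectedChecksum) {
                      return data;  // checksum matches: hand the data to the caller
                  }
                  // Checksum mismatch: this replica is corrupt. A real system would report it
                  // to the meta-server so the chunk gets re-replicated; the client moves on.
              } catch (IOException unreachable) {
                  // Chunkserver unreachable: fail over to the next replica transparently.
              }
          }
          throw new IOException("no healthy replica found for chunk " + chunkId);
      }

      /** Placeholder for the network read; returns dummy bytes so the sketch compiles. */
      static byte[] fetchFromChunkServer(String host, long chunkId) throws IOException {
          return new byte[0];
      }
  }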

If anyone has experience with KFS or has more information, please leave a comment here.

Hadoop and HBase

This may not be a surprise for a lot of people, but it was for me. Even though I have been using Lucene and Nutch for some time, I didn’t really know enough about Hadoop and HBase until recently.

Hadoop

  • Scalable: Hadoop can reliably store and process petabytes.
  • Economical: It distributes the data and processing across clusters of commonly available computers. These clusters can number into the thousands of nodes.
  • Efficient: By distributing the data, Hadoop can process it in parallel on the nodes where the data is located. This makes it extremely rapid.
  • Reliable: Hadoop automatically maintains multiple copies of data and automatically redeploys computing tasks based on failures.


Hadoop implements MapReduce, using the Hadoop Distributed File System (HDFS). MapReduce divides applications into many small blocks of work. HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster. MapReduce can then process the data where it is located.
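
To see how the pieces fit together, here is roughly the canonical WordCount job written against the Hadoop MapReduce Java API (a generic example of mine, not one of the jobs discussed in this post): HDFS supplies the input splits, map tasks run close to the blocks they read and emit (word, 1) pairs, and the reducer sums the counts per word.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class WordCount {
      // Mapper: emits (word, 1) for every word in each input line.
      public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private final Text word = new Text();

          @Override
          public void map(Object key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                  word.set(itr.nextToken());
                  context.write(word, one);
              }
          }
      }

      // Reducer: sums the counts emitted for each word.
      public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
          private final IntWritable result = new IntWritable();

          @Override
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
                  throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                  sum += val.get();
              }
              result.set(sum);
              context.write(key, result);
          }
      }

      public static void main(String[] args) throws Exception {
          Job job = Job.getInstance(new Configuration(), "word count");
          job.setJarByClass(WordCount.class);
          job.setMapperClass(TokenizerMapper.class);
          job.setCombinerClass(IntSumReducer.class);
          job.setReducerClass(IntSumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path(args[0]));   // input lives in HDFS
          FileOutputFormat.setOutputPath(job, new Path(args[1])); // output written back to HDFS
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }

You would package this into a jar and submit it with hadoop jar (the jar name and input/output paths are up to you).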

HBase

Google’s Bigtable, a distributed storage system for structured data, is a very effective mechanism for storing very large amounts of data in a distributed environment.

Just as Bigtable leverages the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop.

Data is organized into tables, rows and columns, but a query language like SQL is not supported. Instead, an Iterator-like interface is available for scanning through a row range (and of course there is an ability to retrieve a column value for a specific key).

Any particular column may have multiple values for the same row key. A secondary key can be provided to select a particular value or an Iterator can be set up to scan through the key-value pairs for that column given a specific row key.
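
To make that access model concrete, here is a short sketch against the HBase Java client: a Get for a single cell and a Scan over a row range. The exact client classes have varied across HBase releases, and the table, column family and row keys below are made up for illustration.

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.ResultScanner;
  import org.apache.hadoop.hbase.client.Scan;
  import org.apache.hadoop.hbase.util.Bytes;

  public class HBaseAccessSketch {
      public static void main(String[] args) throws IOException {
          Configuration conf = HBaseConfiguration.create();
          // "mytable", the "info" column family and the row keys are invented for illustration.
          HTable table = new HTable(conf, "mytable");

          // Point lookup: fetch the value of one column for a specific row key.
          Get get = new Get(Bytes.toBytes("row-0042"));
          get.addColumn(Bytes.toBytes("info"), Bytes.toBytes("clicks"));
          get.setMaxVersions(3); // a cell can hold several timestamped values for the same row key
          Result row = table.get(get);
          byte[] latest = row.getValue(Bytes.toBytes("info"), Bytes.toBytes("clicks")); // newest version, raw bytes
          System.out.println("clicks cell has " + (latest == null ? 0 : latest.length) + " bytes");

          // Range scan: the Iterator-like interface for walking a row range (stop row is exclusive).
          Scan scan = new Scan(Bytes.toBytes("row-0000"), Bytes.toBytes("row-0100"));
          scan.addFamily(Bytes.toBytes("info"));
          ResultScanner scanner = table.getScanner(scan);
          try {
              for (Result r : scanner) {
                  System.out.println(Bytes.toString(r.getRow()));
              }
          } finally {
              scanner.close();
              table.close();
          }
      }
  }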