Showing posts from January 25, 2010

Hive @Facebook

Hive is a data warehouse infrastructure built on top of Hadoop. It provides tools for easy data ETL, a mechanism for imposing structure on the data, and the ability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that lets users familiar with SQL query the data. The language also lets programmers familiar with the MapReduce framework plug in their own custom mappers and reducers to perform more sophisticated analysis that the built-in capabilities of the language may not support.

At a user group meeting, Ashish Thusoo from the Facebook data team spoke about how Facebook uses Hive for its data processing needs.

Problem

Facebook is a free service and has experienced rapid growth in the last few years. The amount of data it collects, which was around 200GB per day in March 2008, has now grown to 15TB per day. Facebook realized early on that ins