Archive for the ‘mapreduce’ Category

Spanner: Google’s next Massive Storage and Computation infrastructure

MapReduce, Bigtable and Pregel have their origins in Google and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on, which was mentioned briefly (may be un-intentionally) at an event last year. Instead of caching data closer to user, it [...]

Read the rest of this entry »

Pregel: Google’s other data-processing infrastructure

Inside Google, MapReduce is used for 80% of all the data processing needs. That includes indexing web content, running the clustering engine for Google News, generating reports for popular queries (Google Trends), processing satellite imagery , language model processing for statistical machine translation and  even mundane tasks like data backup and restore. The other 20% [...]

Read the rest of this entry »

Hive @Facebook

Hive is a data warehouse infrastructure built over Hadoop. It provides tools to enable easy data ETL, a mechanism to put structures on the data, and the capability to querying and analysis of large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL [...]

Read the rest of this entry »

Google patents Map reduce “System and method for efficient large-scale data processing”

After filing in 2004, google finally got its patent on “System and method for efficient large-scale data processing”  approved  yesterday. Gigaom pointed out that if Google really wants to enforce it, it would have to go after many different vendors who are implementing “mapreduce” in some form in their applications and databases. Google’s intentions of [...]

Read the rest of this entry »

Scaling Powerset using Amazon’s EC2 and S3

The first thing most doc-com companies do before going public is setup an infrastructure to provide the service. And though it might sound straight forward to most of you, it can be a very expensive affair. To come up with the right kind of infrastructure for any new service a few key architectural decisions have [...]

Read the rest of this entry »

Hadoop and HBase

This may not be a surprise for a lot of people but it was for me. Even though I have been using lucene and nutch for some time, I didn’t really know enough about Hadoop and HBase until recently. Hadoop Scalable: Hadoop can reliably store and process petabytes. Economical: It distributes the data and processing [...]

Read the rest of this entry »