Inside Google, MapReduce is used for 80% of all the data processing needs. That includes indexing web content, running the clustering engine for Google News, generating reports for popular queries (Google Trends), processing satellite imagery , language model processing for statistical machine translation and even mundane tasks like data backup and restore.
According to wikipedia a â€œgraphâ€ is a collection of vertices or â€˜nodesâ€™ and a collection of â€˜edgesâ€™ that connect pair of â€˜nodesâ€™. Depending on the requirements, a â€˜graphâ€™ can be undirected which means there is no distinction between the two â€˜nodesâ€™ in the graph, or it could be directed from one â€˜nodeâ€™ to another.
While you can calculate something like â€˜pagerankâ€™ with MapReduce very quickly, you need more complex algorithms to mine some other kinds of relationships between pages or other sets of data.
Despite differences in structure and origin, many graphs out there have two things in common: each of them keeps growing in size, and there is a seemingly endless number of facts and details people would like to know about each one. Take, for example, geographic locations. A relatively simple analysis of a standard map (a graph!) can provide the shortest route between two cities. But progressively more sophisticated analysis could be applied to richer information such as speed limits, expected traffic jams, roadworks and even weather conditions. In addition to the shortest route, measured as sheer distance, you could learn about the most scenic route, or the most fuel-efficient one, or the one which has the most rest areas. All these options, and more, can all be extracted from the graph and made useful â€” provided you have the right tools and inputs. The web graph is similar. The web contains billions of documents, and that number increases daily. To help you find what you need from that vast amount of information, Google extracts more than 200 signals from the web graph, ranging from the language of a webpage to the number and quality of other pages pointing to it.
In order to achieve that, we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges’ states, and mutate the graph’s topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel).
Currently, Pregel scales to billions of vertices and edges, but this limit will keep expanding. Pregel’s applicability is harder to quantify, but so far we haven’t come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code. Developers of dozens of Pregel applications within Google have found that "thinking like a vertex," which is the essence of programming in Pregel, is intuitive.
This is from the Paper
We conducted an experiment solving single-source shortest paths using a simple Belman-Ford algorithm expressed in Pregel. As input data, we chose a randomly generated graph with a log-normal distribution of outdegrees; the graph had 1B vertices and approximately 80B edges. Weights of all edges were set to 1 for simplicity. We used a cluster of 480 commodity multi-core PCs and 2000 workers. The graph was partitioned 8000 ways using the default partitioning function based on a random hash, to get a sense of the default performance of the Pregel system. Each worker was assigned 4 graph partitions and allowed up to 2 computational threads to execute the Compute() function over the partitions (not counting system threads that handle messaging, etc.). Running shortest paths took under 200 seconds, using 8 supersteps.
A talk is scheduled on Pregel at SIGMOD 2010
491 Pregel: A System for Large-Scale Graph Processing
Greg Malewicz, Google, Inc.; Matthew Austern, Google, Inc.; Aart Bik, Google, Inc.; James Dehnert, Google, Inc.; Ilan Horn, Google, Inc.; Naty Leiser, Google, Inc.; Grzegorz Czajkowski, Google, Inc.