Inside Google, MapReduce is used for 80% of all the data processing needs. That includes indexing web content, running the clustering engine for Google News, generating reports for popular queries (Google Trends), processing satellite imagery , language model processing for statistical machine translation and even mundane tasks like data backup and restore.
The other 20% is handled by a lesser known infrastructure called â€œPregelâ€ which is optimized to mine relationships from â€œgraphsâ€.
According to wikipedia a â€œgraphâ€ is a collection of vertices or â€˜nodesâ€™ and a collection of â€˜edgesâ€™ that connect pair of â€˜nodesâ€™. Depending on the requirements, a â€˜graphâ€™ can be undirected which means there is no distinction between the two â€˜nodesâ€™ in the graph, or it could be directed from one â€˜nodeâ€™ to another.
While you can calculate something like â€˜pagerankâ€™ with MapReduce very quickly, you need more complex algorithms to mine some other kinds of relationships between pages or other sets of data.
Despite differences in structure and origin, many graphs out there have two things in common: each of them keeps growing in size, and there is a seemingly endless number of facts and details people would like to know about each one. Take, for example, geographic locations. A relatively simple analysis of a standard map (a graph!) can provide the shortest route between two cities. But progressively more sophisticated analysis could be applied to richer information such as speed limits, expected traffic jams, roadworks and even weather conditions. In addition to the shortest route, measured as sheer distance, you could learn about the most scenic route, or the most fuel-efficient one, or the one which has the most rest areas. All these options, and more, can all be extracted from the graph and made useful â€” provided you have the right tools and inputs. The web graph is similar. The web contains billions of documents, and that number increases daily. To help you find what you need from that vast amount of information, Google extracts more than 200 signals from the web graph, ranging from the language of a webpage to the number and quality of other pages pointing to it.
In order to achieve that, we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges’ states, and mutate the graph’s topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel).
Currently, Pregel scales to billions of vertices and edges, but this limit will keep expanding. Pregel’s applicability is harder to quantify, but so far we haven’t come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code. Developers of dozens of Pregel applications within Google have found that "thinking like a vertex," which is the essence of programming in Pregel, is intuitive.
This is from the Paper
We conducted an experiment solving single-source shortest paths using a simple Belman-Ford algorithm expressed in Pregel. As input data, we chose a randomly generated graph with a log-normal distribution of outdegrees; the graph had 1B vertices and approximately 80B edges. Weights of all edges were set to 1 for simplicity. We used a cluster of 480 commodity multi-core PCs and 2000 workers. The graph was partitioned 8000 ways using the default partitioning function based on a random hash, to get a sense of the default performance of the Pregel system. Each worker was assigned 4 graph partitions and allowed up to 2 computational threads to execute the Compute() function over the partitions (not counting system threads that handle messaging, etc.). Running shortest paths took under 200 seconds, using 8 supersteps.
A talk is scheduled on Pregel at SIGMOD 2010
491 Pregel: A System for Large-Scale Graph Processing
Greg Malewicz, Google, Inc.; Matthew Austern, Google, Inc.; Aart Bik, Google, Inc.; James Dehnert, Google, Inc.; Ilan Horn, Google, Inc.; Naty Leiser, Google, Inc.; Grzegorz Czajkowski, Google, Inc.
- Pregel: A system for large-scale graph processing
- Inference Anatomy of the Google Pregel
- Large-scale graph computing at Google
- Mapreduce sigmetrics 09
- Bulk synchronous parallel
- Hamburg, Graph processing on Hadoop
- Bulk synchronous parallel
- 12 Reasons to learn graph theory
Related databases and products
- Neo4j is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. A graph (mathematical lingo for a network) is a flexible data structure that allows a more agile and rapid style of development. You can think of Neo4j as a high-performance graph engine with all the features of a mature and robust database. The programmer works with an object-oriented, flexible network structure rather than with strict and static tables â€” yet enjoys all the benefits of a fully transactional, enterprise-strength database.
- Hypergraphdb is a general purpose, extensible, portable, distributed, embeddable, open-source data storage mechanism. It is a graph database designed specifically for artificial intelligence and semantic web projects, it can also be used as an embedded object-oriented database for projects of all sizes.
- Infogrid is an Internet Graph Database with a many additional software components that make the development of REST-ful web applications on a graph foundation easy.
- DEX is a high performance library to manage very large graphs or networks. DEX is available for personal use in this web page. Take some time to play with it and develop your applications to manage the network like data structures that represent your network of friends, or your network of information.
- Gremlin – Gremlin is a graph-based programming language. The documentation herein will provide all the information necessary to understand how to use Gremlin for graph query, analysis, and manipulation.