Spanner: Google’s next Massive Storage and Computation infrastructure

MapReduce, Bigtable and Pregel have their origins in Google and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on, which was mentioned briefly (may be un-intentionally) at an event last year.

Instead of caching data closer to user, it looks like Google is trying to take “the data” to the user. If you use GMail or a Google Doc service, then with this framework, Google could, auto-magically, “move” one of the master copies of your data to the nearest Google data center without really having to cache anything locally. And because they are building one single datastore cluster around the world, instead of building hundreds of smaller ones for different applications, it looks like they may not don’t need dedicated clusters for specific projects anymore.

Below is the gist of “Spanner” from a talk by Jeff Dean at Symposium held at Cornell. Take a look at the rest of the slides if you are interested in some impressive statistics on hardware performance and reliability.

  • Spanner: Storage & computation system that spans all our datacenters
    • Single global namespace
      • Names are independent of location(s) of data
      • Similarities to Bigtable: table, families, locality groups, coprocessors,…
      • Differences: hierarchical directories instead of rows, fine-grained replication
      • Fine-grained ACLs, replication configuration at the per-directory level
    • support mix of strong and weak consistency across datacenters
      • Strong consistency implemented with Paxos across tablet replicas
      • Full support for distributed transactions across directories/machines
    • much more automated operation
      • System automatically moves and adds replicas of data and computation based on constraints and usage patterns
      • Automated allocation of resources across entire fleet of machines.



Eventual consistency is just caching ?

So there is someone who thinks “eventual consistency is just caching”.  Though I liked the idea of discussing this, I don’t agree with Udi’s views on this.

“Cache” is generally used to store data which is more expensive to obtain from the primary location. For example, caching mysql queries is ideal for queries which could take more than fraction of a second to execute. Another example is caching queries to S3, SimpleDB or Google’s datastore which could cost money and introduce network latency into the mix. Though most applications are built to use such caches, they are also designed to be responsive in absence of caching layer.

The most important difference between “cache” and a “datastore” is that the dataflow is generally from “datastore” to “cache” rather than the other way round. Though one could queue data on “cache” first and then update datastore later (for performance reasons) that is not the way one should use it. If you are using “cache” to queue data for slower storage, you are using the wrong product. There are better “queuing” solutions  (activemq for example) that can do it for you in a more reliable way.

In most “eventually consistent” systems, there is no concept of primary and secondary nodes. Most nodes on such systems are considered equal and have similar performance characteristics.

Since “caching” solutions are designed for speed, they generally don’t have a concept of “replicas” or allow persistence to disk. Synchronizing between replica’s or to a disk can be expensive and be counter productive which is why its rare to find them on “caching” products. But many “eventually consistent” systems do provide a way for developers to request the level of “consistency” (or disk persistence) desired.

Do you have an opinion on this ? Please share examples if you have seen “caching” layer being used as an “eventually consistent datastore”.

Update: Udi mentioned on twitter that “write through caches” are eventually consistent. Sure, they are if you are talking about a caching layer on top of a persistent layer. I think there is an argument which could be made that “caches” are eventually consistent, but the reverse may not be true which is what his original post mentioned.

Cassandra : inverted index

Cassandra is the only NOSQL datastore I’m aware of, which is scalable, distributed, self replicating, eventually consistent, schema-less key-value store running on java which doesn’t have a single point of failure. HBase could also match most of these requirements, but Cassandra is easier to manage due to its tiny footprint.

The one thing Cassandra doesn’t do today is indexing columns.

Lets take a specific example to explain the problem. Lets say there are 100 rows in the datastore which have 5 columns each. If you want to find the row which says “Service=app2”, you will have to iterate one row at a time which is like full database scan. In a 100 row datastore if only one row had that particular column, it could take on an average about 50 rows before you find your data.


While I’m sure there is a good reason why this doesn’t exist yet, the application inserting the data could build such an inverted index itself even today. Here is an example of how a table of inverted index would look like.


If you want to find the “status” of all rows where “Service=app2”, all you have to do is find the list of keys by making a single call to this table. The second call would be to get all the columns values for that row. Even if you have 100 different rows in a table, finding that one particular row, matching your search query, could  now be done in two calls.

Of course there is a penalty you have to pay. Every time you insert one row of data, you would also have to insert multiple rows to build the inverted index. You would also have to update the inverted index yourself if any of the column values are updated or deleted. Cassandra 0.5.0 which was recently released has been benchmarked to insert about 10000 rows per second on a 4 core server with 2GB of RAM. If you have an average of 5 columns per row, that is about 1.5k actual row inserts per second (that includes 5 rows of inserts/updates required for an inverted index). For more throughput you always have an option to add more servers.

Facebook and Digg are both extensively using Cassandra in their architectures. Here are some interesting reading materials on Cassandra if you’d like to explore more.

[Updated: Discussion on Google Buzz ]