Google Storage: What it really is…

Yesterday Google formally announced Google Storage to a few (5000?) of us at Google I/O. Here is the gist of this as I see it from the various discussions/talks I attended.

To begin with, I have to point out that there is almost nothing new in what Google has proposed to provide. Amazon has been doing this for years with S3. The key difference is that if you are a Google customer, you won’t have to look elsewhere for a storage service like this one.

Let’s get the technical details out of the way first:

  • It tries to implement a strong consistency model (the C and A of CAP: Consistent and Available), which means the data you store is automatically replicated in a consistent way across multiple datacenters
    • Currently it replicates to multiple locations within the US. In the future, Google plans to replicate across continents as well.
    • Currently there are no knobs to control how replication happens or where data gets replicated to. They plan to learn from usage during the beta period and develop such controls over time.
  • There are two basic building blocks:
    • Buckets – Containers
        All objects are stored in a flat container. However, the tools understand “/” and “*” (wildcards) and do the right thing when used correctly
    • Objects – objects/files inside those containers
  • Implements RESTful APIs (GET/PUT/POST/DELETE/HEAD/etc.); a rough request sketch follows this list
    • All resources are identified by a URI
  • No theoretical size limit on buckets or objects. However, a 100GB-per-account limit will be imposed during the beta phase.
  • It is, of course, built on Google’s well-tested, scalable, highly available infrastructure
  • It provides multiple, flexible authentication and sharing models
    • Does support standard public/private key based auth
    • Will also have integration with some form of groups, which will allow objects to be shared with, or controlled by, multiple identities.
    • ACLs can be applied to both Buckets and Objects
      • Buckets
        • Control who can list objects
        • Who can create/delete objects
        • Who can read/write into the bucket
      • Objects
        • Who can read
        • Who can read/write
  • Tools
    • There were two tools mentioned during the talk
      • GS Manager looks like a web application that allows an admin to manage the service
      • gsutil is more like the command-line tools AWS provides for S3.
        • As I mentioned before, gsutil accepts wildcards
          • So something like this is possible
            • gsutil cp gs://gs2010/*  /home/rkt/gs2010
  • The service was created with “data liberation” as one of its goals. As the previous command shows, it takes just one command to transfer all of your data out.
  • A resume feature (for when the connection breaks during a big upload) is not available yet, but that’s on the roadmap.
  • The groups feature was discussed a lot, but it’s not ready in the current release
  • Versioning is not available. It wasn’t clear if it’s on the roadmap or how long it will be before it’s implemented.
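
To make the RESTful model above a bit more concrete, here is a minimal sketch (in Python) of fetching an object over HTTP. The endpoint name, bucket and object names are my assumptions for illustration and were not part of the talk, and I am assuming a publicly readable object so the signed Authorization header that authenticated requests need can be skipped.

    import http.client

    # Hypothetical names, assumed for illustration only; a publicly readable
    # object lets us skip the signed Authorization header real requests need.
    HOST = "commondatastorage.googleapis.com"   # assumed service endpoint
    BUCKET = "gs2010"
    OBJ = "slides/keynote.pdf"

    conn = http.client.HTTPSConnection(HOST)
    # Every resource is addressed by a URI of the form /<bucket>/<object>
    conn.request("GET", "/{}/{}".format(BUCKET, OBJ))
    resp = conn.getresponse()
    print(resp.status, resp.reason)

    if resp.status == 200:
        with open("keynote.pdf", "wb") as f:
            f.write(resp.read())
    conn.close()

The same URI responds to PUT (create/overwrite), DELETE and HEAD, which is what makes one-line tools like gsutil possible.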

A few other notes:

  • It’s not clear how this plays with the “storage service” Google currently provides for Gmail/Docs storage. From what I heard, this is not related to that storage service at all, and there are no plans to integrate the two.
  • The service is free during the beta period to all developers who get access to it, but when it’s released it will follow a pricing model similar to others in the industry. The pricing model is already published on their website.
  • The speakers and the product managers didn’t comment on whether storage access from Google App Engine would be charged (or at what rate)
  • They do provide MD5 signatures as a way of verifying that an object on the client is the same as the object on the server, but MD5 is not used for storing the files themselves, so the MD5 collision issue shouldn’t be a problem. (A quick sketch of this kind of check follows this list.)
  • The US Navy is already using the service, with about 80TB of data on Google Storage, and from what I heard they sounded pretty happy talking about it.
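
Since the MD5 point above came up a few times, here is a minimal sketch of the kind of client-side check it enables. The file name and the expected digest value are made up, and the exact mechanism for obtaining the server-side digest (response header, listing, etc.) wasn’t covered in the talk, so treat that part as an assumption.

    import hashlib

    def md5_hex(path, chunk_size=1 << 20):
        """Compute the MD5 hex digest of a local file, reading in chunks."""
        digest = hashlib.md5()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    # The expected value would come from the service after an upload/download;
    # how exactly it is exposed wasn't covered, and this value is made up.
    expected = "9e107d9d372bb6826bd81d3542a419d6"
    local = md5_hex("keynote.pdf")
    print("match" if local == expected else "mismatch: {} vs {}".format(local, expected))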

I suspect this product will be in beta for a while before they release it out in the open.

Spanner: Google’s next Massive Storage and Computation infrastructure

MapReduce, Bigtable and Pregel all have their origins at Google, and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on, which was mentioned briefly (maybe unintentionally) at an event last year.

Instead of caching data closer to the user, it looks like Google is trying to take “the data” to the user. If you use Gmail or Google Docs, then with this framework Google could, auto-magically, “move” one of the master copies of your data to the nearest Google datacenter without really having to cache anything locally. And because they are building one single datastore cluster spanning the world, instead of building hundreds of smaller ones for different applications, it looks like they may not need dedicated clusters for specific projects anymore.

Below is the gist of “Spanner” from a talk Jeff Dean gave at a symposium held at Cornell (a toy sketch of the per-directory replication idea follows the list). Take a look at the rest of the slides if you are interested in some impressive statistics on hardware performance and reliability.

  • Spanner: Storage & computation system that spans all our datacenters
    • Single global namespace
      • Names are independent of location(s) of data
      • Similarities to Bigtable: table, families, locality groups, coprocessors,…
      • Differences: hierarchical directories instead of rows, fine-grained replication
      • Fine-grained ACLs, replication configuration at the per-directory level
    • Support for a mix of strong and weak consistency across datacenters
      • Strong consistency implemented with Paxos across tablet replicas
      • Full support for distributed transactions across directories/machines
    • Much more automated operation
      • System automatically moves and adds replicas of data and computation based on constraints and usage patterns
      • Automated allocation of resources across entire fleet of machines.
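
No implementation details were given, but to make the “hierarchical directories with per-directory replication and ACLs” idea a little more concrete, here is a purely illustrative toy sketch of what such a per-directory configuration might look like. Every field, datacenter name and policy value here is my own guess, not anything from the slides.

    from dataclasses import dataclass, field

    # Purely illustrative: the field names, datacenter names and policy values
    # are guesses, not anything described in the Spanner slides.
    @dataclass
    class DirectoryConfig:
        path: str                                     # hierarchical name, independent of where the data lives
        replicas: list = field(default_factory=list)  # datacenters holding copies
        consistency: str = "strong"                   # "strong" (Paxos across replicas) or "weak"
        acl: dict = field(default_factory=dict)       # principal -> permissions

    # Example: one user's mail, pinned near them, strongly consistent.
    cfg = DirectoryConfig(
        path="/users/alice/mail",
        replicas=["dc-us-east", "dc-us-west", "dc-eu-west"],
        consistency="strong",
        acl={"alice@example.com": "rw", "mail-frontend": "r"},
    )
    print(cfg)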


Disaster Recovery: Impressive RPO and RTO objectives set by Google Apps Operations

Unless you are running a fly-by-night shop, DR (disaster recovery) should be one of the top issues for your operations team. In a “scalable architecture” world, the complexity of DR can become a disaster in itself.

Yesterday Google announced that it finally has a DR plan for Google Apps. While this is nice, one should always take such messages with a pinch of salt until they prove they can actually deliver. Look at Google App Engine, which also had a DR plan, yet still suffered an outage of more than two hours because of incomplete documentation, insufficient training, and probably the lack of someone empowered to make a quick, decisive call at the time of failure.

But back to Google Apps for now. These guys are planning for an RPO of 0 seconds, which means multiple datacenters must be in a consistent state at all times. And they want the RTO to be instant failover as well! This is an incredible DR plan, and it requires technical expertise across all 7 layers of the OSI model to achieve.

Here is how Google’s announcement frames the traditional approach (the next few paragraphs are essentially their words):

In larger businesses, companies will add a storage area network (SAN), which is a consolidated place for all storage. SANs are expensive, and even then, you’re out of luck if your data center goes down. So the largest enterprises will build an entirely new data center somewhere else, with another set of identical mail servers, another SAN and more people to staff them.

But if, heaven forbid, disaster strikes both your data centers, you’re toast (check out this customer’s experience with a fire). So big companies will often build the second data center far away, in a different ‘threat zone’, which creates even more management headaches. Next they need to ensure the primary SAN talks to the backup SAN, so they have to implement robust bandwidth to handle terabytes of data flying back and forth without crippling their network. There are other backup options as well, but the story’s the same: as redundancy increases, cost and complexity multiplies.

How do you know if your disaster recovery solution is as strong as you need it to be? It’s usually measured in two ways: RPO (Recovery Point Objective) and RTO (Recovery Time Objective). RPO is how much data you’re willing to lose when things go wrong, and RTO is how long you’re willing to go without service after a disaster.

For a large enterprise running SANs, the RTO and RPO targets are an hour or less: the more you pay, the lower the numbers. That can mean a large company spending the big bucks is willing to lose all the email sent to them for up to an hour after the system goes down, and go without access to email for an hour as well. Enterprises without SANs may be literally trucking tapes back and forth between data centers, so as you can imagine their RPOs and RTOs can stretch into days. As for small businesses, often they just have to start over.

For Google Apps customers, our RPO design target is zero, and our RTO design target is instant failover. We do this through live or synchronous replication: every action you take in Gmail is simultaneously replicated in two data centers at once, so that if one data center fails, we nearly instantly transfer your data over to the other one that’s also been reflecting your actions.

This is one of the most ambitious DR plans I have ever read about for such a huge customer base. They not only have to replicate all user data to multiple datacenters, they have to do it synchronously (or almost synchronously), across huge distances (latency can slow synchronous operations down), without impacting users. And to top it all, they have to do a complete site failover if the primary datacenter goes down.
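
To illustrate why the synchronous part is the hard part, here is a toy sketch of the write path it implies: a user’s action is acknowledged only after every datacenter has persisted it, so every write pays the inter-datacenter round trip. The Replica class and its API are entirely made up for illustration; this is in no way how Google actually implements it.

    import concurrent.futures

    class Replica:
        """Stand-in for a datacenter-local store; entirely hypothetical."""
        def __init__(self, name):
            self.name = name
            self.data = {}

        def persist(self, key, value):
            # In reality this is a long-haul RPC plus a durable write; that
            # round trip is what an RPO of zero costs you on every write.
            self.data[key] = value
            return True

    REPLICAS = [Replica("dc-east"), Replica("dc-west")]

    def synchronous_write(key, value):
        """Acknowledge the user's action only once every replica has it."""
        with concurrent.futures.ThreadPoolExecutor() as pool:
            results = list(pool.map(lambda r: r.persist(key, value), REPLICAS))
        if not all(results):
            raise IOError("write not durable everywhere; cannot acknowledge")
        return "ack"

    print(synchronous_write("inbox/alice/msg-42", "Hello, world"))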

I am impressed, but I wouldn’t mind learning more about how they do it.