Google Storage : What it really is…

Yesterday Google formally announced Google Storage to a few (5000?) of us at Google I/O. Here is the gist of it as I see it, based on the various discussions and talks I attended.

To begin with, I have to point out that there is almost nothing new in what Google has proposed to provide. Amazon has been doing this for years with its S3. The key difference is that if you are a Google customer you won’t have to look elsewhere for a storage service like this one.

Let’s get the technical details out of the way first:

  • It tries to implement a strong consistency model (the C and A of CAP: Consistent and Available), which means the data you store is automatically replicated in a consistent way across multiple datacenters.
    • Currently it replicates to multiple locations within the US. In the future, Google plans to replicate across continents.
    • Currently there are no controls over how replication happens or where the data ends up. They plan to learn from usage during the beta period and develop controls over time.
  • There are two basic building blocks:
    • Buckets – Containers
        All objects are stored in a flat container. However, the tools understand “/” and “*” (wild cards) and do the right thing when used correctly
    • Objects – objects/files inside those containers
  • Implements RESTful APIs (GET/PUT/POST/DELETE/HEAD, etc.) – see the request sketch after this list
    • All resources are identified by a URI
  • There is no theoretical size limit on buckets or the objects they contain. However, a 100GB limit per account will be imposed during the beta phase.
  • It is, of course, built on Google’s well-tested, scalable, highly available infrastructure
  • It provides multiple, flexible authentication and sharing models
    • Does support standard public/private key based auth
    • Will also integrate with some kind of groups feature, which will allow objects to be shared with, or controlled by, multiple identities.
    • ACLs can be applied to both Buckets and Objects
      • Buckets
        • Control who can list objects
        • Who can create/delete objects
        • Who can read/write into the bucket
      • Objects
        • Who can read
        • Who can read/write
  • Tools
    • There were two tools mentioned during the talk
      • GS Manager looks like a web application that allows an admin to manage the service
      • GSUtil (gsutil) is more like the shell tools AWS provides for S3.
        • As I mentioned before, gsutil accepts wild cards
          • So something like this is possible
            • gsutil cp gs://gs2010/*  /home/rkt/gs2010
  • The service was created with “data liberation” as one of the goals. As the previous command shows, it takes just one command to transfer all of your data out.
  • A resume feature (for when the connection breaks during a big upload) is not available yet, but that’s on the roadmap.
  • The groups feature was discussed a lot, but it’s not ready in the current release.
  • A versioning feature is not available. It wasn’t clear whether it’s on the roadmap or how long it will take to implement.
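Since every resource is identified by a URI and manipulated with plain HTTP verbs, you can poke at the service with nothing more than a generic HTTP client. Below is a minimal sketch in Python; the commondatastorage.googleapis.com endpoint, the bucket name (borrowed from the gsutil example above) and the object name are my assumptions, and I’m assuming a public-read ACL so that no Authorization header is needed. Authenticated PUT/DELETE calls would need a signed Authorization header on top of this.

  # Rough sketch of the RESTful surface. Assumptions: the
  # commondatastorage.googleapis.com endpoint, a bucket called "gs2010"
  # (as in the gsutil example above) with a public-read ACL, and a
  # hypothetical object "notes.txt".
  import requests

  BASE = "http://commondatastorage.googleapis.com"

  # HEAD: fetch the object's metadata (size, type) without the body.
  meta = requests.head(f"{BASE}/gs2010/notes.txt")
  print(meta.status_code, meta.headers.get("Content-Length"))

  # GET: download the object itself.
  obj = requests.get(f"{BASE}/gs2010/notes.txt")
  with open("notes.txt", "wb") as f:
      f.write(obj.content)

  # GET on the bucket URI returns an XML listing of the objects in it.
  listing = requests.get(f"{BASE}/gs2010")
  print(listing.text[:200])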

A few other notes:

  • It’s not clear how this plays with the “storage service” Google currently provides for Gmail/Docs. From what I heard, it is not related to that storage service at all and there are no plans to integrate the two.
  • The service is free during the beta period to all developers who get access to it, but when it’s released it will follow a pricing model similar to others in the industry. The pricing model is already published on their website.
  • The speakers and the product managers didn’t comment on whether storage access from Google App Engine would be charged (or at what rate).
  • They do provide MD5 signatures as a way of verifying that an object on the client is the same as the object on the server, but MD5 is not used for storing the files themselves (so MD5 collisions shouldn’t be a problem). A quick verification sketch follows this list.
  • The US Navy is already using this service, with about 80TB of data on Google Storage, and from what I heard they seemed pretty happy talking about it.
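Here is roughly what that client-side MD5 check looks like. I’m assuming the digest is exposed as an ETag-style response header, which was not spelled out in the talk; the local path is borrowed from the gsutil example above.

  # Hedged sketch: compare a local file's MD5 with the digest the server
  # reports for the object. The ETag header carrying the MD5 is an
  # assumption; bucket, object and path names are illustrative.
  import hashlib
  import requests

  url = "http://commondatastorage.googleapis.com/gs2010/notes.txt"
  server_md5 = requests.head(url).headers.get("ETag", "").strip('"')

  md5 = hashlib.md5()
  with open("/home/rkt/gs2010/notes.txt", "rb") as f:
      for chunk in iter(lambda: f.read(1 << 20), b""):
          md5.update(chunk)

  print("match" if md5.hexdigest() == server_md5 else "mismatch")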

I suspect this product will be in beta for a while before they release it out in the open.

Hive @Facebook

Hive is a data warehouse infrastructure built over Hadoop. It provides tools to enable easy data ETL, a mechanism to put structure on the data, and the capability to query and analyze large data sets stored in Hadoop files. Hive defines a simple SQL-like query language, called QL, that enables users familiar with SQL to query the data. At the same time, the language also allows programmers who are familiar with the MapReduce framework to plug in their custom mappers and reducers to perform more sophisticated analysis that may not be supported by the built-in capabilities of the language.
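As a rough illustration of that plug-in point: Hive can pipe rows through an external script with its TRANSFORM ... USING clause, so a custom mapper is just an ordinary streaming program that reads tab-separated rows from stdin and writes rows to stdout. The table and column names below are hypothetical; on the Hive side it would be wired in with something like SELECT TRANSFORM(url) USING 'python url_host_mapper.py' AS (host) FROM raw_logs.

  #!/usr/bin/env python
  # url_host_mapper.py -- a hypothetical custom mapper for Hive's TRANSFORM
  # clause. Hive feeds the selected columns to stdin as tab-separated lines;
  # whatever is written to stdout becomes the output rows.
  import sys
  from urllib.parse import urlparse

  for line in sys.stdin:
      url = line.strip()
      if not url:
          continue
      # Emit just the host portion of the URL, one value per output row.
      print(urlparse(url).hostname or "")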

At a user group meeting, Ashish Thusoo from Facebook’s data team spoke about how Facebook uses Hive for its data processing needs.

Problem

Facebook is a free service and has been experiencing rapid growth in the last few years. The amount of data it collects, which used to be around 200GB per day in March 2008, has grown to 15TB per day today. Facebook realized early on that insights derived from simple algorithms on more data are better than insights from complex algorithms on a smaller set of data.

But the traditional approach towards ETL on proprietary storage systems was not only getting expensive to maintain, it was also limited in the size it could scale to. This is when they started experimenting with Hadoop.

How Hadoop gave birth to Hive

Hadoop turned out to be superior in availability, scalability and manageability. Its efficiency wasn’t that great, but one could get more throughput by throwing more cheap hardware at it. Ashish pointed out that although, at that point, partial availability, resilience and scale were more important than ACID guarantees, they had a hard time finding Hadoop programmers within Facebook to make use of the cluster.

It was this that eventually forced Facebook to build a new way of querying data from Hadoop that doesn’t require writing map-reduce jobs in Java. That quickly led to the development of Hive, which does exactly what it set out to do. Let’s look at a couple of examples of Hive queries.

  hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(1) WHERE a.foo > 0 GROUP BY a.bar;
  hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;

Hive’s long-term goal was to develop a system for managing and querying structured data built on top of Hadoop. To do that, it uses map-reduce for execution and HDFS for storage. They modeled the language on SQL, and designed it to be extensible, interoperable and able to outperform traditional processing mechanisms.

How it is used

Facebook has a production Hive cluster which is primarily used for log summarization, including aggregation of impressions, click counts and statistics around user engagement. They have a separate cluster for “Ad hoc analysis” which is free for all/most Facebook employees to use. And over time they figured out how to use it for spam detection, ad optimization and a host of other undocumented stuff.

Facebook Hive/Hadoop statistics

The Scribe/Hadoop cluster at Facebook has about 50 nodes today and processes about 25TB of raw data. About 99% of its data is available for use within 20 seconds. The Hive/Hadoop cluster, where most of the data processing happens, has about 8,400 cores with roughly 12.5 PB of raw storage, which translates to 4 PB of usable storage after replication. Each node in the cluster is an 8-core server with 12TB of storage.

All in all, Facebook takes in about 12 TB of compressed new data and scans about 135 TB of compressed data per day. More than 7,500 Hive jobs run each day, using up about 80,000 compute hours.
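Those figures hang together. A quick back-of-the-envelope check, assuming Hadoop’s default 3x replication (which was not stated explicitly), gives roughly the same numbers:

  # Sanity check of the cluster figures above; the 3x replication factor
  # is my assumption (Hadoop's default), not something Ashish stated.
  cores, cores_per_node, tb_per_node = 8400, 8, 12

  nodes = cores // cores_per_node        # 1050 nodes
  raw_pb = nodes * tb_per_node / 1000    # ~12.6 PB of raw storage
  usable_pb = raw_pb / 3                 # ~4.2 PB usable after replication

  print(nodes, round(raw_pb, 1), round(usable_pb, 1))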

Amazon launches CloudFront

Update as of Feb 28th, 2009: Contrary to my initial speculation, Amazon CloudFront is nothing like Akamai WAA. This is very depressing to me as an Akamai/WAA customer… I’m sure folks at Akamai don’t share this opinion. CloudFront seems to be a glorified S3 solution that is mostly useful for static (non-dynamic) content.

————-

Amazon has finally opened the doors of its new CDN (Content Delivery Network), called CloudFront. But instead of building a completely new product, it has, interestingly, expanded its S3 network to include content replication for lower-latency content delivery. By not reinventing a whole new way of uploading data to the CDN, Amazon has seriously cut down the cost for end users to try out this technology.
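In practice that means a developer keeps uploading to an S3 bucket exactly as before, and CloudFront simply adds an edge-served domain in front of it. The bucket and distribution domain below are hypothetical placeholders; this is only a sketch of the end-user view, not Amazon’s setup API.

  # Hedged sketch: the same object uploaded to S3 is also served through a
  # CloudFront distribution domain. Both the bucket name and the
  # d1234example.cloudfront.net domain are hypothetical placeholders.
  import requests

  via_s3   = requests.get("http://mybucket.s3.amazonaws.com/img/logo.png")
  via_edge = requests.get("http://d1234example.cloudfront.net/img/logo.png")

  # The edge copy should be byte-for-byte the same content, just delivered
  # from a location closer to the client.
  assert via_s3.content == via_edge.content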

Most of the CDNs I’ve investigated do very well with static content that needs to be periodically refreshed somehow.

There is at least one service from Akamai called WAA (Web Application Accelerator) which seems to understand the importance of accelerating extremely dynamic content using intelligent routing and points of presence closer to the end user. WAA doesn’t put the content closer to the end user, but provides an extremely efficient conduit for this traffic, where Akamai controls both ends of the network by placing a POP in front of both the client and the server. By doing this, Akamai can take control of TCP/IP window sizes within its network and provide a lower-latency, higher-bandwidth response to the customer. In addition to all this, Akamai also provides an option to cache some data (as defined in the HTTP headers, or the WAA configuration) for a longer duration.

Though Amazon might be doing replication as well, it may be closer to Akamai’s WAA model than you might think. It’s kind of obvious that if the data is going to change all the time, there has to be some kind of master-slave concept, and it’s also clear that if many people are accessing that data around the world, it has to be transported through a very efficient, high-bandwidth network to the various Amazon points of presence around the world. And finally, just like Akamai’s WAA model, it probably caches content and serves it directly from its local cache in case the object hasn’t changed on the master since the last time it was retrieved.
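If the edge locations do revalidate against the master the way I’m guessing, the mechanism is presumably just a standard HTTP conditional request; here is a minimal sketch of that exchange (the URL is a placeholder).

  # Hedged sketch of cache revalidation with a conditional GET. Whether
  # CloudFront uses exactly this is my speculation; the URL is a placeholder.
  import requests

  url = "http://d1234example.cloudfront.net/img/logo.png"

  first = requests.get(url)
  etag = first.headers.get("ETag")

  # Later: ask for the object only if it has changed. An unchanged object
  # comes back as 304 Not Modified with no body, so the cached copy is reused.
  check = requests.get(url, headers={"If-None-Match": etag})
  print(check.status_code)   # 304 if unchanged, 200 with a fresh body otherwise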

A month ago I went shopping, looking for alternatives to Akamai’s WAA, and didn’t find any. I suspect CloudFront changes that a little bit. One significant difference between Akamai’s WAA and most CDNs out there, including CloudFront, is that relatively little work needs to be done by the developer to integrate with WAA. This is not true with most CDNs, and certainly not true for CloudFront if you are not already on S3. But CloudFront does change the dynamics of this industry.

Scalable products: KFS released

Kosmix, a search startup, has released the source to a C++ implementation of something that looks like a clustered file system. It looks very similar to Hadoop/HDFS, but the C++ factor will be a big performance boost.

From the Skrenta blog:

    • Incremental scalability – New chunkserver nodes can be added as storage needs increase; the system automatically adapts to the new nodes.
    • Availability – Replication is used to provide availability due to chunk server failures.
    • Re-balancing – Periodically, the meta-server may rebalance the chunks amongst chunkservers. This is done to help with balancing disk space utilization amongst nodes.
    • Data integrity – To handle disk corruptions to data blocks, data blocks are checksummed. Checksum verification is done on each read; whenever there is a checksum mismatch, re-replication is used to recover the corrupted chunk.
    • Client side fail-over – During reads, if the client library determines that the chunkserver it is communicating with is unreachable, the client library will fail-over to another chunkserver and continue the read. This fail-over is transparent to the application.
    • Language support – KFS client library can be accessed from C++, Java, and Python.
    • FUSE support on Linux – By mounting KFS via FUSE, this support allows existing Linux utilities (such as, ls) to interface with KFS.
    • Leases – KFS client library uses caching to improve performance. Leases are used to support cache consistency.
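To make the data-integrity and fail-over items above a little more concrete, here is a purely conceptual sketch of what such a read path looks like. This is not the KFS client API; every name in it is hypothetical.

  # Conceptual sketch only: verify a checksum on every read and fail over
  # across chunk replicas, roughly as the feature list above describes.
  # `fetch` stands in for a hypothetical transport call to one chunkserver.
  import zlib

  def read_chunk(chunk_id, replicas, fetch):
      for server in replicas:
          try:
              data, expected_crc = fetch(server, chunk_id)
          except ConnectionError:
              continue              # chunkserver unreachable: try the next replica
          if zlib.crc32(data) == expected_crc:
              return data           # checksum verified, hand the data back
          # checksum mismatch: treat this copy as corrupted and keep going
      raise IOError(f"no healthy replica found for chunk {chunk_id}")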

If anyone has experience with KFS, or has more information please leave a comment here.