Google Storage: What it really is…

Yesterday Google formally announced Google Storage to a few (5000?) of us at Google I/O. Here is the gist of this as I see it from the various discussions/talks I attended.

To begin with, I have to point out that there is almost nothing new in what Google has proposed to provide. Amazon has been doing this for years with S3. The key difference is that if you are a Google customer, you won’t have to look elsewhere for a storage service like this one.

Let's get the technical details out of the way first:

  • It tries to implement a strong consistency model (the C and A of CAP: Consistent and Available), which means the data you store is automatically replicated in a consistent way across multiple datacenters
    • Currently it replicates to multiple locations within the US. In the future, Google does plan to replicate across continents.
    • Currently there are no controls over how or where replication happens. They plan to learn from usage during the beta period and develop controls over time.
  • There are two basic building blocks:
    • Buckets – Containers
        All objects are stored in a flat container. However, the tools understand “/” and “*” (wildcards) and do the right thing when used correctly
    • Objects – objects/files inside those containers
  • Implements RESTful APIs (GET/PUT/POST/DELETE/HEAD, etc.); a rough sketch of such a request appears after this list
    • All resources are identified by a URI
  • There is no theoretical size limit on buckets. However, a 100GB per-account limit will be imposed during the beta phase.
  • It is, of course, built on Google's well-tested, scalable, highly available infrastructure
  • It provides multiple, flexible authentication and sharing models
    • Does support standard public/private key based auth
    • Will also integrate with some kind of groups feature, which will allow objects to be shared with, or controlled by, multiple identities.
    • ACLs can be applied to both Buckets and Objects
      • Buckets
        • Control who can list objects
        • Who can create/delete objects
        • Who can read/write into the bucket
      • Objects
        • Who can read
        • Who can read/write
  • Tools
    • There were two tools mentioned during the talk
      • GS Manager looks like a web application that allows an admin to manage the service
      • gsutil is more like the command-line tools AWS provides for S3.
        • As I mentioned before, gsutil accepts wildcards
          • So something like this is possible
            • gsutil cp gs://gs2010/*  /home/rkt/gs2010
  • The service was created with “data liberation” as one of the goals. As the previous command shows, it takes just one line to transfer all of your data out.
  • A resume feature (for when the connection breaks during a big upload) is not available yet, but that's on the roadmap.
  • The groups feature was discussed a lot, but it's not ready in the current release
  • Versioning is not available. It wasn't clear whether it's on the roadmap or how long before it's implemented.
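
Here is the request sketch mentioned in the RESTful API bullet above. It is a minimal, hypothetical illustration only: the endpoint hostname, object name, and the exact format of the Authorization header are placeholders I made up, not the documented values.

import urllib.request

BUCKET = "gs2010"          # bucket name borrowed from the gsutil example above
OBJECT = "report.csv"      # hypothetical object name
URL = "https://storage.example.com/%s/%s" % (BUCKET, OBJECT)   # placeholder endpoint

request = urllib.request.Request(URL)
request.add_header("Authorization", "GOOG1 <access-key>:<signature>")   # placeholder auth header
with urllib.request.urlopen(request) as response:
    body = response.read()
    print(response.status, len(body), "bytes")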

A few other notes.

  • It's not clear how this plays with the “storage service” Google currently provides for Gmail/Docs. From what I heard, this is not related to that storage service at all, and there are no plans to integrate them.
  • The service is free during the beta period to all developers who get access to it, but when it's released it will follow a pricing model similar to others in the industry. The pricing model is already published on their website.
  • The speakers and the product managers didn't comment on whether storage access from Google App Engine would be charged (or at what rate)
  • They do provide MD5 signatures as a way of verifying that an object on the client is the same as the object on the server, but MD5 is not used for storing the files themselves (so the MD5 collision issue shouldn't be a problem). A small verification sketch follows this list.
  • The US Navy is already using this service, with about 80TB of data on Google Storage, and from what I heard they sounded pretty happy talking about it.
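
Here is the small verification sketch referenced above, assuming you already have the MD5 value the service reports for an object; the file path and the placeholder server value are mine, not part of the service's API.

import hashlib

def local_md5(path, chunk_size=1 << 20):
    # Compute the MD5 of a local file without reading it all into memory.
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

server_md5 = "..."   # MD5 reported by the service for the object (placeholder)
if local_md5("/home/rkt/gs2010/report.csv") == server_md5:
    print("object verified")
else:
    print("object differs from what the server reported")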

I suspect this product will be in beta for a while before they release it out in the open.

You don’t have to be Google to use NoSQL

Ted Dziuba has a post titled “I can’t wait for NoSQL to Die”. The basic argument he makes is that one has to be at Google's scale to really benefit from NoSQL. I think he is missing the point.

Here are my observations.

  • This is similar to the argument the traditional DB vendors were making when companies started switching away from the likes of Oracle/DB2 to MySQL. The difference between then and now is that before it was large, established database vendors against the smaller (open-source) ones, and now it's RDBMS vs non-RDBMS datastores.
  • Why NoSQL: The biggest difference between an RDBMS and a NoSQL datastore is the fact that NoSQL datastores have no pre-defined schemas. That doesn't mean developers don't have to think about the data structure before using a NoSQL solution, but it does give them the opportunity to add new columns, which were not thought of at design time, with little or no impact on the applications using the data. You can add and remove columns on the fly on most RDBMSs as well, but those changes are usually considered significant. Also keep in mind that while NoSQL datastores can add columns at the row level, RDBMS solutions can only do it at the table level (see the short sketch after this list).
  • Scalability: There are basically two traditional ways to scale any web application.
    • The first way is to build the app and leave the scalability issues for later (let the DBAs figure them out). This is an expensive, iterative process which takes time to perfect. The issues around scalability and availability can be so complex that one may not be able to predict all of them until the system is used in production.
    • The second way is to train the programmers to architect the database so that it can scale better once it hits production. There is a significant upfront cost, but it pays off over time.
    • NoSQL is the third way of doing it.
      • It restricts programmers by allowing only those operations and data structures which can scale
      • And programmers who figure out how to work within those restrictions have found that they enable significantly higher horizontal scalability than a traditional RDBMS.
      • By forcing the database to be architected before the product launches, it also reduces outages and post-deployment migrations.
  • High Availability: NoSQL is not just about scalability. It's also about “high availability” at lower cost.
    • While Ted did mention that some operations in Cassandra require a restart, he forgot to mention that it doesn't require all the nodes to be restarted at the same time. The Cassandra datastore continues to be available even when many of its nodes are down. This is a common theme across most NoSQL datastores. [CASSANDRA-44]
    • High availability over long distances with flaky network connections is not trivial to implement using traditional RDBMS-based databases.
  • You don't have to be Google to see the benefits of using NoSQL.
    • If you are using S3 or SimpleDB on AWS, or the datastore on Google's App Engine, then you are already using NoSQL. Many of the smaller startups are actually finding AWS/GAE to be cheaper than hosting their own servers.
      • One can still choose an RDBMS solution like RDS, but then they don't get the high availability and scalability that S3/SimpleDB offer out of the box.
    • While scalability to terabytes may not be a requirement for many smaller organizations, high availability is absolutely essential for most organizations today. RDBMS-based solutions can do this, but setting up multi-master replication across two datacenters is non-trivial.
  • Migration from RDBMS to NoSQL is not simple: I think Ted is right that not everyone will succeed in cutting over from the RDBMS to the non-RDBMS world in one weekend. Reports of websites switching over to NoSQL overnight are sometimes grossly exaggerated. Most of these companies have been working on the move for months, if not years, and they run extensive scalability, performance, availability, and disaster-recovery tests before putting it in production.
  • RDBMS is not going anywhere: I also agree with Ted that RDBMS is not going anywhere anytime soon, especially in organizations which are already using it. In fact, most NoSQL datastores still haven't figured out how to implement the level of security a traditional RDBMS provides. I think that's the core reason why Google still uses an RDBMS for some of its operational needs.
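
Here is the short sketch referenced above: a toy, datastore-agnostic illustration (plain Python dictionaries with invented row keys) of row-level columns versus a table-wide schema change.

# Each "row" carries only the columns it actually has.
rows = {
    "user:1001": {"name": "alice", "email": "alice@example.com"},
    "user:1002": {"name": "bob"},                  # no email column at all
}

# Adding a column later touches only the rows that need it:
rows["user:1002"]["last_login"] = "2010-05-20"

# The RDBMS equivalent is a table-wide schema change, e.g.:
#   ALTER TABLE users ADD COLUMN last_login DATETIME;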

Finally, it's my personal opinion that “cloud computing” and the commoditization of storage and servers were the key catalysts for the launch of so many NoSQL implementations. The ability to control infrastructure with APIs was a huge incentive for developers to build datastores which could scale dynamically as well. While Oracle/MySQL are not going anywhere anytime soon, the “NoSQL” movement is definitely here to stay, and I won't be surprised if it evolves further along the way.

 

References

  1. Haters Gonna Hate
  2. Reddit: Learning from mistakes
  3. Digg: Saying yes to NoSQL; Going steady with Cassandra
  4. Twitter (2009/07): Up and running with Cassandra
  5. Twitter (2010/03): Ryan King on Twitter and Cassandra
  6. NoSQL vs RDBMS: Let the flames begin!
  7. Brewer's CAP theorem on distributed systems
  8. Database scalability
  9. What is scalability?
  10. Thoughts on NoSQL

Cassandra as a communication medium – a service registry and discovery tool

A few weeks ago, while I was mulling over what kind of service registry/discovery system to use for a scalable application deployment platform, I realized that for mid-size organizations with a complex set of services, building one from scratch may be the only option.

I also found out that many AWS/EC2 customers have already been using S3 and SimpleDB to  publish/discover services. That discussion eventually led me to investigate Cassandra as the service registry datastore in an enterprise network.

Here are some of the observations I made as I played with Cassandra for this purpose. I welcome feedback from readers if you think I'm doing something wrong or if you think I can improve the design further.

  • The biggest issue I noticed with Cassandra was the absence of inverted indexes, which can be worked around as I have blogged here. I later realized there is also something called Lucandra, which I need to look at at some point.
  • The keyspace structure I used was very simple… (I skipped some configuration lines for brevity)

<Keyspace Name="devkeyspace">
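    <!-- My reading of the design: "forward" holds one row per registered service with its attribute columns -->
    <!-- "reverse" holds the application-maintained inverted index discussed above -->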
    <ColumnFamily  CompareWith="UTF8Type" Name="forward"  />
    <ColumnFamily  CompareWith="UTF8Type" Name="reverse"  />
</Keyspace>

  • Using an “OrderPreservingPartitioner” seemed important for doing “range scans”. The order-preserving partitioner keeps objects with similar-looking keys together, which allows bulk reads and writes over key ranges. By default Cassandra distributes objects randomly across the cluster, which balances load well but makes range scans impractical.
  • I eventually plan to use this application across two datacenters. The best way to mirror data across datacenters in Cassandra is to use the “RackAwareStrategy”. Selecting this option tells Cassandra to try to place replicas of each token range in different datacenters/racks. The default algorithm uses IP addresses to determine whether two nodes are part of the same rack/datacenter, but there are other interesting ways to do it as well.
  • Some of the APIs changed significantly between the versions I was playing with. Cassandra developers will remind you that this is expected in a product which is still at version 0.5. What amazes me, however, is the fact that Facebook, Digg and now Twitter have been using this product in production without bringing everything down.
  • I was eventually able to build a thin Java webapp to front-end Cassandra, which provided the REST/JSON interface for the registry/discovery service (a hypothetical client sketch follows this list). This is also the app which managed the inverted indexes.
    • Direct Cassandra access from remote services was disabled for security/stability reasons.
    • The app used DNS to load-balance queries across multiple servers.
  • My initial performance tests on this cluster performed miserably because I forgot that all of my requests were hitting the same node. The right way to test Cassandra's capacity is to load-balance requests across all the Cassandra nodes.
    • I also realized that, by default, the logging level was set to “DEBUG”, which is very verbose. Turning that down sped up response times as well.
  • Playing with different consistency levels for reads and writes was also an interesting experience, especially when I started killing nodes just to see the app break. This is what tweaking CAP is all about.
  • Due to an interesting problem related to “eventual consistency”, Cassandra doesn't immediately purge data which was marked for deletion or intentionally changed. In the default configuration that data is kept around for 10 days before it's completely removed from the system.
  • Some documentation on the core operational aspects of Cassandra exists, but it would be nice if there were more.
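
Here is the hypothetical client sketch mentioned above. The hostname, URL path, and JSON response shape are assumptions I made purely for illustration; the actual webapp's API is not described in this post.

import json
import urllib.request

REGISTRY = "http://registry.example.internal:8080"   # placeholder host for the front-end webapp

def discover(service_name):
    # Ask the front-end for every instance registered under service_name.
    url = "%s/services/%s" % (REGISTRY, service_name)
    with urllib.request.urlopen(url) as response:
        return json.loads(response.read())

for instance in discover("app2"):
    print(instance.get("host"), instance.get("port"))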

Cassandra was designed as a scalable, highly available datastore. But because of its self-healing and “rack-aware” features, it can serve as an interesting communication medium as well.

SimpleDB now allows you to tweak consistency levels

We discussed Brewer's Theorem a few days ago and how it's challenging to obtain consistency, availability and partition tolerance in any distributed system. We also discussed that many distributed datastores allow CAP to be tweaked to attain certain operational goals.

Amazon SimpleDB, which was released as an “eventually consistent” datastore, today launched a few features to do just that.

  • Consistent reads: Select and GetAttributes requests now include an optional Boolean flag “ConsistentRead”, which asks the datastore to return consistent results only. If you have noticed scenarios where a read right after a write returned an old value, that shouldn't happen anymore when the flag is set.
  • Conditional put/puts, delete/deletes: By providing “conditions” in the form of a key/value pair, SimpleDB can now conditionally execute or discard an operation. This might look like a minor feature, but it can go a long way toward providing reliable datastore operations (a rough sketch of both features follows this list).
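
As a rough illustration, here is a minimal sketch using the boto library. The domain and item names are hypothetical, and the keyword-argument names (consistent_read, expected_value) are from memory and may vary across boto versions, so treat them as assumptions rather than a definitive API reference.

import boto

sdb = boto.connect_sdb()                 # credentials picked up from the environment
domain = sdb.get_domain("orders")        # hypothetical SimpleDB domain

# Consistent read: ask SimpleDB to return the latest committed value of the item.
item = domain.get_attributes("order-42", consistent_read=True)

# Conditional put: only apply the write if "status" is still "pending".
domain.put_attributes(
    "order-42",
    {"status": "shipped"},
    expected_value=["status", "pending"],
)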

Even though SimpleDB now enables operations that support a stronger consistency model, under the covers SimpleDB remains the same highly-scalable, highly-available, and highly durable structured data store. Even under extreme failure scenarios, such as complete datacenter failures, SimpleDB is architected to continue to operate reliably. However when one of these extreme failure conditions occurs it may be that the stronger consistency options are briefly not available while the software reorganizes itself to ensure that it can provide strong consistency. Under those conditions the default, eventually consistent read will remain available to use.

References

Brewer's CAP Theorem on distributed systems

Large distributed systems run into a problem which smaller systems don't usually have to worry about. “Brewer's CAP Theorem” [Ref 1] [Ref 2] [Ref 3] defines this problem in a very simple way.

It states that though it's desirable to have consistency, high availability and partition tolerance in every system, unfortunately no system can achieve all three at the same time.

Consistent: A fully consistent system is one which can guarantee that once you store a state (let's say “x=y”), it will report the same state in every subsequent operation until the state is explicitly changed by something outside the system. [Example 1] A single MySQL database instance is automatically fully consistent, since there is only one node keeping the state. [Example 2] If two MySQL servers are involved, and the system is designed in such a way that all keys starting with “a” through “m” are kept on server 1 and keys “n” through “z” are kept on server 2, then the system can still easily guarantee consistency. [Example 3] Now let's set up the two DBs as master-master replicas. If one of the databases accepts a “row insert” request, that information has to be committed to the second system before the operation is considered complete. To require 100% consistency in such a replicated environment, communication between nodes is paramount. The overall performance of such a system could drop as the number of replicas required goes up.
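As a toy illustration of [Example 2], here is a tiny sketch of routing keys to one of two servers by their first letter; the server names are placeholders of my own.

def pick_server(key):
    # Keys starting with "a" through "m" live on server 1, "n" through "z" on server 2.
    return "server1" if key[0].lower() <= "m" else "server2"

print(pick_server("apple"))   # -> server1
print(pick_server("zebra"))   # -> server2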

Available: The databases in [Example 1] and [Example 2] are not highly available. In [Example 1], if the node goes down, there would be 100% data loss. In [Example 2], if one node goes down, you will have 50% data loss. [Example 3] is the only setup which solves that problem. A simple MySQL replication setup [multi-master mode] could provide 100% availability. Increasing the number of nodes with copies of the data directly increases the availability of the system. Availability is not just protection from hardware failure; replicas also help in load-balancing concurrent operations, especially read operations. “Slave” MySQL instances are a perfect example of such “replicas”.

Partition-tolerance: So you got consistency and availability by replicating data. Let's say you had the two MySQL servers from [Example 3] in two different datacenters, and you lose the network connectivity between the two datacenters, making both databases incapable of synchronizing state. Would the two DBs be fully functional in such a scenario? If you somehow do manage to allow read/write operations on both databases, it can be shown that the two servers won't be consistent anymore. A banking application which keeps the “state of your account” at all times is a perfect example where it's bad to have inconsistent bank records. If I withdraw 1000 bucks in California, it should be instantly reflected at the NY branch so that the system accurately knows how much more I can withdraw at any given time. If the system fails to do this, it could cause problems that make some customers very unhappy. If the bank decides consistency is very important and disables write operations during the outage, it loses “availability” of the cluster, since all the bank accounts at both branches will be frozen until the network comes back up.

This gets more interesting when you realize that C.A.P. rules don’t have to be applied in an “all or nothing” fashion. Different systems can choose various levels of consistency, availability or partition tolerance to meet their business objective. Increasing the number of replicas, for example, increases high availability but it could at the same time reduce partition tolerance or consistency.

When you are discussing distributed systems (or distributed storage), you should try to identify which of the three rules it is trying to achieve. BigTable, used by Google App Engine, and HBase, which runs over Hadoop, claim to be always consistent and highly available. Amazon's Dynamo, which is used by the S3 service, and datastores like Cassandra instead sacrifice consistency in favor of availability and partition tolerance.

CAP theorem doesn't just apply to databases. Even simple web applications which store state in session objects have this problem. There are many “solutions” out there which allow you to replicate session objects and make session data “highly available”, but they all suffer from the same basic problem, and it's important for you to understand it.

Recognizing which of the “C.A.P.” rules your business really needs should be the first step in building any successful distributed, scalable, highly-available system.

Cassandra: inverted index

Cassandra is the only NoSQL datastore I'm aware of which is a scalable, distributed, self-replicating, eventually consistent, schema-less key-value store running on Java and which doesn't have a single point of failure. HBase could also match most of these requirements, but Cassandra is easier to manage due to its tiny footprint.

The one thing Cassandra doesn't do today is index columns.

Let's take a specific example to explain the problem. Let's say there are 100 rows in the datastore, each with 5 columns. If you want to find the row which says “Service=app2”, you will have to iterate one row at a time, which amounts to a full database scan. In a 100-row datastore where only one row has that particular column value, it would take on average about 50 row reads before you find your data.

[Figure: sample rows, each with columns such as Service and status]

While I'm sure there is a good reason why this doesn't exist yet, the application inserting the data could build such an inverted index itself even today. Here is an example of what an inverted index table would look like.

[Figure: the corresponding inverted index table, keyed by column=value pairs such as Service=app2]

If you want to find the “status” of all rows where “Service=app2”, all you have to do is fetch the list of row keys with a single call to this table. The second call gets all the column values for those rows. Even if you have 100 different rows in a table, finding the particular row matching your search query can now be done in two calls.

Of course there is a penalty you have to pay. Every time you insert one row of data, you also have to insert multiple rows to build the inverted index, and you have to update the inverted index yourself if any of the column values are updated or deleted. Cassandra 0.5.0, which was recently released, has been benchmarked at about 10,000 row inserts per second on a 4-core server with 2GB of RAM. If you have an average of 5 columns per row, that is about 1.5k actual row inserts per second, since each logical insert requires roughly 6 writes (the data row plus the 5 inverted-index rows). For more throughput you always have the option of adding more servers. A small sketch of this read/write path follows below.
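
Here is that small sketch: plain Python dictionaries stand in for the two column families (data rows plus the application-maintained inverted index), and the row keys and column names are purely illustrative.

# data: row key -> {column: value};  index: "column=value" -> set of row keys
data = {}
index = {}

def insert_row(row_key, columns):
    # One data write plus one inverted-index write per column.
    data[row_key] = dict(columns)
    for column, value in columns.items():
        index.setdefault("%s=%s" % (column, value), set()).add(row_key)

def find(column, value):
    # Two lookups instead of a full scan: the index first, then the matching rows.
    keys = index.get("%s=%s" % (column, value), set())
    return {key: data[key] for key in keys}

insert_row("host-17", {"Service": "app2", "status": "up"})
insert_row("host-23", {"Service": "app1", "status": "down"})
print(find("Service", "app2"))    # -> {'host-17': {'Service': 'app2', 'status': 'up'}}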

Facebook and Digg are both extensively using Cassandra in their architectures. Here are some interesting reading materials on Cassandra if you’d like to explore more.

[Updated: Discussion on Google Buzz ]