You don’t have to be Google to use NoSQL

Ted Dziuba has a post about “I can’t wait for NoSQL to Die”. The basic argument he makes is that one has to be at the size Google is to really benefit from NoSQL. I think he is missing the point. nosql

Here are my observations.

  • This is similar to the argument the traditional DB vendors were making when companies started switching away from the likes of Oracle/DB2 to MySQL. The difference between then and now is that before it was Large established databases vendors against the smaller (open-source) ones, and now its RDBMS vs non-RDBMS datastores.
  • Why NoSQL: The biggest difference between an RDBMS and a NoSQL datastore is the fact that NoSQL datastructures have no pre-defined schemas. That doesn’t mean that the developers don’t have to think about the data structure before using a NoSQL solution, but it does provide the opportunity to developers to add new columns which were not thought of at design time with little or no impact on applications using it. You can add and remove columns on the fly on most RDBMS as well, but those changes are usually considered significant. Also keep in mind that while NoSQL datastores could add columns at the row level, RDBMS solutions can only do it at the table level.
  • Scalability: There are basically two ways to scale any web application.
    • The first way is to build the app and leave the scalability issues for later (let the DBAs to figure out). This is an expensive iterative process which takes time to perfect. The issues around scalability and availability could be so complex that one may not be able to predict all the issues until they get used in production.
    • The second way is to train the programmers to architect the database so that it can scale better once it hits production. There is a significant upfront cost, but it pays over time.
    • NoSQL is the third way of doing it.
      • It restricts programmers by allowing only those operations and data-structures which can scale
      • And programmers who manage to figure out how to use it, have found that the these kind of restrictions guarantee significantly higher horizontal scalability than traditional RDBMS.
      • By architecting databases before the product is launched, it also reduces the amount of outage and post-deployment migrations.
  • High Availability: NoSQL is not just about scalability. Its also about “high-availability” at a cheaper cost.
    • While Ted did mention that some of the operations in Cassandra requires a restart, he forgot to mention that it doesn’t require all the nodes to be restarted at the same time. The cassandra datastore continues to be available even without many of its nodes. This is a common theme across most of the NoSQL based datastores. [CASSANDRA-44]
    • High availability over long distances with flaky network connection is not trivial to implement using traditional RDBMS based databases.
  • You don’t have to be Google to see benefits of using NoSQL.
    • If you are using S3 or SimpleDB on AWS or using datastores on Google’s Appengine then you are already using NoSQL. Many of the smaller startups are actually finding AWS/GAE to be cheaper than hosting their own servers.
      • One can still chose to use RDS like RDBMS solution, but they don’t get the benefit of high-availability and scalability which S3/SimpleDB offers out-of-the-box. 
    • While scalability to terabytes may not be a requirement for many of the smaller organizations, high availability is absolutely essential for most organizations today. RDBMS based solutions can do that, but setting up multi-master replication across two datacenters is non-trivial
  • Migration from RDBMS to NoSQL is not simple: I think Ted is right that not everyone will have success in cutting over from RDBMS to non-RDBMS world in one weekend. The reports of websites switching over to NoSQL overnight is sometimes grossly exaggerated. Most of these companies have been working on this for months if not years. And they would do extensive scalability, performance, availability and disaster-recovery tests before they put it in production.
  • RDBMS is not going anywhere: I also agree with Ted that RDBMS is not going anywhere anytime soon. Especially in organizations which are already using it. In fact most NoSQL datastores still haven’t figured out how to implement the level of security traditional RDBMS provide. I think thats the core reason why Google is still using it for some of its operational needs.

Finally, its my personal opinion that “Cloud computing” and commoditization of storage and servers were the key catalysts for the launch of so many NoSQL implementations. The ability to control infrastructure with APIs was a huge incentive for the developers to develop datastores which could scale dynamically as well. While Oracle/MySQL are not going anywhere anytime soon, “NoSQL” movement is definitely here to stay and I won’t be surprised if it evolves more on the way.

 

References

  1. Haters Gonna Hate
  2. Reddit: learning from mistakes
  3. Digg: Saying yes to NoSQL; Going steady with Cassandra
  4. Twitter @ 2009/07 : Up and running with cassandra
  5. Twitter @ 2010/03 : Ryan King about Twitter and Cassandra
  6. NoSQL vs RDBMS: Let the flames begin !
  7. Brewer’s CAP theorem on Distributed systems
  8. Database scalability
  9. What is scalability ?
  10. Thoughts on NoSQL

Scalability links for March 20th 2010 – Lots of datastore related items

Spanner: Google’s next Massive Storage and Computation infrastructure

MapReduce, Bigtable and Pregel have their origins in Google and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on, which was mentioned briefly (may be un-intentionally) at an event last year.

Instead of caching data closer to user, it looks like Google is trying to take “the data” to the user. If you use GMail or a Google Doc service, then with this framework, Google could, auto-magically, “move” one of the master copies of your data to the nearest Google data center without really having to cache anything locally. And because they are building one single datastore cluster around the world, instead of building hundreds of smaller ones for different applications, it looks like they may not don’t need dedicated clusters for specific projects anymore.

Below is the gist of “Spanner” from a talk by Jeff Dean at Symposium held at Cornell. Take a look at the rest of the slides if you are interested in some impressive statistics on hardware performance and reliability.

  • Spanner: Storage & computation system that spans all our datacenters
    • Single global namespace
      • Names are independent of location(s) of data
      • Similarities to Bigtable: table, families, locality groups, coprocessors,…
      • Differences: hierarchical directories instead of rows, fine-grained replication
      • Fine-grained ACLs, replication configuration at the per-directory level
    • support mix of strong and weak consistency across datacenters
      • Strong consistency implemented with Paxos across tablet replicas
      • Full support for distributed transactions across directories/machines
    • much more automated operation
      • System automatically moves and adds replicas of data and computation based on constraints and usage patterns
      • Automated allocation of resources across entire fleet of machines.

image

 
References

HBase at Adobe

Cosmin Lehene, has a wonderful pair of posts about how a team in Adobe selected, tested and implemented an HBase based datastore for production use.

Its interesting how much they thought about failure and about the backup for backups. And in spite of all that how things still break. Building something new based on cutting-edge technology is not for the faint hearted. It needs to be supported by the organization, lead by those who can see the future, and backed by a team of experts who are always ready for challenges.