You don’t have to be Google to use NoSQL

Ted Dziuba has a post titled “I can’t wait for NoSQL to Die”. His basic argument is that you have to be the size of Google to really benefit from NoSQL. I think he is missing the point.

Here are my observations.

  • This is similar to the argument the traditional DB vendors made when companies started switching away from the likes of Oracle/DB2 to MySQL. The difference between then and now is that back then it was large, established database vendors against the smaller (open-source) ones, and now it’s RDBMS vs. non-RDBMS datastores.
  • Why NoSQL: The biggest difference between an RDBMS and a NoSQL datastore is that NoSQL datastores have no predefined schemas. That doesn’t mean developers don’t have to think about the data structure before using a NoSQL solution, but it does give them the opportunity to add new columns that were not thought of at design time, with little or no impact on the applications using the data. You can add and remove columns on the fly on most RDBMSs as well, but those changes are usually considered significant. Also keep in mind that while NoSQL datastores can add columns at the row level, RDBMS solutions can only do it at the table level.
  • Scalability: There are basically two ways to scale any web application.
    • The first way is to build the app and leave the scalability issues for later (let the DBAs figure it out). This is an expensive, iterative process which takes time to perfect. The issues around scalability and availability can be so complex that one may not be able to predict all of them until the app is used in production.
    • The second way is to train the programmers to architect the database so that it can scale better once it hits production. There is a significant upfront cost, but it pays off over time.
    • NoSQL is the third way of doing it.
      • It restricts programmers by allowing only those operations and data structures which can scale.
      • Programmers who manage to figure out how to use it have found that these kinds of restrictions deliver significantly higher horizontal scalability than a traditional RDBMS.
      • Because the database is architected before the product is launched, it also reduces the number of outages and post-deployment migrations.
  • High Availability: NoSQL is not just about scalability. It’s also about “high availability” at a lower cost.
    • While Ted did mention that some operations in Cassandra require a restart, he forgot to mention that they don’t require all the nodes to be restarted at the same time. The Cassandra datastore continues to be available even without many of its nodes. This is a common theme across most NoSQL datastores. [CASSANDRA-44]
    • High availability over long distances with flaky network connections is not trivial to implement using a traditional RDBMS.
  • You don’t have to be Google to see benefits of using NoSQL.
    • If you are using S3 or SimpleDB on AWS, or the datastore on Google’s App Engine, then you are already using NoSQL. Many of the smaller startups are actually finding AWS/GAE to be cheaper than hosting their own servers.
      • One can still choose an RDBMS solution such as RDS, but it doesn’t come with the high availability and scalability which S3/SimpleDB offer out of the box.
    • While scalability to terabytes may not be a requirement for many of the smaller organizations, high availability is absolutely essential for most organizations today. RDBMS-based solutions can do that, but setting up multi-master replication across two datacenters is non-trivial.
  • Migration from RDBMS to NoSQL is not simple: I think Ted is right that not everyone will succeed in cutting over from the RDBMS to the non-RDBMS world in one weekend. The reports of websites switching over to NoSQL overnight are sometimes grossly exaggerated. Most of these companies have been working on this for months, if not years, and they run extensive scalability, performance, availability and disaster-recovery tests before they put it in production.
  • RDBMS is not going anywhere: I also agree with Ted that RDBMS is not going anywhere anytime soon, especially in organizations which are already using it. In fact, most NoSQL datastores still haven’t figured out how to implement the level of security a traditional RDBMS provides. I think that’s the core reason why Google is still using RDBMSs for some of its operational needs.
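
To make the row-level vs. table-level point from the “Why NoSQL” bullet a little more concrete, here is a minimal Python sketch. Both classes (`TableStore`, `RowStore`) and the sample data are hypothetical in-memory stand-ins I made up for illustration; they are not the API of any real RDBMS or NoSQL datastore.

```python
# Hypothetical in-memory stand-ins, not any real datastore's API.


class TableStore:
    """RDBMS-like: one fixed column set shared by every row."""

    def __init__(self, columns):
        self.columns = list(columns)
        self.rows = []

    def insert(self, **values):
        # Columns that are not part of the table schema are rejected.
        unknown = set(values) - set(self.columns)
        if unknown:
            raise ValueError(f"unknown columns: {sorted(unknown)}")
        self.rows.append({c: values.get(c) for c in self.columns})

    def add_column(self, name):
        # A table-level change: every existing row gains the column.
        self.columns.append(name)
        for row in self.rows:
            row[name] = None


class RowStore:
    """NoSQL-like: each row key maps to its own set of columns."""

    def __init__(self):
        self.rows = {}

    def put(self, key, **columns):
        # Columns are created per row; other rows are not affected.
        self.rows.setdefault(key, {}).update(columns)

    def get(self, key):
        return self.rows.get(key, {})


users = RowStore()
users.put("alice", email="alice@example.com")
users.put("bob", email="bob@example.com", twitter="@bob")  # new column, this row only

table = TableStore(["name", "email"])
table.insert(name="alice", email="alice@example.com")
table.add_column("twitter")  # schema change applies to the whole table
```

In the `RowStore` case only bob's row knows about `twitter`; in the `TableStore` case the column has to be added for everyone before any row can use it.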

Finally, it’s my personal opinion that “cloud computing” and the commoditization of storage and servers were the key catalysts for the launch of so many NoSQL implementations. The ability to control infrastructure with APIs was a huge incentive for developers to build datastores which could scale dynamically as well. While Oracle/MySQL are not going anywhere anytime soon, the “NoSQL” movement is definitely here to stay, and I won’t be surprised if it evolves further along the way.
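
As a rough illustration of what “scale dynamically” can mean at the data layer, here is a small sketch of consistent hashing, the general partitioning idea used by Dynamo-style datastores such as Cassandra. This is a simplified sketch under my own assumptions, not Cassandra’s actual implementation: the `HashRing` class, the node names and the choice of 64 virtual nodes per server are all made up for the example.

```python
import bisect
import hashlib


def _token(value: str) -> int:
    # Map a string to a large integer position on the ring.
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Simplified consistent-hash ring (illustrative, hypothetical class)."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes  # virtual nodes per server, an assumed value
        self._ring = []       # sorted list of (token, node) pairs
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each server owns several positions so load spreads more evenly.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_token(f"{node}#{i}"), node))

    def node_for(self, key):
        if not self._ring:
            raise LookupError("ring has no nodes")
        # The first ring position at or after the key's token owns the key,
        # wrapping around to the start of the ring if necessary.
        idx = bisect.bisect_right(self._ring, (_token(key), ""))
        return self._ring[idx % len(self._ring)][1]


keys = [f"user:{i}" for i in range(10_000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out by one server
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved) / len(keys):.0%} of keys moved to the new node")
```

The point of the design is that adding a server only moves the keys the new server takes over; everything else stays where it was, which is what makes adding nodes through an API call practical.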

 

Comments

henchan said…
A further consideration is that NoSQL's absence of an explicit schema can be of considerable benefit for certain kinds of application in which the heterogeneity of data sources is important.
A data store that requires a schema, such as an RDBMS, imposes functional constraints upon application designers which can sometimes be undesired. Such limitations are an artefact of a top-down data architecture. While they may be desirable to 'real' businesses such as Walmart that have sufficient market clout to impose EDI-like design constraints on their suppliers (and the power relation that entails), there is an emerging class of applications that are attempting to facilitate cross-organisation communication in such a manner that the existence of a 'Master' schema is not necessarily built in to the design.

Whether or not any of these applications will be successful remains to be seen. Yet the original post smells so much like FUD. In wishing failure upon an emerging technology class even before it has had a chance to prove itself, Ted Dziuba demonstrates that he has not yet absorbed the lessons of the Innovator's Dilemma. Tomorrow's successful businesses need not resemble today's. With hindsight, our path to the future will have looked just like children playing: neither tidying up after themselves nor porting the utility scripts.
code43 said…
Also, aside from the issue of mega-scalability, one could adopt NoSQL for the convenience of the key-value system. Having schema-less data allows for agile and flexible project development -- plus it's very simple to implement. See, for example, the Python module called y_serial which includes a tutorial at http://yserial.sourceforge.net/

Python y_serial module -- should take less than 10 minutes to get started -- for those wishing to avoid setting up server daemons and writing SQL statements -- instead, to concentrate on Python code with persistent data structures.
Nice article. I'd just like to add a concrete example of the advantages of NoSQL.

I run a business that has offered "cloud" based services for publishers since 2003. Very basically, we host lots of portals and community sites for publishers which run on our own proprietary software.

One of the big problems with this business has been that we have to split portal/community site content across different database servers. When a portal grows or shrinks significantly, we move its content to a new database server to make sure overall load is spread efficiently.

Together with backing up each individual database server, this presents a very significant management overhead.

If I am able to re-engineer that business (and with new projects in the pipeline now being developed on Cassandra, I hope to), I shall have all those publisher sites running off one big Cassandra cluster.

No more moving sites between database servers, no more poor service when one site gets busy and we haven't got time to move its content to a less loaded database server, no more backing up lots and lots of different databases... (and yes we could do master/slave replication, but then you don't spread writes)

Finally, I know some people have the idea that Google is the only company that encounters scalability issues on RDBMS systems - I can report that in fact it is quite easy for a heavily loaded user forum to max out a single database server.

When this happens, what you are meant to do is go to a master/slave replication setup. Yes, this can work, but if you research the admin costs associated with running this, you will see that a user forum rarely justifies it.

Cassandra presents the possibility of having a single database cluster to which you can very simply add nodes when you need to scale out.

What I think this means is that NoSQL will revolutionize the Web because it *commoditizes* scalability, i.e. you now just add or remove server instances.
