March 27, 2010

You don’t have to be Google to use NoSQL

Ted Dziuba has a post about “I can’t wait for NoSQL to Die”. The basic argument he makes is that one has to be at the size Google is to really benefit from NoSQL. I think he is missing the point. nosql

Here are my observations.

  • This is similar to the argument the traditional DB vendors were making when companies started switching away from the likes of Oracle/DB2 to MySQL. The difference between then and now is that before it was Large established databases vendors against the smaller (open-source) ones, and now its RDBMS vs non-RDBMS datastores.
  • Why NoSQL: The biggest difference between an RDBMS and a NoSQL datastore is the fact that NoSQL datastructures have no pre-defined schemas. That doesn’t mean that the developers don’t have to think about the data structure before using a NoSQL solution, but it does provide the opportunity to developers to add new columns which were not thought of at design time with little or no impact on applications using it. You can add and remove columns on the fly on most RDBMS as well, but those changes are usually considered significant. Also keep in mind that while NoSQL datastores could add columns at the row level, RDBMS solutions can only do it at the table level.
  • Scalability: There are basically two ways to scale any web application.
    • The first way is to build the app and leave the scalability issues for later (let the DBAs to figure out). This is an expensive iterative process which takes time to perfect. The issues around scalability and availability could be so complex that one may not be able to predict all the issues until they get used in production.
    • The second way is to train the programmers to architect the database so that it can scale better once it hits production. There is a significant upfront cost, but it pays over time.
    • NoSQL is the third way of doing it.
      • It restricts programmers by allowing only those operations and data-structures which can scale
      • And programmers who manage to figure out how to use it, have found that the these kind of restrictions guarantee significantly higher horizontal scalability than traditional RDBMS.
      • By architecting databases before the product is launched, it also reduces the amount of outage and post-deployment migrations.
  • High Availability: NoSQL is not just about scalability. Its also about “high-availability” at a cheaper cost.
    • While Ted did mention that some of the operations in Cassandra requires a restart, he forgot to mention that it doesn’t require all the nodes to be restarted at the same time. The cassandra datastore continues to be available even without many of its nodes. This is a common theme across most of the NoSQL based datastores. [CASSANDRA-44]
    • High availability over long distances with flaky network connection is not trivial to implement using traditional RDBMS based databases.
  • You don’t have to be Google to see benefits of using NoSQL.
    • If you are using S3 or SimpleDB on AWS or using datastores on Google’s Appengine then you are already using NoSQL. Many of the smaller startups are actually finding AWS/GAE to be cheaper than hosting their own servers.
      • One can still chose to use RDS like RDBMS solution, but they don’t get the benefit of high-availability and scalability which S3/SimpleDB offers out-of-the-box. 
    • While scalability to terabytes may not be a requirement for many of the smaller organizations, high availability is absolutely essential for most organizations today. RDBMS based solutions can do that, but setting up multi-master replication across two datacenters is non-trivial
  • Migration from RDBMS to NoSQL is not simple: I think Ted is right that not everyone will have success in cutting over from RDBMS to non-RDBMS world in one weekend. The reports of websites switching over to NoSQL overnight is sometimes grossly exaggerated. Most of these companies have been working on this for months if not years. And they would do extensive scalability, performance, availability and disaster-recovery tests before they put it in production.
  • RDBMS is not going anywhere: I also agree with Ted that RDBMS is not going anywhere anytime soon. Especially in organizations which are already using it. In fact most NoSQL datastores still haven’t figured out how to implement the level of security traditional RDBMS provide. I think thats the core reason why Google is still using it for some of its operational needs.

Finally, its my personal opinion that “Cloud computing” and commoditization of storage and servers were the key catalysts for the launch of so many NoSQL implementations. The ability to control infrastructure with APIs was a huge incentive for the developers to develop datastores which could scale dynamically as well. While Oracle/MySQL are not going anywhere anytime soon, “NoSQL” movement is definitely here to stay and I won’t be surprised if it evolves more on the way.



  1. Haters Gonna Hate
  2. Reddit: learning from mistakes
  3. Digg: Saying yes to NoSQL; Going steady with Cassandra
  4. Twitter @ 2009/07 : Up and running with cassandra
  5. Twitter @ 2010/03 : Ryan King about Twitter and Cassandra
  6. NoSQL vs RDBMS: Let the flames begin !
  7. Brewer’s CAP theorem on Distributed systems
  8. Database scalability
  9. What is scalability ?
  10. Thoughts on NoSQL

March 18, 2010

Spanner: Google’s next Massive Storage and Computation infrastructure

MapReduce, Bigtable and Pregel have their origins in Google and they all deal with “large systems”. But all of them may be dwarfed in size and complexity by a new project Google is working on, which was mentioned briefly (may be un-intentionally) at an event last year.

Instead of caching data closer to user, it looks like Google is trying to take “the data” to the user. If you use GMail or a Google Doc service, then with this framework, Google could, auto-magically, “move” one of the master copies of your data to the nearest Google data center without really having to cache anything locally. And because they are building one single datastore cluster around the world, instead of building hundreds of smaller ones for different applications, it looks like they may not don’t need dedicated clusters for specific projects anymore.

Below is the gist of “Spanner” from a talk by Jeff Dean at Symposium held at Cornell. Take a look at the rest of the slides if you are interested in some impressive statistics on hardware performance and reliability.

  • Spanner: Storage & computation system that spans all our datacenters
    • Single global namespace
      • Names are independent of location(s) of data
      • Similarities to Bigtable: table, families, locality groups, coprocessors,…
      • Differences: hierarchical directories instead of rows, fine-grained replication
      • Fine-grained ACLs, replication configuration at the per-directory level
    • support mix of strong and weak consistency across datacenters
      • Strong consistency implemented with Paxos across tablet replicas
      • Full support for distributed transactions across directories/machines
    • much more automated operation
      • System automatically moves and adds replicas of data and computation based on constraints and usage patterns
      • Automated allocation of resources across entire fleet of machines.



March 17, 2010

Pregel: Google’s other data-processing infrastructure

Inside Google, MapReduce is used for 80% of all the data processing needs. That includes indexing web content, running the clustering engine for Google News, generating reports for popular queries (Google Trends), processing satellite imagery , language model processing for statistical machine translation and  even mundane tasks like data backup and restore.

The other 20% is handled by a lesser known infrastructure called “Pregel” which is optimized to mine artrelationships from “graphs”.

According to wikipedia a “graph” is a collection of vertices or ‘nodes’ and a collection of ‘edges’ that connect pair of ‘nodes’.  Depending on the requirements, a ‘graph’ can be undirected which means there is no distinction between the two ‘nodes’ in the graph, or it could be directed from one ‘node’ to another.

While you can calculate something like ‘pagerank’ with MapReduce very quickly, you need more complex algorithms to mine some other kinds of relationships between pages or other sets of data.


Despite differences in structure and origin, many graphs out there have two things in common: each of them keeps growing in size, and there is a seemingly endless number of facts and details people would like to know about each one. Take, for example, geographic locations. A relatively simple analysis of a standard map (a graph!) can provide the shortest route between two cities. But progressively more sophisticated analysis could be applied to richer information such as speed limits, expected traffic jams, roadworks and even weather conditions. In addition to the shortest route, measured as sheer distance, you could learn about the most scenic route, or the most fuel-efficient one, or the one which has the most rest areas. All these options, and more, can all be extracted from the graph and made useful — provided you have the right tools and inputs. The web graph is similar. The web contains billions of documents, and that number increases daily. To help you find what you need from that vast amount of information, Google extracts more than 200 signals from the web graph, ranging from the language of a webpage to the number and quality of other pages pointing to it.

In order to achieve that, we have created scalable infrastructure, named Pregel, to mine a wide range of graphs. In Pregel, programs are expressed as a sequence of iterations. In each iteration, a vertex can, independently of other vertices, receive messages sent to it in the previous iteration, send messages to other vertices, modify its own and its outgoing edges' states, and mutate the graph's topology (experts in parallel processing will recognize that the Bulk Synchronous Parallel Model inspired Pregel).

Currently, Pregel scales to billions of vertices and edges, but this limit will keep expanding. Pregel's applicability is harder to quantify, but so far we haven't come across a type of graph or a practical graph computing problem which is not solvable with Pregel. It computes over large graphs much faster than alternatives, and the application programming interface is easy to use. Implementing PageRank, for example, takes only about 15 lines of code. Developers of dozens of Pregel applications within Google have found that "thinking like a vertex," which is the essence of programming in Pregel, is intuitive.

This is from the Paper

We conducted an experiment solving single-source shortest paths using a simple Belman-Ford algorithm expressed in Pregel. As input data, we chose a randomly generated graph with a log-normal distribution of outdegrees; the graph had 1B vertices and approximately 80B edges. Weights of all edges were set to 1 for simplicity. We used a cluster of 480 commodity multi-core PCs and 2000 workers. The graph was partitioned 8000 ways using the default partitioning function based on a random hash, to get a sense of the default performance of the Pregel system. Each worker was assigned 4 graph partitions and allowed up to 2 computational threads to execute the Compute() function over the partitions (not counting system threads that handle messaging, etc.). Running shortest paths took under 200 seconds, using 8 supersteps.


A talk is scheduled on Pregel at SIGMOD 2010

491 Pregel: A System for Large-Scale Graph Processing
Greg Malewicz, Google, Inc.; Matthew Austern, Google, Inc.; Aart Bik, Google, Inc.; James Dehnert, Google, Inc.; Ilan Horn, Google, Inc.; Naty Leiser, Google, Inc.; Grzegorz Czajkowski, Google, Inc.


  1. Pregel: A system for large-scale graph processing graph
  2. Inference Anatomy of the Google Pregel
  3. Large-scale graph computing at Google
  4. Mapreduce sigmetrics 09
  5. Bulk synchronous parallel
  6. Hamburg, Graph processing on Hadoop
  7. Bulk synchronous parallel
  8. 12 Reasons to learn graph theory
Related databases and products
  1. Neo4j is an embedded, disk-based, fully transactional Java persistence engine that stores data structured in graphs rather than in tables. A graph (mathematical lingo for a network) is a flexible data structure that allows a more agile and rapid style of development. You can think of Neo4j as a high-performance graph engine with all the features of a mature and robust database. The programmer works with an object-oriented, flexible network structure rather than with strict and static tables — yet enjoys all the benefits of a fully transactional, enterprise-strength database.
  2. Hypergraphdb is a general purpose, extensible, portable, distributed, embeddable, open-source data storage mechanism. It is a graph database designed specifically for artificial intelligence and semantic web projects, it can also be used as an embedded object-oriented database for projects of all sizes.
  3. Infogrid is an Internet Graph Database with a many additional software components that make the development of REST-ful web applications on a graph foundation easy.
  4. DEX is a high performance library to manage very large graphs or networks. DEX is available for personal use in this web page. Take some time to play with it and develop your applications to manage the network like data structures that represent your network of friends, or your network of information.
  5. Gremlin - Gremlin is a graph-based programming language. The documentation herein will provide all the information necessary to understand how to use Gremlin for graph query, analysis, and manipulation.

March 16, 2010

Product: Northscale memcached and membase server

Northscale, an organization formed by leaders of memcached open source project, today launched a couple of interesting products. The first one is an enhanced distribution of standard opensource memcached server which also does “secure multi-tenancy, dynamic cluster scaling and browser-based administration”. The second product is “Membase server” is a distributed key-value datastore build on top of memcached, but uses relational DB for persistence underneath.

The features provided seems to be geared towards organizations which lack (or perfer not to spend on) in-house technical expertise to test, develop and manage advanced caching technologies.

One of the biggest customers of NorthScale happens to be Zynga, which has been using it to power its FarmVille and Cafe World applications since Dec last year.

BTW, they also have a “labs” site which gives some idea of what other stuff they have been cooking.

  • Moxi – memcahed proxy
  • libconflate - “a library that helps you configure large installations of applications easily. It's designed to be easy to use, safe, and autonomous while still leveraging centralized resources to allow you to have your clusters easily adapt to change.”

HBase at Adobe

Cosmin Lehene, has a wonderful pair of posts about how a team in Adobe selected, tested and implemented an HBase based datastore for production use.

Its interesting how much they thought about failure and about the backup for backups. And in spite of all that how things still break. Building something new based on cutting-edge technology is not for the faint hearted. It needs to be supported by the organization, lead by those who can see the future, and backed by a team of experts who are always ready for challenges.

March 14, 2010

Scalable Tools: Murder: a bittorent based, file transfer tool for faster deployments

Deploying to one server can be done with a single script doing a bunch of ssh/scp commands. If you have a few more servers, you can run it in a loop sequentially or fork the processes to do them in parallel. At some point though, it will get unmanageable especially if you have the challenge of updating multiple datacenters at the same time. So how does a company like twitter do their releases ?

Murder is an interesting P2P/Bittorent based tool which twitter uses to distribute files during its own software updates. Here are some more details.

Murder is a method of using Bittorrent to distribute files to a large amount 
of servers within a production environment. This allows for scalable and fast
deploys in environments of hundreds to tens of thousands of servers where
centralized distribution systems wouldn't otherwise function. A "Murder" is
normally used to refer to a flock of crows, which in this case applies to a
bunch of servers doing something.

In order to do a Murder transfer, there are several components required to be
set up beforehand -- many the result of BitTorrent nature of the system. Murder
is based on BitTornado.

- A torrent tracker. This tracker, started by running the ''
script, runs a self-contained server on one machine. Although technically this
is still a centralized system (everyone relying on this tracker), the
communication between this server and the rest is minimal and normally
acceptable. To keep things simple tracker-less distribution (DHT) is currently
not supported. The tracker is actually just a mini-httpd that hosts a
/announce path which the Bittorrent clients update their state onto.

- A seeder. This is the server which has the files that you'd like to deploy
onto all other servers. For Twitter, this is the server that did the git diff.
The files are placed into a directory that a torrent gets created from. Murder
will tgz up the directory and create a .torrent file (a very small file
containing basic hash information about the tgz file). This .torrent file lets
the peers know what they're downloading. The tracker keeps track of which
.torrent files are currently being distributed. Once a Murder transfer is
started, the seeder will be the first server many machines go to to get
pieces. These pieces will then be distributed in a tree-fashion to the rest of
the network, but without necessarily getting the parts from the seeder.

- Peers. This is the group of servers (hundreds to tens of thousands) which
will be receiving the files and distributing the pieces amongst themselves.
Once a peer is done downloading the entire tgz file, it will continue seeding
for a while to prevent a hotspot effect on the seeder.


1. Configure the list of servers and general settings in config.rb. (one time)
2. Distribute the Murder files to all your servers: (one time)
cap murder:distribute_files
3. Start the tracker: (one time)
cap murder:start_tracker
4. Create a torrent file from a remote directory of files (on seeder):
cap murder:create_torrent tag="Deploy20100101" files_path="~/files"
5. Start seeding the files:
cap murder:start_seeding tag="Deploy20100101"
6. Distribute the files to all servers:
cap murder:peer tag="Deploy20100101" destination_path="/tmp/out"

Once completed, all files will be in /tmp/out/Deploy20091015/ on all servers.

More info here

March 12, 2010

Scalability links for March 13th 2010

For some reason there has been a disproportionately high number of news items on Cassandra lately. Some of those are included below, but also included are some other interesting updates which you might have missed.


March 08, 2010

Automated, faster, repeatable, scalable deployments

While efficient automated deployment tools like Puppet and Capistrano are a big step in the right direction, its not the complete solution for an automated deployment process. This post will explore some of the less discussed issues which are as important for automated, fast, repeatable scalable deployments. 

Rapid Build and Integration with tests

  • Use Source control to build an audit trail: Put everything possible in it, including configurations and deployment scripts.
  • Continuous Builds triggered by code check-ins can detect and report problems early image
    • Use tools which provide targeted feedback about build failures. It reduces noise and improves over all quality faster
    • Faster the build happens after a check-in, better are the chances for bugs to get fixed quickly. Delays can be costly since broken builds could impact other developers as well
    • Build smaller components (fail fast)
  • Continuous integration tests of all components can detect errors which may not be caught a build time.

Automated database changes

Can database changes be automated ? This is probably one of the most interesting challenges for automation, especially if the app requires data migrations which can’t be rolled back. While it would be nice to have only incremental changes introduced into each deployment (which are guaranteed to be forward and backward compatible), there might be some need for non-trivial changes once in a while. As long as there is a process to separate the trivial from non-trivial changes, it might be possible to automate most of the database changes through an automation process.

Tracking which migrations have been applied and which are pending is a very application specific problem for which there are no silver bullets.


Configuration management

Environment-specific properties

Its not abnormal to have different sets of configuration for dev and production. But creating different build packages for different target environments is not the right solution. If you need to change properties between environments pick a better way to do it.

  • Either externalize the configuration properties to a file/directory location outside your app folder, such that repeated deployments don’t overwrite properties.
  • Or, update the right properties automatically during deployment using a deployment framework which is capable of that.
Pushing at deployment time or pulling at run time

In some cases pulling new configuration files dynamically after application startup might make more sense. This is especially true for applications on an infrastructure like AWS/EC2. If applications were already deployed on the base OS image, then it will come up automatically when the system boots up. Some folks keep only minimal information in the base OS image, and use a datastore like S3 to download the latest configuration from. In a private network where using S3 is not possible, you could replace it with some kind of shared store like SVN/NFS/FTP/SCP/HTTPetc.

Deployment frameworks

3rd Party frameworks
  • Fabric - Fabric is a Python library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
  • Puppet -  Put simply, Puppet is a system for automating system administration tasks.
  • Capistrano - It is designed with repeatability in mind, letting you easily and reliably automate tasks that used to require login after login and a small army of custom shell scripts.  ( also check out webistrano )
  • Bcfg2 - Bcfg2 helps system administrators produce a consistent, reproducible, and verifiable description of their environment, and offers visualization and reporting tools to aid in day-to-day administrative tasks.
  • Chef - Chef is a systems integration framework, built to bring the benefits of configuration management to your entire infrastructure.
  • Slack - slack is an evolution from the usual "put files in some central directory" that is fairly common practice.
  • Kokki - System configuration management framework influenced by Chef
Custom or Mixed frameworks

The tools listed above are not the only set of tools available. Simple bash/sh scripts, ant scripts, even tools like cruisecontrol and hudson can be used for automated deployments. Here are some other interesting observations 

  • Building huge monolithically applications are thing of the past. Understanding how to break them up into self-contained, less inter-dependent components is the challenge.
  • If all of your servers get the same exact copy of application and configuration, then you don’t need to worry about configuration management. Just find a tool which deploys files fast.
  • If your deployments have a lot of inter-dependencies between components then choose a tool which gives you a visual interface of the deployment process if required.
  • Don’t be shy to write wrapper scripts to automate more tasks.
Push/Pull/P2P Frameworks

Grig has an interesting post about Push vs Pull where he lists the pros/cons of both the systems. What he forgot to mention is P2P which is the way twitter is going for its deployment. P2P has advantages from both Push and Pull architecture but comes with its own set of challenges. I haven’t seen an opensource tool using P2P yet, but I’m sure its not too far out.

Outage windows

Though deployments are easier with long outage windows, thats something hard to come by. In an ideal world one would have a parallel set of servers which one could cut over to with a flip of a switch. Unfortunately if user data is involved this is almost impossible to do. The next best alternative is to do “rolling updates” in small batches of servers. The reason this could be challenging is because the deployment tool needs to make sure the app really has completed initialization before it moves on to the next set of servers.

This can be further complicated by the fact that at times there are version dependencies between different applications. In such cases there needs to be a robust infrastructure to facilitate discovery of the right version of applications.


Deployment automation, in my personal opinion, is about the process, not the tool. If you have any interesting observations, ideas or comments, please feel free to write to me or leave a comment on this blog.

March 04, 2010

Disaster Recovery: Impressive RPO and RTO objectives set by Google Apps Operations

Unless you are running a fly by night shop, DR (Disaster recovery) should be one of the top issues for your operations team. In a “Scalable architecture” world, the complexity of DR can become a disaster in itself. 

Yesterday Google Announced that it now finally has DR plan for Google Apps. While this is nice, one should always take such messages with a pinch of salt, until they prove it that they can do it. Look at the DR plan for Google App engine which was also there, but still suffered more than 2 hour outage because of incomplete documentation, insufficient training and probably lack of someone to make a quick decisive decision at the time of failure.

But back to Google Apps for now. These guys are planning for an RPO of 0 seconds, which means multiple datacenters will always be in consistent state all the time.  And they want a RTO to be instant failover as well ! This is an incredible DR plan, and requires technical expertise in all 7 layers of OSI Model to achieve it.
In larger businesses, companies will add a storage area network (SAN), which is a consolidated place for all storage. SANs are expensive, and even then, you're out of luck if your data center goes down. So the largest enterprises will build an entirely new data center somewhere else, with another set of identical mail servers, another SAN and more people to staff them.

But if, heaven forbid, disaster strikes both your data centers, you're toast (check out this customer's experience with a fire). So big companies will often build the second data center far away, in a different 'threat zone', which creates even more management headaches. Next they need to ensure the primary SAN talks to the backup SAN, so they have to implement robust bandwidth to handle terabytes of data flying back and forth without crippling their network. There are other backup options as well, but the story's the same: as redundancy increases, cost and complexity multiplies.

How do you know if your disaster recovery solution is as strong as you need it to be? It's usually measured in two ways: RPO (Recovery Point Objective) and RTO (Recovery Time Objective). RPO is how much data you're willing to lose when things go wrong, and RTO is how long you're willing to go without service after a disaster.

For a large enterprise running SANs, the RTO and RPO targets are an hour or less: the more you pay, the lower the numbers. That can mean a large company spending the big bucks is willing to lose all the email sent to them for up to an hour after the system goes down, and go without access to email for an hour as well. Enterprises without SANs may be literally trucking tapes back and forth between data centers, so as you can imagine their RPOs and RTOs can stretch into days. As for small businesses, often they just have to start over.

For Google Apps customers, our RPO design target is zero, and our RTO design target is instant failover. We do this through live or synchronous replication: every action you take in Gmail is simultaneously replicated in two data centers at once, so that if one data center fails, we nearly instantly transfer your data over to the other one that's also been reflecting your actions.

This is one of the most ambitious DR plan I’ve ever read off which involves such a huge customer base.They not only have to replicate all the user data into multiple data centers, they have to do it synchronously (or almost synchronously),  across a huge distance (latency can slow down synchronous operations) without impacting users. And to top it all, they have to do a complete site failover if the primary datacenter goes down.

I am impressed, but don’t mind learning more on how they do it.

March 01, 2010

The Reddit problem: Learning from mistakes

Reddit has a very interesting post about what not to do when trying to build a scalable system. While the error is tragic, I think its an excellent design mistakes to learn from.

Though the post lacked detailed technical report, we might be able to recreate what happened. They mentioned they are using MemcacheDB datastore, with 2GB RAM per node, to keep some data which they need very often.

Dont be confused between MemcacheDB and memcached. While memcached is a distributed cache engine, MemcacheDB is actually a persistent datastore. And because they both use the same protocol, applications often use memcached libraries to connect to MemcacheDB datastore. redditheader

Both memcached and MemcacheDB rely on the clients to figure out how the keys are distributed across multiple nodes. Reddit chose MD5 to generate the keys for their key/value pairs. The algorithm Reddit used to identify which node in the cluster a key should go to could have been dependent on the number of nodes in the system. For example one popular way to identify which node a key should be on would be to use the “modulo” function. For example key “k” could be stored on the node “n” where “n=k modulo 3”. [ If k=101, then n=2 ]


Though MemcacheDB uses BDB to persist data, it seems like they heavily relied on keeping all the data in RAM. And at some point they might have hit the upper limit on what could be cached in RAM which caused disk i/o which resulted in slower response times. In a scalable architecture one should have been able to add new nodes and the system should have been able to scale.

Unfortunately though this algorithm works beautifully during regular operation, it fails as soon as you add or remove a node (when you change n). At that point you can’t guarantee that all the data you previously stored on node n would still be on the same node.

And while this algorithm may still be ok for “memcached” cache clusters, its really bad for MemcacheDB which requires “consistent hashing”.


Reddit today announced that they have increased RAM on these MemcacheDB servers from 2GB to 6GB, which allows 94% of their DB to be kept in memory. But they have realized their mistake (they probably figured this out long time back) and are thinking about how to fix it. The simplest solution of adding a few nodes requires re-hashing their keys which would take days according to their estimate. And of course just adding nodes without using some kind of “consistent hashing” is still not a scalable solution.

I personally learnt two things

  • Dont mix MemcacheDB and memcached. They are not designed to solve the same problem.

  • Don’t just simply replace memcached with MemcacheDB without thinking twice

There are many different products out there today which do a better job at scaling, so I won’t be surprised if they abandon MemcacheDB completely as well.