Cassandra: What is HintedHandoff ?

Nate has a very good post about how Cassandra is different from a lot of other distributed data-stores. In particular he explains that every node in a Cassandra cluster are identical to every other node. After using cassandra and a few months I can tell you for a fact that its true. It does come at a price though. Because its so decentralized, if you want to make a schema change, for example, configuration files of all of the nodes in the cluster need to be updated all at the same time. Some of these problems will go away when 0.70 finally comes out.

While is true that Cassandra doesn’t have a concept of single “master” server, each node participating in the cluster do actually act as masters and slaves of parts of the entire key range. The actual % size of the key range owned by an individual node depends on the replication factor, on how one picks the keys and what partitioner algorithm was selected.

The fun starts when a node, which could be the master for a range of keys, goes down. This is how Nate explains the process..

Though the node is the "primary" for a portion of the data in the cluster, the number of copies of the data kept on other nodes in the cluster is configurable. When a node goes down, the other nodes containing copies, referred to as "replicas", continue to service read requests and will even accept writes for the down node. When the node returns, these queued up writes are sent from the replicas to bring the node back up to date

And this is from the Cassandra wiki

If a node which should receive a write is down, Cassandra will write a hint to a live replica node indicating that the write needs to be replayed to the unavailable node. If no live replica nodes exist for this key, and ConsistencyLevel.ANY was specified, the coordinating node will write the hint locally. Cassandra uses hinted handoff as a way to (1) reduce the time required for a temporarily failed node to become consistent again with live ones and (2) provide extreme write availability when consistency is not required.

A hinted write is NOT sufficient to count towards ConsistencyLevel requirements of ONE, QUORUM, or ALL. Take the simple example of a cluster of two nodes, A and B, and a replication factor of 1 (each key is stored on one node). Suppose node A is down while we write key K to it with ConsistencyLevel.ONE. Then we must fail the write: recall from the API page that "if W + R > ReplicationFactor, where W is the number of nodes to block for on write, and R the number to block for on reads, you will have strongly consistent behavior; that is, readers will always see the most recent write."

Thus if we write a hint to B and call the write good because it is written "somewhere," there is no way to read the data at any ConsistencyLevel until A comes back up and B forwards the data to him. Historically, only the lowest ConsistencyLevel of ZERO would accept writes in this situation; for 0.6, we added ConsistencyLevel.ANY, meaning, "wait for a write to succeed anywhere, even a hinted write that isn’t immediately readable."

Mike Perham has a related post on the same topic. He goes further and explains that because there could be scenarios where writes are not immediately visible due to a disabled master node, its possible that master could get out of sync with the slaves in the confusion. There is a process called “anti-entropy” which cassandra uses to detect and resolve such issues. Here is how he explains

The final trick up Cassandra’s proverbial sleeve is anti-entropy. AE explicitly ensures that the nodes in the cluster agree on the current data. If read repair or hinted handoff don’t work due to some set of circumstances, the AE service will ensure that nodes reach eventual consistency. The AE service runs during “major compactions” (the equivalent of rebuilding a table in an RDBMS) so it is a relatively heavyweight process that runs infrequently. AE uses a Merkle Tree to determine where within the tree of column family data the nodes disagree and then repairs each of those branches.