Cassandra: inverted index

Cassandra is the only NoSQL datastore I'm aware of that is a scalable, distributed, self-replicating, eventually consistent, schema-less key-value store running on Java with no single point of failure. HBase could also match most of these requirements, but Cassandra is easier to manage due to its tiny footprint.

The one thing Cassandra doesn't do today is index columns.

Let's take a specific example to explain the problem. Say there are 100 rows in the datastore, with 5 columns each. If you want to find the row which says "Service=app2", you have to iterate one row at a time, which amounts to a full database scan. If only one row out of the 100 has that particular column, you would read about 50 rows on average before finding your data.

[Image: sample table of rows, each with columns such as Service and status]
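The scan can be sketched with plain Python dicts standing in for rows (a minimal illustration, not a Cassandra client; the keys and column values here are made up):

```python
# Toy datastore: 100 rows, 5 columns each (a stand-in for a column family).
rows = {
    f"row{i}": {"Service": f"app{i % 10}", "status": "up",
                "host": f"h{i}", "port": "8080", "dc": "us-east"}
    for i in range(100)
}

# Without an index, finding rows where Service == "app2" means
# touching every row: a full scan.
matches = [key for key, cols in rows.items() if cols.get("Service") == "app2"]
print(matches)  # every row key whose Service column is "app2"
```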

While I'm sure there is a good reason why this doesn't exist yet, the application inserting the data could build such an inverted index itself even today. Here is an example of what an inverted index table would look like.

[Image: inverted index table mapping column values to lists of row keys]

If you want to find the "status" of all rows where "Service=app2", all you have to do is fetch the list of matching keys with a single call to this table. A second call then gets all the column values for those rows. Even if you have 100 different rows in a table, finding the one particular row matching your search query can now be done in two calls.
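A minimal sketch of this two-call pattern, with dicts standing in for the two column families (names like `insert_row` are illustrative, not a real Cassandra API):

```python
data = {}   # main column family: row key -> {column: value}
index = {}  # inverted index column family: "column=value" -> {row key: ""}

def insert_row(key, columns):
    """Write the row, then write one index entry per column --
    the extra inserts the post describes."""
    data[key] = dict(columns)
    for col, val in columns.items():
        index.setdefault(f"{col}={val}", {})[key] = ""

insert_row("row1", {"Service": "app1", "status": "up"})
insert_row("row2", {"Service": "app2", "status": "down"})

# Call 1: fetch the matching row keys from the index row.
keys = list(index.get("Service=app2", {}))
# Call 2: fetch the wanted columns (e.g. "status") for those keys.
statuses = {k: data[k]["status"] for k in keys}
print(statuses)  # {'row2': 'down'}
```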

Of course, there is a penalty to pay. Every time you insert one row of data, you also have to insert multiple rows to build the inverted index. You likewise have to update the inverted index yourself if any of the column values are updated or deleted. Cassandra 0.5.0, which was recently released, has been benchmarked at about 10,000 row inserts per second on a 4-core server with 2GB of RAM. If you average 5 columns per row, that works out to about 1.5k actual row inserts per second (each row of data requires 5 additional inserts/updates for the inverted index). For more throughput, you always have the option of adding more servers.
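The write-amplification arithmetic, spelled out:

```python
raw_inserts_per_sec = 10_000   # benchmark figure cited in the post
columns_per_row = 5            # average columns per logical row

# Each logical row costs 1 data insert + 5 index inserts/updates.
writes_per_logical_row = 1 + columns_per_row
effective = raw_inserts_per_sec / writes_per_logical_row
print(round(effective))  # ~1667, i.e. "about 1.5k" logical rows/sec
```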

Facebook and Digg are both extensively using Cassandra in their architectures. Here are some interesting reading materials on Cassandra if you’d like to explore more.

[Updated: Discussion on Google Buzz ]

10 comments

    1. @jothanan: I assumed that if 1 row of data needs 5 more rows of inserts/updates, that makes a total of 6 rows of inserts/updates per row of data inserted. 10k/6 =~ 1.5K.
      But I could be wrong, and would appreciate if you could shed a little light on why you think it can still do 10k per second.

  1. The problem of updating an inverted index is much worse than merely an extra update per new cell. The first extra bit of pain is dealing with concurrent updates; a simple read-extend-write of the index now falls prey to the classic problem of N-1/N concurrent updates being lost. The next bit of pain is dealing with a really big index. If you have a few thousand rows, let alone a billion, updates of a simple index are going to be extremely inefficient. Now you’ll have to maintain your index as a B-tree or some such using multiple rows. Now you get to the most painful part of all: concurrent multi-row updates. If you want to support something as simple as SQL’s “ORDER BY x LIMIT n” for a potentially large dataset with just about any of the simpler distributed key/value or column stores, it’s going to be rather unpleasant.

    Note that I’m not saying this invalidates the NoSQL approach. I’m pretty well known as a NoSQL *advocate* myself. It’s just something people have to be aware of as they’re working through their scale/feature/CAP requirements.

  2. Jeff,

    The solution I came up with addresses a particular problem I was having using Cassandra to manage a glorified service state registry I was building. I wanted Cassandra's availability and partition-tolerance features and didn't really care about its scalability, because I knew what my dataset size would be.

    The implementation doesn't require "read-extend-write". You can just "write" to the column family directly, because I know what key to add. I also don't have any use case to scan the entire index… all I need is the last N rows to find something in the index. As far as Cassandra is concerned, that table is just a huge ordered list of rows, so I don't think there should be a performance issue.

    I think you could make an argument that the indexes could at times be out of sync with the data, since they are not in the same column family, but again the application I'm building is OK with minor issues like that.

    My understanding is that what I'm doing is not too far from how others are using Cassandra. Some sample schema designs from Evan @ Twitter show that they might also use a similar way of indexing data with Cassandra. http://blog.evanweaver.com/articles/2009/07/06/up-and-running-with-cassandra/

    rkt

  3. @Royans, I think it's a decent approach given your application requirements.
    I have been doing something not exactly the same, but it is in effect an index for an intranet search engine that I have been migrating to Cassandra from Oracle.

    My implementation supports what I'd like to call "full-text" search, but it is still in early beta.
    @Jeff, some valid points there, especially on concurrent updates. When I first attempted to create a simple index I ran into those issues, but simple code changes and additions solved them.
    My solution was to maintain a column family for index logs, where I keep track of what is being written along with other useful info, including a status of success or failure. Since it's a search engine which is continually running, I simply use a few methods in a class to periodically check for error statuses in the log. Each success log is deleted at some point, and I check my column families to see whether any rows or columns marked as failed in the log were actually written. If no record is found in the column family where the value should have been written, I retry, then continue to check for inconsistencies in other column families related to the failed update and try to correct them. It's extra work, but it has worked out quite well so far. My dataset is only 100 GB right now, but it's growing day by day; I've hammered out quite a lot of issues, she's fairly stable at this point, and quite fast on searches, which is my priority :-)
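As an aside on the concurrent-update point Jeff raises above, the classic lost update in a read-extend-write index can be demonstrated in a few lines, with sequential Python standing in for two interleaved clients:

```python
# Two clients each do read-extend-write on the same index entry;
# interleaved, one update is silently lost.
index = {"Service=app2": ["row1"]}

# Both clients read the current entry before either writes.
client_a = list(index["Service=app2"])
client_b = list(index["Service=app2"])

client_a.append("row2")
index["Service=app2"] = client_a   # A writes back

client_b.append("row3")
index["Service=app2"] = client_b   # B overwrites A's update

print(index["Service=app2"])  # ['row1', 'row3'] -- "row2" is lost
```

Writing each key as its own column in the index row, as the post's approach does, sidesteps this particular race because no read-modify-write cycle is needed.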
