Is auto-sharding ready for auto-pilot?

James Golick makes a point which a lot of people miss. He doesn’t believe the auto-sharding features NoSQL datastores provide are ready for full auto-pilot yet, and argues that good developers have to think about sharding as part of their design architecture, regardless of which datastore they pick.

If you take at face value the marketing materials of many NoSQL database vendors, you’d think that with a horizontally scalable data store, operations engineering simply isn’t necessary. Recent high profile outages suggest otherwise.

MongoDB, Redis-cluster (if and when it ships), Cassandra, Riak, Voldemort, and friends are tools that may be able to help you scale your data storage to varying degrees. Compared to sharding a relational database by hand, using a partitioned data store may even reduce operations costs at scale. But fundamentally, no software can use system resources that aren’t there.

At the very least, one has to understand how auto-sharding in a given NoSQL datastore works, and how easy it is to set up, maintain, back up, and restore. “Rebalancing” can be an expensive operation, and if shards are separated by distance or high latency, some designs might work better than others.

MongoDB: The bit.ly implementation

I was at the MongoSF conference a few weeks ago, and 10gen just hosted one in NY as well. I was taken aback by the simplicity and the hope MongoDB provides.

I’ll have a more detailed post about what I think of it in a few days, but until then chew on these slides from bit.ly, which also uses it to power its backend datastore.

Cassandra as a communication medium – a service registry and discovery tool

A few weeks ago, while I was mulling over what kind of service registry/discovery system to use for a scalable application deployment platform, I realized that for mid-size organizations with a complex set of services, building one from scratch may be the only option.

I also found out that many AWS/EC2 customers have already been using S3 and SimpleDB to publish/discover services. That discussion eventually led me to investigate Cassandra as the service registry datastore in an enterprise network.

Here are some of the observations I made as I played with column-oriented Cassandra for this purpose. I welcome feedback from readers if you think I’m doing something wrong or if you think I can improve the design further.

  • The biggest issue I noticed with Cassandra was the absence of an inverted index, which can be worked around as I have blogged here (see the inverted index section later in this post). I later realized there is something called Lucandra as well, which I need to look at at some point.
  • The keyspace structure I used was very simple… (I skipped some configuration lines to keep it simple)

<Keyspace Name="devkeyspace">
    <ColumnFamily  CompareWith="UTF8Type" Name="forward"  />
    <ColumnFamily  CompareWith="UTF8Type" Name="reverse"  />
</Keyspace>

  • Using an “OrderPreservingPartitioner” seemed important for doing “range scans”. The OrderPreservingPartitioner keeps objects with similar-looking keys together, which allows bulk reads and writes. By default, Cassandra distributes objects randomly across the cluster, which works well if you only have a few nodes. (See the storage-conf.xml sketch after this list.)
  • I eventually plan to use this application across two datacenters. The best way to mirror data across datacenters in Cassandra is by using “RackAwareStrategy”. If you select this option, it tells Cassandra to try to pick replicas of each token from different datacenters/racks. The default algorithm uses IP addresses to determine if two nodes are part of the same rack/datacenter, but there are other interesting ways to do it as well.
  • Some of the APIs changed significantly between the versions I was playing with. Cassandra developers will remind you that this is expected in a product that is still at version 0.5. What amazes me, however, is the fact that Facebook, Digg and now Twitter have been using this product in production without bringing everything down.
  • I was eventually able to build a thin Java webapp to front-end Cassandra, which provided the REST/JSON interface for the registry/discovery service. This is also the app that managed the inverted indexes.
    • Direct Cassandra access from remote services was disabled for security/stability reasons.
    • The app used DNS to load-balance queries across multiple servers.
  • My initial performance tests on this cluster performed miserably because I forgot that all of my requests were hitting the same node. The right way to test Cassandra’s capacity is to load-balance requests across all Cassandra nodes.
    • I also realized that, by default, the logging level was set to “DEBUG”, which is very verbose. Turning that down seemed to speed up response times as well.
  • Playing with different consistency levels for reading and writing was also an interesting experience, especially when I started killing nodes just to see the app break. This is what tweaking CAP is all about.
  • Due to an interesting problem related to “eventual consistency”, Cassandra doesn’t completely delete data which was marked for deletion or was intentionally changed. In the default configuration, that data is kept around for 10 days before it is completely removed from the system.
  • Some documentation on the core operational aspects of Cassandra exists, but it would be nice if there were more.
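
For reference, here is roughly what the relevant partitioner and replication settings looked like in my storage-conf.xml. This is a sketch based on the 0.5/0.6-era configuration format (in some versions the replica settings live inside the <Keyspace> block), so treat the exact placement and the replication factor as assumptions rather than a copy-paste recipe.

<!-- storage-conf.xml sketch: order-preserving partitioner plus rack/datacenter-aware replication -->
<Partitioner>org.apache.cassandra.dht.OrderPreservingPartitioner</Partitioner>
<ReplicaPlacementStrategy>org.apache.cassandra.locator.RackAwareStrategy</ReplicaPlacementStrategy>
<ReplicationFactor>3</ReplicationFactor>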

Cassandra was designed as a scalable, highly available datastore. But because of its self-healing and “RackAware” features, it can serve as an interesting communication medium as well.

Scalable logging using Syslog

Syslog is a commonly used transport mechanism for system logs. But people sometimes forget it can be used for a lot of other purposes as well.

Take, for example, the interesting challenge of aggregating web server logs from 100 different servers onto one server and then figuring out how to merge them. If you have built your own tool to do this, you have probably figured out by now how expensive it is to poll all the servers and how out-of-date these logs can get by the time you process them. If you are not inserting them into some kind of datastore which sorts the rows by timestamp, you also have to take up the challenge of building a merge-sort script.

There is nothing which stops applications from using syslog as well. If your apps are in Java, you should try out the syslog appender for log4j [Ref 1] [Ref 2]. Not only do you get central logging, you also get to see a real-time “tail -f” of events as they happen in a merged file. If there are issues anywhere in your network, you have just one place to look. If your logging volume is high, you will have to use other tools (or build your own) to do log analysis.
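
Most of the work is configuration. Here is a minimal log4j.properties sketch; the host name loghost.example.com and the LOCAL1 facility are placeholders you would swap out for your own environment.

# log4j.properties sketch: send everything at INFO and above to a central syslog server
log4j.rootLogger=INFO, SYSLOG
log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
log4j.appender.SYSLOG.SyslogHost=loghost.example.com
log4j.appender.SYSLOG.Facility=LOCAL1
log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
log4j.appender.SYSLOG.layout.ConversionPattern=%-5p [%t] %c: %m%n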

Here are some things you might have to think about if you plan to use syslog for your environment.

  1. Set up a different syslog server for each of your datacenters, using split DNS or different hostnames.
  2. Try not to send logs across WAN links.
  3. Rotate logs nightly, or more often depending on log volume.
  4. Reduce the amount of logging (for example, don’t log at “debug” level in production).
  5. Write tools to detect changes in logging volume in your dev/QA environments. If you follow good logging practices, you should be able to quickly identify the components responsible for an increase.
  6. Identify log patterns which could be causes of concern and set up some kind of alerting using your regular monitoring service (Nagios, for example). Don’t be afraid to use third-party tools which do this very well.
  7. Syslog over UDP is non-blocking, but the syslog server can get overloaded if logging volume is not controlled. The most expensive part of logging is disk i/o. If you notice high i/o on the syslog server, consider writing the log files asynchronously or splitting them across disks (see the syslog.conf sketch after this list).
  8. UDP doesn’t guarantee that every log event will make it to the syslog server. Find out if that level of uncertainty in logging is ok for your environment.
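
On the receiving side, a couple of lines in the syslog configuration of the central log host cover most of this. Below is a sketch in classic /etc/syslog.conf syntax; the local1 facility and file paths are assumptions for illustration. The leading “-” tells syslogd not to sync the file after every message, which helps with the disk i/o concern in point 7.

# /etc/syslog.conf on the central log host (sketch)
# application logs (facility local1) go to their own file, written asynchronously
local1.*              -/var/log/apps/application.log
# keep application noise out of the main system log
*.info;local1.none     /var/log/messages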

Other interesting observations

  1. The amount of change required in a Java app that is already using log4j to switch over to a syslog server is trivial.
  2. Logging to local files can be disabled, which means you don’t have to worry about disk storage on each server.
  3. If you are using, or want to use, tools like Splunk or Hadoop/HBase for log analysis, syslog is probably the easiest way to get there.
  4. You can always load-balance syslog servers by using DNS load balancing.
  5. Apache webservers can’t do syslog out of the box, but you can still make it happen (piping logs through the logger command is one common approach).
  6. I personally like HAProxy more, and it does do syslog out of the box.
  7. If you want to log events from startup/shutdown scripts, you can use the “logger” *nix command to send events to the syslog server.
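
For example, a line like this in an init script (the facility, tag, and message are just placeholders) ends up in the same central stream as the application logs:

logger -p local1.info -t myapp-init "myapp starting up on $(hostname)"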

How are logs aggregated in your environment?

References

Cassandra: inverted index

Cassandra is the only NoSQL datastore I’m aware of that is a scalable, distributed, self-replicating, eventually consistent, schema-less key-value store running on Java with no single point of failure. HBase could also match most of these requirements, but Cassandra is easier to manage due to its tiny footprint.

The one thing Cassandra doesn’t do today is indexing columns.

Let’s take a specific example to explain the problem. Let’s say there are 100 rows in the datastore which have 5 columns each. If you want to find the row which says “Service=app2”, you will have to iterate one row at a time, which is like a full database scan. In a 100-row datastore where only one row has that particular column value, it could take on average about 50 row reads before you find your data.

[Image: sample rows in the datastore, each with a handful of columns such as Service and status]

While I’m sure there is a good reason why this doesn’t exist yet, the application inserting the data could build such an inverted index itself even today. Here is an example of what an inverted index table would look like.

[Image: the corresponding inverted index table, with rows keyed by column value (for example “Service=app2”) listing the matching row keys]

If you want to find the “status” of all rows where “Service=app2”, all you have to do is find the list of matching keys by making a single call to this table. The second call is to get all the column values for that row. Even if you have 100 different rows in the table, finding the one particular row matching your search query can now be done in two calls.

Of course, there is a penalty you have to pay. Every time you insert one row of data, you also have to insert multiple rows to build the inverted index, and you have to update the inverted index yourself if any of the column values are updated or deleted. Cassandra 0.5.0, which was recently released, has been benchmarked at about 10,000 row inserts per second on a 4-core server with 2GB of RAM. If you have an average of 5 columns per row, that works out to roughly 1.5k logical row inserts per second, since each logical insert also requires about 5 inserts/updates to maintain the inverted index. For more throughput, you always have the option of adding more servers.
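
To make the bookkeeping concrete, here is a rough Java sketch of the write and read paths. The CassandraClient interface is a hypothetical thin wrapper around the Thrift client, just to keep the example short (the real 0.5 API takes keyspace, ColumnPath/ColumnParent and consistency-level arguments, and also offers batch_insert to write all the forward columns in one call); the “forward” and “reverse” column family names match the keyspace definition earlier in this post.

import java.util.List;
import java.util.Map;

// Hypothetical thin wrapper around the Cassandra Thrift client, to keep the sketch short.
interface CassandraClient {
    void insert(String columnFamily, String rowKey, String column, String value);
    Map<String, String> getRow(String columnFamily, String rowKey);    // all columns of one row
    List<String> getColumnNames(String columnFamily, String rowKey);   // column names of one row
}

class ServiceRegistry {
    private final CassandraClient cassandra;

    ServiceRegistry(CassandraClient cassandra) {
        this.cassandra = cassandra;
    }

    // Write path: one "forward" insert per column, plus one "reverse" (inverted index)
    // insert per column so rows can later be found by column value.
    void register(String rowKey, Map<String, String> columns) {
        for (Map.Entry<String, String> e : columns.entrySet()) {
            cassandra.insert("forward", rowKey, e.getKey(), e.getValue());
            // reverse row key is "column=value"; the column name is the forward row key
            cassandra.insert("reverse", e.getKey() + "=" + e.getValue(), rowKey, "");
        }
    }

    // Read path: two calls instead of a full scan.
    // Call 1: read the inverted index row for "Service=app2" to get the matching row keys.
    // Call 2: fetch the columns of the matching forward row.
    String findStatus(String service) {
        List<String> matches = cassandra.getColumnNames("reverse", "Service=" + service);
        if (matches.isEmpty()) {
            return null;
        }
        Map<String, String> row = cassandra.getRow("forward", matches.get(0));
        return row.get("status");
    }
}

The write amplification described above is visible in register(): every logical row write fans out into forward and reverse inserts, and updates or deletes need the same treatment in both column families.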

Facebook and Digg are both extensively using Cassandra in their architectures. Here are some interesting reading materials on Cassandra if you’d like to explore more.

[Updated: Discussion on Google Buzz]