Latest Publications

Real-Time MapReduce using S4

While trying to figure out how to do real-time log analysis in my own organization I realized that most map-reduce frameworks are designed to run as batch jobs in time delays manner rather than be instantaneous like a SQL query to a Mysql DB. There are some frameworks which are bucking the trend. Yahoo! Lab! recently announced that their “Advertising Sciences” group has built a general purpose, real-time, distributed, fault-tolerant, scalable, event driven, expandable platform called “S4” which allows programmers to easily implement applications for processing continuous unbounded streams of data.

S4 clusters are built using low-cost commoditized hardware, and leverage many technologies from Yahoo!’s Hadoop project. S4 is written in Java and uses the Spring Framework to build a software component architecture. Over a dozen pluggable modules have been created so far.

Why do we need a real-time map-reduce framework?
Applications such as personalization, user feedback, malicious traffic detection, and real-time search require both very fast response and scalability. In S4 we abstract the input data as streams of key-value pairs that arrive asynchronously and are dispatched intelligently to processing nodes that produce data sets of output key-value pairs. In search, for example, the output data sets are made available to the serving system before a user executes her next search query. We use this rapid feedback to adapt the search models based on user intent

Read more: Original post from Yahoo! Labs

Storage options on app engine

For those who think google app engine only has one kind of datastore, the one built around “bigtable”, think again. Nick Johnson goes into details of all the other options available with their pro’s and con’s in his post.

App Engine provides more data storage mechanisms than is apparent at first glance. All of them have different tradeoffs, so it’s likely that one – or more – of them will suit your application well. Often, the ideal solution involves a combination, such as the datastore and memcache, or local files and instance memory.

Storage options he lists..

  • Datastore
  • Memcache
  • Instance memory
  • Local Files

Read more: Original post from Nick

Is auto-sharding ready for auto-pilot ?

James Golick makes a point which lot of people miss. He doesn’t believe auto-sharding features NoSQL provides is ready for full auto-pilot yet, and that good developers have to think about sharding as part of design architecture, regardless of what datastore you pick.

If you take at face value the marketing materials of many NoSQL database vendors, you’d think that with a horizontally scalable data store, operations engineering simply isn’t necessary. Recent high profile outages suggest otherwise.

MongoDB, Redis-cluster (if and when it ships), Cassandra, Riak, Voldemort, and friends are tools that may be able to help you scale your data storage to varying degrees. Compared to sharding a relational database by hand, using a partitioned data store may even reduce operations costs at scale. But fundamentally, no software can use system resources that aren’t there.

At the very least one has to understand how auto sharding in a NoSQL works, how easy is it to setup, maintain, backup and restore. “Rebalancing” can be an expensive operation, and if shards are separated by distance or high latency, some designs might be better than others.

Why Membase Uses Erlang

It’s totally worth it. Erlang (Erlang/OTP really, which is what most people mean when they say “Erlang has X”) does out of the box a lot of things we would have had to either build from scratch or attempt to piece together existing libraries to do. Its dynamic type system and pattern matching (ala Haskelland ML) make Erlang code tend to be even more concise than Python and Ruby, two languages known for their ability to do a lot in few lines of NorthScalecode.

The single largest advantage to us of using Erlang has got to be its built-in support for concurrency. Erlang models concurrent tasks as “processes” that can communicate with one another only via message-passing (which makes use of pattern matching!), in what is known as the actor model of concurrency. This alone makes an entire class of concurrency-related bugs completely impossible. While it doesn’t completely prevent deadlock, it turns out to be pretty difficult to miss a potential deadlock scenario when you write code this way. While it’s certainly possible to implement the actor model in most if not all of the alternative environments I mentioned, and in fact such implementations exist, they are either incomplete or suffer from an impedance mismatch with existing libraries that expect you to use threads.

The complete original post can be found here: blog.membase.com

Cassandra’s future @facebook and links to other NoSQL slides

I heard an unconfirmed rumor that facebook is moving away from Cassandra. Not sure why, or to what, but rumors like this is a concern regardless. After twitter‘s backing off, and digg’s troubles, which were indirectly linked to either Cassandra’s maturity as a production solution or their understanding of Cassandra’s capability, it might raise more eyebrows if facebook does really abandon cassandra.  Cassandra was created in Facebook, which it opensourced, but its my understanding today that most of the development on the open sourced cassandra happens outside its walls. Rackspace is a big sponsor(may not be the largest anymore) of the open source project and Riptano, which has built a whole compnay around the open source project has done a tremendous job of promoting.

Scalability links for October 30th :

Scalability links for October 28th

Scalability links for October 28th:

Scalability links for October 21st

Scalability links for October 21st:

Scalability links for October 18th

Scalability links for October 18th:

Scaling Graphite by using Cfmap as the data transport

Graphite is an extremely promising system and resource graphing tool which tries to take RRD to the next level. Here are some of the most interesting features of graphite which I liked.

  • Updates can happen to records in the past (RRD doesn’t allow this I think)image
  • Creation of new datasets is trivial with whisper/carbon ( its part of  the graphite framework )
  • Graphite allows federated storage (multiple servers across multiple datacenters for example)

Monitoring and graphing resources across data-centers is tricky however. Especially because WAN links cannot be trusted. While loosing some data may be ok for some folks, it may not be acceptable for others. Graphite tries to solve this problem by providing an option to federate data across multiple servers which could each be kept in separate datacenters.

Another way to solve this problem is by using a data transport which is resilient to network failures.

Since Cfmap (thanks to Cassandra) is a distributed, eventually consistent, state repository, it could easily be extended to act as an eventually consistent data queue for tools like graphite. Take a look at an example of a server record here (on the right).  With some minor modifications, we were able to log and publish all changes to system attributes using an API like this. And with the right script running on the graphite server, the import of these stats into carbon (a component of graphite), became a trivial task.

image

With Cfmap’s easy REST interface, adding new stats to graphite becomes as simple as registering the stats to cfmap from anywhere in the network. [ sample script ]

Graphite is not in use in our corporate network today, but I’m extremely excited at the possibilities and would be actively looking at how else we could use it.

[ Take a look at RabbitMQ integration with Graphite for another interesting way of working with graphite ]

Cfmap: Publishing, discovering and dashboarding infrastructure state

image

Dynamic infrastructure can be a challenging if apps and scripts can’t keep up with them. At Ingenuity we observed this problem when we started moving towards virtualization and SOA (service oriented architecture). Remembering server names became impractical, and error-free manual configuration changes became impossible.

imageWhile there are some tools which solve parts of this specific problem, we couldn’t find any opensource tool which could be used to both publish and discover state of a system in a distributed, scalable and fault-tolerant way. Zookeeper which comes pretty close to what we needed was a fully consistent system which was not designed to be used across multiple data centers over high latency, unstable network connections. We wanted a system which could not only be up during network outages, but also sync up the state from different data-centers when they are connected.

We built a few different tools to solve our scalability problems, one of which is a tool called Cfmap which we are opensourcing today to help others facing the same problem.

So what is cfmap ?

Built over cassandra, cfmap is designed to be a scalable, eventually consistent and a fault tolerant repository of state information. It provides a set of REST APIs and UIs to both publish and discover state of an entity or a group of entities with great ease. The APIs are so simple that you would most probably be writing your own custom agents for the various servers and processes than use the agent which comes bundled with the tool.

We have been using cfmap internally for a few months and the results are promising. Here is an example of how cfmap’s dashboard looks like on our network  (I’ve changed some names to protect the actual resource names).  Here is another dashboard which is running out in the public which you can use today as a demo.

image

Cfmap provides the ability to quickly drill down to a filtered set of servers or apps, and the ability to export them quickly into a json or a shell greppable format. The two export formats available today makes dashboarding and scripting a trivial task.

The image above shows a small set of applications from our dev cluster which is sorted in the order of the time when the apps were deployed. In addition to showing the host names, status of the apps, and version information, it also lists the time when the app sent the last heartbeat. What is not visible here is that it also keeps track of certain changes in a “log” which could be used to understand historical changes of a particular record over time.

While REST interface is easy to use, you could choose to use the commandline tool “cfquery”, which comes with Cfmap to interact with cfmap. Cfquery could be used to both publish and search results… lets look at some example.

Here is an example of how one could extract a list of all the hosts in cfmap.

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view | grep ":host=" | cut -d':' -f2host=team50host=ip-10-205-15-124host=torquehost=anorien

Here is a more elaborate example which shows up cfmap output could be used as parts of other scripts. In this case, the query just specifies a host “anorien” in the query. The result is a dump of all the properties set by the host. A few extra commands can quickly help you extract specific properties which can then be used as a data-source for other tools (like monitoring).

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien"

52cb892bc339f286bacbcfe9a8c8b4a6:port=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_freeswap=1999
52cb892bc339f286bacbcfe9a8c8b4a6:host=anorien
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg5m=0
52cb892bc339f286bacbcfe9a8c8b4a6:cfqversion=1.1
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_estconn=4
52cb892bc339f286bacbcfe9a8c8b4a6:type=host
52cb892bc339f286bacbcfe9a8c8b4a6:deployed_date=1286217400
52cb892bc339f286bacbcfe9a8c8b4a6:version=2.6.32-00007-g56678ec
52cb892bc339f286bacbcfe9a8c8b4a6:ip=127.0.0.1
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_pscount=101
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg15m=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavgentities=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_freemem=3
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg1m=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalswap=1999
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalmem=501
52cb892bc339f286bacbcfe9a8c8b4a6:appname=os
52cb892bc339f286bacbcfe9a8c8b4a6:checked=1286331427

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien" | grep stats_host_totalmem
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalmem=501
rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien" | grep stats_host_totalmem | cut -d'=' -f2
501

Few other interesting features

  • Schema-less design – cfmap provides a simple schema-less datastore which could be used for other purposes as well. Please note that since it was designed to maintain “state” (instead of a simple datastore API), it has a few reserved keywords which have a special meaning.
  • Low overhead to add/delete cfmap nodes – Since its built over cassandra, adding new nodes is as simple as adding new cassandra servers.
  • Configurable - The recommended way of setting up cfmap for production use would be to host cfmap (which comes with a bundled version of cassandra) on 3 or more servers. Then put them all under a single DNS entry (round robin) and let DNS loadbalancing take care of the rest.
    • If you want an even more redundancy, setup something like haproxy on each of the nodes which could also monitor and redirect traffic to alternate cfmap nodes when failures (or GCs) happen.
    • The default setup doesn’t enforce consistency during reads or writes to facilitate smooth operation even during massive network or system failures. But if you wish, you could tweak the consistency, replication requirements based on your needs.

Cfmap is still a very early prototype, but we welcome others to play with it.