Cfmap: Publishing, discovering and dashboarding infrastructure state


Dynamic infrastructure can be challenging if apps and scripts can’t keep up with it. At Ingenuity we ran into this problem when we started moving towards virtualization and SOA (service-oriented architecture). Remembering server names became impractical, and error-free manual configuration changes became impossible.

While some tools solve parts of this problem, we couldn’t find any open source tool that could both publish and discover the state of a system in a distributed, scalable and fault-tolerant way. Zookeeper, which comes pretty close to what we needed, is a fully consistent system that was not designed to be used across multiple data centers over high-latency, unstable network connections. We wanted a system that would not only stay up during network outages, but also sync state between data centers once they are reconnected.

We built a few different tools to solve our scalability problems, one of which is a tool called Cfmap which we are opensourcing today to help others facing the same problem.

So what is cfmap?

Built over cassandra, cfmap is designed to be a scalable, eventually consistent and fault-tolerant repository of state information. It provides a set of REST APIs and UIs to both publish and discover the state of an entity or a group of entities. The APIs are simple enough that you will most probably write your own custom agents for your servers and processes rather than use the agent bundled with the tool.

We have been using cfmap internally for a few months and the results are promising. Here is an example of what cfmap’s dashboard looks like on our network (I’ve changed some names to protect the actual resource names). Here is another dashboard, running in public, which you can use today as a demo.

[image: cfmap dashboard listing applications from the dev cluster]

Cfmap provides the ability to quickly drill down to a filtered set of servers or apps, and to export them into JSON or a shell-greppable format. The two export formats available today make dashboarding and scripting straightforward.

The image above shows a small set of applications from our dev cluster, sorted by the time the apps were deployed. In addition to showing the host names, status of the apps, and version information, it also lists the time when each app sent its last heartbeat. What is not visible here is that it also keeps track of certain changes in a “log” which could be used to understand historical changes of a particular record over time.
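A heartbeat timestamp makes it easy to flag records that have gone quiet. The snippet below is only a minimal sketch: it assumes the last heartbeat is stored as a Unix timestamp (in seconds) in a property named `checked`, which is what the sample cfquery output later in this post suggests but is not confirmed. The function name `is_stale` and the 300-second threshold are hypothetical.

```python
import time


def is_stale(record, max_age_seconds, now=None):
    """Return True if the record's last heartbeat is older than max_age_seconds.

    Assumption: record["checked"] holds the last heartbeat as epoch seconds.
    """
    now = time.time() if now is None else now
    last_seen = int(record["checked"])
    return (now - last_seen) > max_age_seconds


# Illustrative record; the timestamp is taken from the sample output in this post.
record = {"host": "anorien", "checked": "1286331427"}

# One minute after the heartbeat the record is still fresh.
print(is_stale(record, 300, now=1286331427 + 60))
# Ten minutes after the heartbeat it counts as stale.
print(is_stale(record, 300, now=1286331427 + 600))
```

Passing `now` explicitly keeps the check easy to test; a monitoring script would normally leave it out and use the current time.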

While the REST interface is easy to use, you can also use the command-line tool “cfquery”, which comes with cfmap. Cfquery can be used to both publish and search records. Let’s look at some examples.

Here is an example of how one could extract a list of all the hosts in cfmap.

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view | grep ":host=" | cut -d':' -f2
host=team50
host=ip-10-205-15-124
host=torque
host=anorien
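The same extraction can be done from a script instead of a shell pipeline. This is a rough sketch that assumes each line of the view output has the form `id:key=value`; the function name `list_hosts` and the record ids in the sample are made up for illustration.

```python
def list_hosts(lines):
    """Return the distinct host names found in 'id:key=value' lines, in order."""
    hosts = []
    for line in lines:
        line = line.strip()
        if ":" not in line:
            continue  # skip blank or malformed lines
        _record_id, _, prop = line.partition(":")
        key, sep, value = prop.partition("=")
        if sep and key == "host" and value not in hosts:
            hosts.append(value)
    return hosts


# Hypothetical input; real ids are long hex strings.
sample = [
    "id1:host=team50",
    "id1:type=host",
    "id2:host=torque",
    "id3:host=anorien",
    "id3:appname=os",
]

print(list_hosts(sample))
```

In practice the lines would come from the captured output of `cfquery.pl -c view` rather than a hard-coded list.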

Here is a more elaborate example which shows how cfmap output could be used as part of other scripts. In this case, the query just specifies the host “anorien”. The result is a dump of all the properties set by that host. A few extra commands can quickly extract specific properties, which can then be used as a data source for other tools (like monitoring).

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien"

52cb892bc339f286bacbcfe9a8c8b4a6:port=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_freeswap=1999
52cb892bc339f286bacbcfe9a8c8b4a6:host=anorien
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg5m=0
52cb892bc339f286bacbcfe9a8c8b4a6:cfqversion=1.1
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_estconn=4
52cb892bc339f286bacbcfe9a8c8b4a6:type=host
52cb892bc339f286bacbcfe9a8c8b4a6:deployed_date=1286217400
52cb892bc339f286bacbcfe9a8c8b4a6:version=2.6.32-00007-g56678ec
52cb892bc339f286bacbcfe9a8c8b4a6:ip=127.0.0.1
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_pscount=101
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg15m=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavgentities=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_freemem=3
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg1m=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalswap=1999
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalmem=501
52cb892bc339f286bacbcfe9a8c8b4a6:appname=os
52cb892bc339f286bacbcfe9a8c8b4a6:checked=1286331427

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien" | grep stats_host_totalmem
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalmem=501
rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien" | grep stats_host_totalmem | cut -d'=' -f2
501
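If a tool needs more than one property, it may be easier to load the whole dump into a dictionary than to run grep and cut once per value. The following is a sketch under the assumption that every output line looks like `id:key=value`, as in the dump above; `parse_records` is a hypothetical helper, not part of cfmap.

```python
from collections import defaultdict


def parse_records(text):
    """Group 'id:key=value' lines into {record_id: {key: value}}."""
    records = defaultdict(dict)
    for line in text.splitlines():
        line = line.strip()
        if not line or ":" not in line:
            continue  # ignore blank lines and anything without a record id
        record_id, _, prop = line.partition(":")
        key, sep, value = prop.partition("=")
        if sep:
            records[record_id][key] = value
    return dict(records)


# A few lines copied from the output above.
sample = """\
52cb892bc339f286bacbcfe9a8c8b4a6:host=anorien
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalmem=501
52cb892bc339f286bacbcfe9a8c8b4a6:version=2.6.32-00007-g56678ec
"""

records = parse_records(sample)
host = records["52cb892bc339f286bacbcfe9a8c8b4a6"]
print(host["stats_host_totalmem"])
```

All values stay as strings, so a monitoring script would convert numeric properties such as `stats_host_totalmem` itself.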

A few other interesting features

  • Schema-less design – cfmap provides a simple schema-less datastore which could be used for other purposes as well. Please note that since it was designed to maintain “state” (instead of a simple datastore API), it has a few reserved keywords which have a special meaning.
  • Low overhead to add/delete cfmap nodes – Since it’s built over cassandra, adding new nodes is as simple as adding new cassandra servers.
  • Configurable - The recommended way of setting up cfmap for production use would be to host cfmap (which comes with a bundled version of cassandra) on 3 or more servers. Then put them all under a single DNS entry (round robin) and let DNS loadbalancing take care of the rest.
    • If you want even more redundancy, set up something like haproxy on each of the nodes, which could also monitor and redirect traffic to alternate cfmap nodes when failures (or GCs) happen.
    • The default setup doesn’t enforce consistency during reads or writes, to facilitate smooth operation even during massive network or system failures. But if you wish, you could tweak the consistency and replication requirements based on your needs.
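Failover can also be handled on the client side, in a custom agent or script. The sketch below shows one possible approach and is not how cfmap or haproxy work internally: it tries a list of nodes in order and returns the first successful answer. The host names and the `fetch` callable are placeholders; a real agent would pass in a function that calls the cfmap REST API, which is not shown here.

```python
def query_with_failover(nodes, fetch):
    """Call fetch(node) for each node in order and return the first success.

    Raises ConnectionError if every node fails (or the list is empty).
    """
    last_error = None
    for node in nodes:
        try:
            return fetch(node)
        except OSError as exc:  # covers refused connections, timeouts, etc.
            last_error = exc
    raise ConnectionError(f"all cfmap nodes failed: {nodes}") from last_error


# Stand-in for a real HTTP call: the first node is treated as down.
def fake_fetch(node):
    if node == "cfmap1.example.com":
        raise OSError("connection refused")
    return f"response from {node}"


nodes = ["cfmap1.example.com", "cfmap2.example.com"]
print(query_with_failover(nodes, fake_fetch))
```

Because the default setup favors availability over consistency, an answer from any reachable node is acceptable here, though it may briefly lag behind the others.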

Cfmap is still a very early prototype, but we welcome others to play with it.

2 comments

    1. We looked at it before trying to build this. We observed that zookeeper was a fully consistent system, and wasn’t designed to be used over high-latency, unstable network links (we have multiple data centers across the continent).

      We wanted some of the capabilities of zookeeper but were OK with an inconsistent system which self-heals over time. Cfmap wasn’t designed to replace zookeeper, and it’s possible there will be networks where both of these could be used in tandem.

      Think of cfmap as an eventually consistent state messaging platform.
