Scaling Graphite by using Cfmap as the data transport

Graphite is an extremely promising system and resource graphing tool which tries to take RRD to the next level. Here are some of the Graphite features I found most interesting.

  • Updates can happen to records in the past (RRD doesn’t allow this, I believe)
  • Creating new datasets is trivial with whisper/carbon (they’re part of the Graphite framework)
  • Graphite allows federated storage (multiple servers across multiple datacenters, for example)

Monitoring and graphing resources across data centers is tricky, however, especially because WAN links cannot be trusted. While losing some data may be OK for some folks, it may not be acceptable for others. Graphite tries to solve this problem by providing an option to federate data across multiple servers, each of which could be kept in a separate datacenter.

Another way to solve this problem is to use a data transport which is resilient to network failures.

Since Cfmap (thanks to Cassandra) is a distributed, eventually consistent state repository, it can easily be extended to act as an eventually consistent data queue for tools like Graphite. Take a look at the example of a server record below. With some minor modifications, we were able to log and publish all changes to system attributes using an API like this. And with the right script running on the graphite server, importing these stats into carbon (a component of Graphite) became a trivial task.

[image: example of a Cfmap server record]
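
To give an idea of what that graphite-side script can look like, here is a minimal sketch (an illustration, not the script we actually run). It assumes cfquery is available on the graphite box, uses the host “anorien” from the cfquery examples later in this post, and pushes each stats_* attribute to carbon’s plaintext line receiver, which by default listens on TCP port 2003 and accepts lines of the form “metric.path value timestamp”.

#!/bin/sh
# Hypothetical cfmap-to-carbon feeder; host names and paths are placeholders.
CARBON_HOST=localhost      # carbon-cache, usually on the graphite server itself
CARBON_PORT=2003           # carbon's default plaintext line-receiver port
HOST=anorien               # the cfmap host record we want to graph
NOW=$(date +%s)

# cfquery prints lines like "<record-id>:stats_host_freemem=3"; turn each
# stats_* attribute into a "metric value timestamp" line and send the batch
# to carbon with netcat (depending on your nc flavour you may need -q1 so it
# exits after sending).
./cfquery.pl -c view -p "host=$HOST" \
  | grep ':stats_' \
  | cut -d':' -f2 \
  | while IFS='=' read -r key value; do
      echo "cfmap.$HOST.$key $value $NOW"
    done \
  | nc "$CARBON_HOST" "$CARBON_PORT"

Run from cron every minute or so, something like this keeps carbon fed from whatever cfmap has most recently seen, without the producers ever needing a direct connection to the graphite server.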

With Cfmap’s easy REST interface, adding new stats to Graphite becomes as simple as registering them with cfmap from anywhere on the network. [ sample script ]

Graphite is not in use on our corporate network today, but I’m extremely excited about the possibilities and will be actively looking at how else we could use it.

[ Take a look at RabbitMQ integration with Graphite for another interesting way of working with graphite ]

Cfmap: Publishing, discovering and dashboarding infrastructure state


Dynamic infrastructure can be challenging if apps and scripts can’t keep up with it. At Ingenuity we ran into this problem when we started moving towards virtualization and SOA (service-oriented architecture). Remembering server names became impractical, and error-free manual configuration changes became impossible.

While there are some tools which solve parts of this specific problem, we couldn’t find any open-source tool which could be used to both publish and discover the state of a system in a distributed, scalable and fault-tolerant way. ZooKeeper, which comes pretty close to what we needed, is a fully consistent system that was not designed to be used across multiple data centers over high-latency, unstable network connections. We wanted a system which could not only stay up during network outages, but also sync up state from different data centers once they are reconnected.

We built a few different tools to solve our scalability problems, one of which is Cfmap, which we are open-sourcing today to help others facing the same problem.

So what is cfmap?

Built on top of Cassandra, cfmap is designed to be a scalable, eventually consistent and fault-tolerant repository of state information. It provides a set of REST APIs and UIs to publish and discover the state of an entity or a group of entities with great ease. The APIs are so simple that you will most probably end up writing your own custom agents for your various servers and processes rather than using the agent which comes bundled with the tool.
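
As a taste of what such a custom agent might look like, here is a hypothetical shell sketch. The publish URL and form-field names below are placeholders I invented for illustration (check the bundled agent or the API documentation for the real ones); the attribute names mirror the stats_host_* keys you will see in the cfquery output further down.

#!/bin/sh
# Hypothetical cfmap agent sketch; the endpoint and parameter names are made up.
CFMAP_URL="http://cfmap.example.com:8080/publish"   # placeholder URL and port

HOST=$(hostname -s)
LOADAVG1M=$(awk '{print $1}' /proc/loadavg)
FREEMEM_MB=$(awk '/^MemFree/ {print int($2/1024)}' /proc/meminfo)

# Publish a handful of host attributes; add whatever your dashboards need.
curl --silent \
     --data "host=$HOST" \
     --data "type=host" \
     --data "appname=os" \
     --data "stats_host_loadavg1m=$LOADAVG1M" \
     --data "stats_host_freemem=$FREEMEM_MB" \
     "$CFMAP_URL"

Run from cron, a few lines like this would be enough to keep a host’s record fresh.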

We have been using cfmap internally for a few months and the results are promising. Here is an example of what cfmap’s dashboard looks like on our network (I’ve changed some names to protect the actual resource names). Here is another dashboard running publicly which you can use today as a demo.

[image: Cfmap dashboard showing applications from our dev cluster]

Cfmap provides the ability to quickly drill down to a filtered set of servers or apps, and to export them quickly into JSON or a shell-greppable format. The two export formats available today make dashboarding and scripting trivial.

The image above shows a small set of applications from our dev cluster, sorted by the time the apps were deployed. In addition to showing the host names, the status of the apps, and version information, it also lists the time each app sent its last heartbeat. What is not visible here is that cfmap also keeps track of certain changes in a “log” which can be used to understand how a particular record has changed over time.

While the REST interface is easy to use, you can also interact with cfmap through the command-line tool “cfquery”, which comes bundled with it. Cfquery can be used both to publish and to search… let’s look at some examples.

Here is an example of how one could extract a list of all the hosts in cfmap.

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view | grep ":host=" | cut -d':' -f2
host=team50
host=ip-10-205-15-124
host=torque
host=anorien

Here is a more elaborate example which shows how cfmap output could be used as part of other scripts. In this case, the query just specifies the host “anorien”. The result is a dump of all the properties set by that host. A few extra commands can quickly extract specific properties, which can then be used as a data source for other tools (like monitoring).

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien"

52cb892bc339f286bacbcfe9a8c8b4a6:port=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_freeswap=1999
52cb892bc339f286bacbcfe9a8c8b4a6:host=anorien
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg5m=0
52cb892bc339f286bacbcfe9a8c8b4a6:cfqversion=1.1
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_estconn=4
52cb892bc339f286bacbcfe9a8c8b4a6:type=host
52cb892bc339f286bacbcfe9a8c8b4a6:deployed_date=1286217400
52cb892bc339f286bacbcfe9a8c8b4a6:version=2.6.32-00007-g56678ec
52cb892bc339f286bacbcfe9a8c8b4a6:ip=127.0.0.1
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_pscount=101
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg15m=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavgentities=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_freemem=3
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_loadavg1m=0
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalswap=1999
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalmem=501
52cb892bc339f286bacbcfe9a8c8b4a6:appname=os
52cb892bc339f286bacbcfe9a8c8b4a6:checked=1286331427

rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien" | grep stats_host_totalmem
52cb892bc339f286bacbcfe9a8c8b4a6:stats_host_totalmem=501
rkt@torque:~/cc/cfmap/bin$ ./cfquery.pl -c view -p "host=anorien" | grep stats_host_totalmem | cut -d'=' -f2
501
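
Since the extracted value is just a number on stdout, wiring it into a monitoring system takes only a few more lines of shell. Here is an illustrative Nagios-style check built on the same pipeline; the host name, the threshold, and the assumption that stats_host_freemem is reported in megabytes are examples rather than anything cfmap prescribes.

#!/bin/sh
# Hypothetical check: warn when the free memory a host last reported to
# cfmap drops below a threshold (values assumed to be in MB).
HOST=anorien
THRESHOLD_MB=100

FREEMEM=$(./cfquery.pl -c view -p "host=$HOST" \
            | grep stats_host_freemem \
            | cut -d'=' -f2)

if [ -z "$FREEMEM" ]; then
    echo "UNKNOWN: no stats_host_freemem found for $HOST in cfmap"
    exit 3                      # Nagios exit code for UNKNOWN
elif [ "$FREEMEM" -lt "$THRESHOLD_MB" ]; then
    echo "WARNING: $HOST last reported ${FREEMEM}MB free"
    exit 1                      # Nagios exit code for WARNING
else
    echo "OK: $HOST last reported ${FREEMEM}MB free"
    exit 0
fi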

A few other interesting features

  • Schema-less design – cfmap provides a simple schema-less datastore which could be used for other purposes as well. Note that since it was designed to maintain “state” (rather than to be a general-purpose datastore), it has a few reserved keywords with special meanings.
  • Low overhead to add/delete cfmap nodes – Since it’s built on Cassandra, adding new nodes is as simple as adding new Cassandra servers.
  • Configurable – The recommended way of setting up cfmap for production use is to host cfmap (which comes with a bundled version of Cassandra) on 3 or more servers, put them all under a single DNS entry (round robin) and let DNS load balancing take care of the rest.
    • If you want even more redundancy, set up something like haproxy on each node, which can monitor and redirect traffic to alternate cfmap nodes when failures (or GC pauses) happen; a minimal configuration sketch follows this list.
    • The default setup doesn’t enforce consistency during reads or writes, to facilitate smooth operation even during massive network or system failures. But if you wish, you can tweak the consistency and replication requirements to suit your needs.
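
For the haproxy option mentioned above, a minimal sketch of the per-node configuration might look like the following. The port numbers and host names are placeholders for whatever your cfmap nodes actually use; the idea is simply that clients on each node talk to 127.0.0.1, and haproxy health-checks the local cfmap, falling back to the other nodes when it is down or stuck in a GC pause.

# haproxy.cfg sketch on node cfmap1 (ports and host names are placeholders)
defaults
    mode http
    timeout connect 5s
    timeout client  30s
    timeout server  30s

frontend cfmap_local
    bind 127.0.0.1:8080
    default_backend cfmap_nodes

backend cfmap_nodes
    option httpchk GET /
    server cfmap1 cfmap1.example.com:8081 check
    server cfmap2 cfmap2.example.com:8081 check backup
    server cfmap3 cfmap3.example.com:8081 check backup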

Cfmap is still a very early prototype, but we welcome others to play with it.