Splunk: Fastest way to get a web operations dashboard running

This is a cross-post from my personal blog.

A few weeks ago I asked a question on Quora about log aggregation. I was surprised to find that no open-source solution came close to what I wanted, but I got a lot of suggestions to try out Splunk. So I did.

What I wanted was an aggregation tool that collects, displays, and alerts based on events logged by the various web servers across the network, which could be in different datacenters. The organization where I set this up was generating about 300 MB of production HAProxy logs per day and around 200 MB of non-prod logs. Here is why Splunk fit this organization very well.

1) Log aggregation across multiple servers/datacenters – The organization had already solved this problem by piping HAProxy logs around using syslog-ng, with a little bit of filtering to discard logs that aren't interesting before they reach Splunk. Syslog-ng can be configured to use TCP instead of UDP to make log delivery reliable. Splunk is capable of collecting logs with its own remote agents as well… but sending raw, unfiltered logs to it might increase the licensing costs.
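A minimal syslog-ng sketch of the setup described above: receive HAProxy logs, drop the uninteresting ones, and forward the rest over TCP. The names (`s_haproxy`, `splunk.example.com`, port numbers, the `health_check` pattern) are all illustrative, not from the actual deployment.

```
# Hypothetical syslog-ng configuration sketch
source s_haproxy {
    udp(ip(0.0.0.0) port(514));           # HAProxy logs arrive via syslog/UDP
};

filter f_interesting {
    # discard noise (e.g. load-balancer health checks) before it
    # reaches Splunk and counts against the indexing license
    not match("health_check" value("MESSAGE"));
};

destination d_splunk {
    tcp("splunk.example.com" port(1514)); # TCP for reliable delivery
};

log { source(s_haproxy); filter(f_interesting); destination(d_splunk); };
```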
2) Realtime dashboard – Splunk is a memory and CPU hog, but for smaller amounts of logs the true realtime dashboard works beautifully. Even with multiple syslog-ng and Splunk servers involved in the log flow, I was able to see realtime graphical dashboards updated within 5 to 10 seconds of the actual requests. That's pretty impressive, though it may not be practical for high-volume websites. Generating dashboards which don't update automatically is a more realistic use of Splunk's resources, and this again works pretty well as long as too many people aren't trying to use it at the same time.
3) Querying/Filtering/Analyzing – Splunk's query language is very different from SQL, but there are cheatsheets available to help you create queries. The language is very powerful and is perhaps the toughest part of the learning curve. The results from these queries can be sent to web dashboards or to alerting agents which can trigger emails/pages based on pre-defined conditions.
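As an illustration of the query language, here is the kind of search you might feed a dashboard or an alert. The `sourcetype` and field names (`haproxy`, `status`) are assumptions; they depend on how your logs were indexed and extracted.

```
sourcetype=haproxy status>=500
| timechart span=5m count by status
```

A search like this, counting 5xx responses in five-minute buckets, can back a chart panel directly, or be saved as an alert that emails/pages when the count crosses a threshold.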
4) It's important to note that Splunk is not just for HTTP logs, so it has to be taught to generate the reports you'd like. Unlike something like AWStats, you have to write your own queries and dashboards (which are defined in XML). There is extensive documentation available, and the support guys were very helpful when I called. On the other hand, if all you wanted was an AWStats-like dashboard, you could just use Google Analytics.
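To give a flavor of the XML dashboards mentioned above, here is a rough sketch of a single-panel dashboard in Splunk's simple XML. The exact tag names vary across Splunk versions, and the search string and label are made up for illustration.

```xml
<dashboard>
  <label>HAProxy Overview</label>
  <row>
    <panel>
      <chart>
        <search>
          <query>sourcetype=haproxy | timechart span=5m count by status</query>
          <earliest>-24h</earliest>
          <latest>now</latest>
        </search>
      </chart>
    </panel>
  </row>
</dashboard>
```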
5) Free/Commercial versions – While the free version can do most of the stuff, there are some key enterprise features for which I'd recommend buying the commercial version. Authentication, LDAP integration, alerting features, federation, etc. are some of the features missing in the free edition. Oh, and phone support.

I’m still not convinced that Splunk is scalable. The biggest issue is that the cost of maintaining Splunk goes up with the amount of logs generated per day. Hardware costs and licensing costs will at some point cross the cost of developing/architecting/setting up something like Hadoop/Flume/Hive/OpenTSDB/etc. in your own network. But unless you are a big shop, it might be a good idea to postpone that discussion until you really need to have it.

S4: Distributed Stream Computing Platform

A few weeks ago I mentioned that Yahoo! Labs was working on something called S4 for real-time data analysis. Yesterday they released an 8-page paper with a detailed description of how and why they built it.

It’s interesting to note that the authors compared S4 with MapReduce and explained that MapReduce was too optimized for batch processing and wasn’t the best fit for real-time computation. They also made an architectural decision not to build a single system that handles both offline (batch) processing and real-time processing, since they feared such a system would end up being good at neither. Here is the abstract from the paper.

S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model [1], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail, describe various applications, including real-life deployments. Our design is primarily driven by large scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to run in large clusters built with commodity hardware.
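The keyed-routing idea from the abstract can be sketched in a few lines of Python: each distinct key gets its own Processing Element, and every event for that key is dispatched to the same PE instance. This is a toy single-process illustration of the concept, not the actual S4 API; the class and field names are made up.

```python
class WordCountPE:
    """One PE instance per distinct key; consumes events, keeps a running count."""

    def __init__(self, key):
        self.key = key
        self.count = 0

    def process(self, event):
        self.count += 1          # "publish results" step, reduced to a counter
        return self.count


class Cluster:
    """Routes each keyed event to the PE that owns its key, creating PEs on demand."""

    def __init__(self, pe_class):
        self.pe_class = pe_class
        self.pes = {}            # key -> PE instance (the affinity mapping)

    def dispatch(self, event):
        key = event["key"]
        pe = self.pes.setdefault(key, self.pe_class(key))
        return pe.process(event)


cluster = Cluster(WordCountPE)
for word in ["the", "quick", "the", "fox", "the"]:
    cluster.dispatch({"key": word})

print(cluster.pes["the"].count)  # 3
print(len(cluster.pes))          # 3 distinct keys, so 3 PE instances
```

In real S4 the `pes` dictionary is partitioned across machines by hashing the key, which is what makes the model scale while still guaranteeing that all events for one key land on the same PE.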

Code: http://s4.io/

Authors: Neumeyer L, Robbins B, Nair A, Kesari A
Source: Yahoo! Labs