Splunk: The fastest way to get a web operations dashboard running

A few weeks ago I asked a question on Quora about log aggregation. I was surprised to find no open-source solution that came close to what I wanted, but I got a lot of suggestions from different people to try out Splunk. So I did.

What I wanted was an aggregation tool that collects, displays, and alerts on events logged by the various web servers across the network, which could be in different datacenters. The organization where I set this up was generating about 300 MB of production haproxy logs per day and around 200 MB of non-production logs. Here is why Splunk fit very well in this organization.

1) Log aggregation across multiple servers/datacenters – The organization had already solved this problem by shipping haproxy logs with syslog-ng. They used a little bit of filtering to discard logs that were not interesting for Splunk. Syslog-ng can be configured to use TCP instead of UDP to make log delivery more reliable. Splunk can also run as a remote agent on each server, but sending raw, unfiltered logs to it might increase the licensing costs.
2) Realtime dashboard – Splunk is a memory and CPU hog, but for smaller volumes of logs, a true realtime dashboard works beautifully. Even with multiple syslog-ng and Splunk servers in between, I was able to see realtime graphical dashboards update within 5 to 10 seconds of the actual requests. That’s pretty impressive, though it may not be too useful for high-volume websites. Dashboards that show recent data but don’t update automatically are a more realistic use of Splunk’s resources, and these again work pretty well as long as not too many people are trying to use them at the same time.
3) Querying/Filtering/Analyzing – Splunk’s query language is very different from SQL, but there are cheat sheets available to help you write queries. The language is very powerful and is perhaps the toughest part of the learning curve. The results of these queries can be sent to dashboards or to alerting agents that trigger emails/pages based on conditions.
4) It’s important to note that Splunk is not just for HTTP logs, so it has to be taught to generate the reports you want. Unlike something like awstats, you have to write your own queries and dashboards (which are in XML). On the other hand, if all you wanted was an awstats-like dashboard, you could just use something like Google Analytics.
5) Free/Commercial versions – While the free version can do most of the stuff, there are some key enterprise features for which I’d recommend buying the commercial version. Authentication, LDAP integration, alerting, federation, etc. are some of the features missing from the free edition.
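
To make points 1 and 3 a bit more concrete, here is a rough Python sketch of the two steps involved: discarding uninteresting lines before they are indexed, then counting requests per minute by status class, which is the kind of aggregation a dashboard panel would chart. This is not what syslog-ng or Splunk actually run. The sample lines, the host name lb1, the /health path, and the function names are all made up for illustration, and the regex assumes haproxy's default HTTP log format, so a customized format would need a different pattern.

```python
import re
from collections import Counter

# Hypothetical sample lines in haproxy's default HTTP log format.
SAMPLE_LINES = [
    'Feb  6 12:14:14 lb1 haproxy[14389]: 10.0.1.2:33317 [06/Feb/2009:12:14:14.655] '
    'http-in static/srv1 10/0/30/69/109 200 2750 - - ---- 1/1/1/1/0 0/0 "GET /index.html HTTP/1.1"',
    'Feb  6 12:14:20 lb1 haproxy[14389]: 10.0.1.3:41200 [06/Feb/2009:12:14:20.101] '
    'http-in app/srv2 5/0/12/-1/2017 503 212 - - sH-- 2/2/1/1/0 0/0 "GET /api/orders HTTP/1.1"',
    'Feb  6 12:14:31 lb1 haproxy[14389]: 10.0.1.9:50000 [06/Feb/2009:12:14:31.000] '
    'http-in app/srv2 0/0/1/2/3 200 15 - - ---- 1/1/0/0/0 0/0 "GET /health HTTP/1.1"',
    'Feb  6 12:15:02 lb1 haproxy[14389]: 10.0.1.2:33400 [06/Feb/2009:12:15:02.410] '
    'http-in static/srv1 3/0/2/4/9 404 310 - - ---- 1/1/0/0/0 0/0 "GET /missing HTTP/1.1"',
    'this line is not a haproxy request log',
]

LINE_RE = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):(?P<hhmm>\d{2}:\d{2}):\d{2}\.\d{3}\]'  # accept time
    r' \S+ \S+'                         # frontend and backend/server
    r' [\d/+-]+'                        # timers, e.g. 10/0/30/69/109
    r' (?P<status>\d{3}) '              # HTTP status code
    r'.*"(?P<method>\S+) (?P<path>\S+)' # start of the quoted request line
)

# Assumption: health checks are the "uninteresting" traffic worth dropping.
IGNORED_PATHS = {"/health"}


def summarize(lines):
    """Count requests per (minute, status class), skipping ignored paths."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # not a request log line
        if match.group("path") in IGNORED_PATHS:
            continue  # filtered out before it would be indexed
        minute = f'{match.group("day")} {match.group("hhmm")}'
        status_class = match.group("status")[0] + "xx"
        counts[(minute, status_class)] += 1
    return counts


if __name__ == "__main__":
    for (minute, status_class), count in sorted(summarize(SAMPLE_LINES).items()):
        print(minute, status_class, count)
```

Dropping lines before they reach the indexer is what keeps the daily volume, and therefore the license cost, down. The per-minute counts are the sort of series you would otherwise express as a query and render on a dashboard.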

I’m still not convinced that Splunk is scalable. The biggest issue is that the cost of maintaining Splunk goes up with the volume of logs. At some point, hardware and licensing costs will exceed the cost of developing, architecting, and setting up something like Hadoop/Flume/Hive/OpenTSDB/etc.
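
A back-of-the-envelope way to think about this is to project daily log volume forward and see when it outgrows whatever daily indexing quota you are paying for. The sketch below uses the 300 MB and 200 MB figures from this post. The 1000 MB/day quota and the 10% monthly growth rate are purely hypothetical placeholders, not Splunk pricing, so plug in your own numbers.

```python
PROD_MB_PER_DAY = 300      # production haproxy logs, from the post
NON_PROD_MB_PER_DAY = 200  # non-production logs, from the post


def months_until_quota_exceeded(daily_mb, quota_mb, monthly_growth):
    """Return how many months of compound growth it takes for the daily
    volume to exceed the quota, or None if it never would."""
    if daily_mb > quota_mb:
        return 0
    if monthly_growth <= 0:
        return None  # volume is flat or shrinking, so it never crosses
    months = 0
    volume = daily_mb
    while volume <= quota_mb:
        volume *= 1 + monthly_growth
        months += 1
    return months


if __name__ == "__main__":
    daily_mb = PROD_MB_PER_DAY + NON_PROD_MB_PER_DAY
    # Hypothetical inputs: 1000 MB/day quota, 10% growth per month.
    months = months_until_quota_exceeded(daily_mb, quota_mb=1000, monthly_growth=0.10)
    print(f"{daily_mb} MB/day today; quota exceeded after {months} months")
```

The same loop works for hardware capacity if you swap the quota for what a single indexer can comfortably handle.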