This is a cross-post from my personal blog.
Few weeks ago I asked a question on quora about log aggregation. I was surprised to find that no opensource solution came close to what I wanted, but I got a lot of suggessions to try out splunk. So I did.
What I wanted was an aggregation tool which collects, displays and alerts based on events logged by the various webservers across the network which could be in different datacenters. The organization where I set this up was generating about 300mb of production haproxy logs per day and something around 200mb of non-prod logs. Here is why splunk fit very well in this organization.
1) Log aggregation across multiple servers/datacenters- The organization had already solved this problem by piping haproxy logs using syslog-ng. They used a little bit of filtering to discard logs which are not interesting for splunk. Syslog-ng can be configured to use tcp instead of udp to make log delivery reliable. Splunk is capable of working as remote agents as well… but sending raw logs to it might increase the licensing costs.
2) Realtime dashboard – Splunk is a memory and cpu hog, but for smaller amount of logs, true realtime dashboard works beautifully. Even with multiple syslog-ng and splunk servers involved in the log fow, I was able to see realtime graphical dashboards updated within 5 to 10 seconds of the actual requests. Thats pretty impressive and may not too useful for high volume websites. Generating realtime dashboards which don’t update automatically is a more realistic use of splunks resource, and this again works pretty well as long as too many people are not trying to use it at the same time.
3) Querying/Filtering/Analyzing – Splunk’s querying language is very different from SQL but there are cheatsheets available to help you create queries. This querying language is very powerful and is perhaps the toughest part of the learning curve. The results from these queries can be sent to web dashboards or to alerting agents which can trigger emails/pages based on pre-defined conditions.
4) Its important to note that splunk is not just for http logs. So it has to be trained to generate reports you would like. Unlike something like awstats you would have to write your own queries and dashboards (which are in XML). There is extensive documentation available, and the support guys were very helpful when I called. On the other hand if all you wanted was awstats like dashboard you could just used google analytics.
5) Free/Commercial versions – While the free version can do most of the stuff there are some key enterprise features for which I’ll recommend buying the commercial version. Authentication, LDAP integration, Alerting features, Federation, etc are some of the features which are missing in free edition. Oh, and phone support.
I’m still not convinced that splunk is scalable.. the biggest issue with splunk is that the cost of maintaining splunk goes up with amount of logs generated per day. Hardware costs, and licensing costs at some point will cross the cost of developing/architecting/setting_up something like hadoop/flume/hive/opentsdb/etc in your own network. But unless you are a big shop, it might be a good idea to postpone that discussion until u really need to do it.