Splunk: Fastest way to get a web operations dashboard running

This is a cross-post from my personal blog.

A few weeks ago I asked a question on Quora about log aggregation. I was surprised to find that no open-source solution came close to what I wanted, but I got a lot of suggestions to try out Splunk. So I did.

What I wanted was an aggregation tool that collects, displays and alerts on events logged by the various webservers across the network, which could be in different datacenters. The organization where I set this up was generating about 300 MB of production HAProxy logs per day and around 200 MB of non-prod logs. Here is why Splunk fit this organization very well.

1) Log aggregation across multiple servers/datacenters – The organization had already solved this problem by piping HAProxy logs through syslog-ng, with a little filtering to discard logs that aren't interesting to Splunk. syslog-ng can be configured to use TCP instead of UDP to make log delivery more reliable (a minimal syslog-ng sketch follows this list). Splunk can also run its own remote agents, but sending raw logs to it might increase the licensing costs.
2) Realtime dashboard – Splunk is a memory and CPU hog, but for smaller log volumes the true realtime dashboard works beautifully. Even with multiple syslog-ng and Splunk servers involved in the log flow, I was able to see realtime graphical dashboards updated within 5 to 10 seconds of the actual requests. That's pretty impressive, though it may not be too useful for high-volume websites. Generating dashboards which don't update automatically is a more realistic use of Splunk's resources, and this again works pretty well as long as too many people aren't trying to use it at the same time.
3) Querying/Filtering/Analyzing – Splunk's query language is very different from SQL, but there are cheatsheets available to help you create queries. The query language is very powerful and is perhaps the toughest part of the learning curve (an example search follows this list). The results from these queries can be sent to web dashboards or to alerting agents which trigger emails/pages based on pre-defined conditions.
4) It's important to note that Splunk is not just for HTTP logs, so it has to be trained to generate the reports you'd like. Unlike something like AWStats, you have to write your own queries and dashboards (which are defined in XML). There is extensive documentation available, and the support folks were very helpful when I called. On the other hand, if all you wanted was an AWStats-like dashboard, you could just use Google Analytics.
5) Free/Commercial versions – While the free version can do most of the stuff, there are some key enterprise features for which I'd recommend buying the commercial version. Authentication, LDAP integration, alerting features, federation, etc. are missing in the free edition. Oh, and phone support.
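
To give a feel for the syslog-ng side of point 1, here is a minimal sketch of a forwarding configuration. The hostnames, ports, facility and filter pattern are placeholders, and I'm assuming syslog-ng 3.x syntax; adjust for whatever your environment actually runs.

    # syslog-ng.conf on a webserver (sketch only)
    source s_haproxy {
        udp(ip(127.0.0.1) port(514));                    # HAProxy logs to the local syslog port
    };

    filter f_interesting {
        facility(local0)                                 # the facility HAProxy logs to
        and not match("health-check" value("MESSAGE"));  # drop noise Splunk doesn't need
    };

    destination d_central {
        tcp("syslog.example.com" port(5140));            # TCP for more reliable delivery
    };

    log { source(s_haproxy); filter(f_interesting); destination(d_central); };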
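
And for the query language in point 3, here is the kind of search I mean. The sourcetype name and the extracted status field are assumptions about how your HAProxy logs are indexed, so treat this as a sketch rather than a copy-paste query:

    sourcetype=haproxy status>=500
    | timechart span=5m count as errors

A saved search like this can feed a dashboard panel, or an alert that emails/pages someone when the error count crosses a threshold.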

I'm still not convinced that Splunk is scalable. The biggest issue is that the cost of running Splunk goes up with the amount of logs generated per day. At some point hardware and licensing costs will cross the cost of developing/architecting/setting up something like Hadoop/Flume/Hive/OpenTSDB/etc. in your own network. But unless you are a big shop, it might be a good idea to postpone that discussion until you really need to have it.

Scalable logging using Syslog

Syslog is a commonly used transport mechanism for system logs. But people sometimes forget it could be used for a lot of other purposes as well.

Take, for example, the interesting challenge of aggregating web server logs from 100 different servers onto one server and then figuring out how to merge them. If you have built your own tool to do this, you will have figured out by now how expensive it is to poll all the servers and how out-of-date the logs can get by the time you process them. If you are not inserting them into some kind of datastore which sorts rows by timestamp, you also have to take up the challenge of building a merge-sort script.

There is nothing stopping applications from using syslog as well. If your apps are in Java, you should try out the syslog appender for log4j [Ref 1] [Ref 2]. Not only do you get central logging, you also get to see a real-time “tail -f” of events as they happen in a merged file. If there are issues anywhere in your network, you have just one place to look. If your logging volume is high, you will have to use other tools (or build your own) to do log analysis.
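
The change really is small if the app already uses log4j. Here is a sketch of what the log4j.properties entries look like with the bundled SyslogAppender; the hostname and facility below are placeholders:

    # log4j.properties (sketch) – point the appender at your syslog server
    log4j.rootLogger=INFO, SYSLOG
    log4j.appender.SYSLOG=org.apache.log4j.net.SyslogAppender
    log4j.appender.SYSLOG.syslogHost=syslog.example.com
    log4j.appender.SYSLOG.facility=LOCAL1
    log4j.appender.SYSLOG.layout=org.apache.log4j.PatternLayout
    log4j.appender.SYSLOG.layout.ConversionPattern=%p [%t] %c: %m%n

Note that SyslogAppender sends over UDP, which ties into a couple of the caveats below.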

Here are some things you might have to think about if you plan to use syslog for your environment.

  1. Set up separate syslog servers for each of your datacenters, using split DNS or different hostnames.
  2. Try not to send logs across WAN links
  3. Rotate logs nightly, or more often depending on log volume (a sample logrotate stanza follows this list).
  4. Reduce the amount of logging (don't log at "debug" level in production, for example).
  5. Write tools to detect changes in logging volume in your dev/qa environments. If you follow good logging practices, you should be able to quickly identify the components responsible for an increase.
  6. Identify log patterns which could be causes of concern and set up some kind of alerting using your regular monitoring service (Nagios, for example). Don't be afraid to use 3rd-party tools which do this very well.
  7. Syslog over UDP is non-blocking, but the syslog server can get overloaded if logging volume is not controlled. The most expensive part of logging is disk i/o; if you notice high i/o on the syslog server, cutting log volume is usually the first thing to try.
  8. UDP doesn’t guarantee that every log event will make it to the syslog server. Find out if that level of uncertainty in logging is ok for your environment.
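
For the rotation in point 3, plain old logrotate is usually enough. This is only a sketch – the paths and pid file are assumptions based on syslog-ng writing its merged files under /var/log/central/:

    # /etc/logrotate.d/central-syslog (sketch)
    /var/log/central/*.log {
        daily
        rotate 14
        compress
        delaycompress
        missingok
        notifempty
        postrotate
            # tell syslog-ng to reopen its files after rotation
            /bin/kill -HUP `cat /var/run/syslog-ng.pid 2>/dev/null` 2>/dev/null || true
        endscript
    }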

Other interesting observations

  1. The amount of change required in a Java app that is already using log4j to log to a syslog server is trivial.
  2. Logging to local files can be disabled, which means you don't have to worry about disk storage on each server.
  3. If you are using (or want to use) tools like Splunk or Hadoop/HBase for log analysis, syslog is probably the easiest way to get there.
  4. You can always load-balance syslog servers by using DNS load balancing.
  5. Apache webservers can't do syslog out of the box, but you can still make it happen (see the sketch after this list).
  6. I personally like HAProxy more, and it does do syslog out of the box.
  7. If you want to log events from startup/shutdown scripts, you can use the “logger” *nix command to send events to the syslog server (example after this list).
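
For points 5 and 7 above, both tricks come down to the *nix logger command; the tags and facilities here are just placeholders:

    # Apache access logs piped to syslog (goes in httpd.conf):
    #   CustomLog "|/usr/bin/logger -t apache_access -p local6.info" combined

    # From a startup/shutdown script:
    logger -t myapp-init -p local3.notice "myapp starting on $(hostname)"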

How are logs aggregated in your environment?

References