Scalable logging using Syslog
Syslog is a commonly used transport mechanism for system logs. But people sometimes forget it can be used for a lot of other purposes as well.
Take, for example, the interesting challenge of aggregating web server logs from 100 different servers onto one server and then figuring out how to merge them. If you have built your own tool to do this, you have probably figured out by now how expensive it is to poll all the servers and how out-of-date these logs can be by the time you process them. If you are not inserting them into some kind of datastore that sorts the rows by timestamp, you also have to take on the challenge of building a merge-sort script.
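To give a sense of what such a merge-sort script involves, here is a minimal Python sketch. It assumes each server's log is already sorted and that every line starts with an ISO 8601 timestamp; the sample lines and the names `parse_timestamp` and `merge_logs` are made up for illustration.

```python
import heapq
from datetime import datetime


def parse_timestamp(line):
    # Assumption: every line starts with an ISO 8601 timestamp followed by a space.
    return datetime.fromisoformat(line.split(" ", 1)[0])


def merge_logs(*sources):
    """Lazily merge already-sorted iterables of log lines into one time-ordered stream."""
    return heapq.merge(*sources, key=parse_timestamp)


# Hypothetical sample data standing in for two per-server log files.
web1 = [
    "2024-01-01T10:00:00 web1 GET /index.html 200",
    "2024-01-01T10:00:05 web1 GET /about.html 200",
]
web2 = [
    "2024-01-01T10:00:02 web2 GET /login 302",
    "2024-01-01T10:00:07 web2 GET /home 200",
]

merged = list(merge_logs(web1, web2))
for line in merged:
    print(line)
```

Even with the merge itself this short, you still have to collect the files from every server first, which is where the polling cost and the staleness come from.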
Nothing stops applications from using syslog as well. If your apps are in Java, you should try out the Syslog appender for log4j [Ref 1] [Ref 2]. Not only do you get central logging, you also get to see a real-time “tail -f” of events as they happen in a merged file. If there are issues anywhere in your network, you have just one place to look. If your logging volume is high, you will have to use other tools (or build your own) to do log analysis.
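The same idea works outside Java. As a rough illustration, the Python sketch below uses the standard library's `logging.handlers.SysLogHandler` to send application log records over UDP. The logger name, the `local0` facility, and the message format are arbitrary choices for the example, and a local UDP socket stands in for the central syslog server so the sketch can run on its own.

```python
import logging
import logging.handlers
import socket


def make_syslog_logger(name, host, port=514):
    """Return a logger that sends each record to a syslog server over UDP."""
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,  # arbitrary choice
        socktype=socket.SOCK_DGRAM,
    )
    handler.setFormatter(logging.Formatter("%(name)s: %(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.addHandler(handler)
    return logger


# Stand-in for the central syslog server: a UDP socket on a free local port.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(2.0)
port = receiver.getsockname()[1]

log = make_syslog_logger("webapp", "127.0.0.1", port)
log.info("user login succeeded")

# Read back the single datagram the handler sent.
datagram, _ = receiver.recvfrom(4096)
print(datagram)
receiver.close()
```

In a real deployment you would point `host` at your syslog server's hostname and drop the stand-in receiver.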
Here are some things you might have to think about if you plan to use syslog for your environment.
- Set up different syslog servers for each of your datacenters, using split DNS or different hostnames.
- Try not to send logs across WAN links
- Rotate logs on a nightly basis, or depending on the log volume
- Reduce the amount of logging (don’t log at “debug” level in production, for example)
- Write tools to detect changes in logging volume in dev/qa environments. If you follow good logging practice, you should be able to identify the components responsible for an increase very quickly.
- Identify log patterns that could be cause for concern and set up some kind of alerting using your regular monitoring service (Nagios, for example). Don’t be afraid to use 3rd party tools which do this very well.
- Syslog over UDP is non-blocking, but the syslog server can be overloaded if logging volume is not controlled. The most expensive part of logging is disk I/O. If you notice high i/o
- UDP doesn’t guarantee that every log event will make it to the syslog server. Find out if that level of uncertainty in logging is ok for your environment.
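A tool for spotting changes in logging volume does not need to be elaborate. The Python sketch below counts lines per component in a baseline sample and a current sample, then flags components whose volume at least doubled. The line format (timestamp, host, then a component name ending in a colon), the sample data, and the 2x threshold are all assumptions made for illustration.

```python
from collections import Counter


def count_by_component(lines):
    """Count log lines per component.

    Assumes the hypothetical format "<timestamp> <host> <component>: <message>".
    """
    counts = Counter()
    for line in lines:
        parts = line.split(" ", 3)
        if len(parts) >= 3:
            counts[parts[2].rstrip(":")] += 1
    return counts


def volume_increases(baseline, current, ratio=2.0):
    """Return components whose line count grew by at least `ratio` over the baseline."""
    flagged = {}
    for component, now in current.items():
        before = baseline.get(component, 0)
        # A component missing from the baseline is always flagged.
        if before == 0 or now / before >= ratio:
            flagged[component] = (before, now)
    return flagged


# Hypothetical samples, e.g. from the previous build and the current build in QA.
baseline_lines = [
    "2024-01-01T10:00:00 qa1 checkout: order placed",
    "2024-01-01T10:00:01 qa1 search: query ok",
]
current_lines = [
    "2024-01-02T10:00:00 qa1 checkout: order placed",
    "2024-01-02T10:00:01 qa1 checkout: retrying payment",
    "2024-01-02T10:00:02 qa1 checkout: retrying payment",
    "2024-01-02T10:00:03 qa1 search: query ok",
]

flagged = volume_increases(
    count_by_component(baseline_lines),
    count_by_component(current_lines),
)
print(flagged)
```

This only works if log lines consistently name the component that wrote them, which is one reason good logging practice matters here.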
Other interesting observations
- The changes required for a Java app which is already using log4j to log to a syslog server are trivial
- Logging to local files can be disabled, which means you don’t have to worry about disk storage on each server.
- If you are using or want to use tools like Splunk or Hadoop/HBase for log analysis, syslog is probably the easiest way to get there.
- You can always loadbalance syslog servers by using DNS loadbalancing.
- Apache webservers can’t do syslog out of the box, but you can still make it happen
- I personally like haproxy more and it does do syslog out of the box.
- If you want to log events from startup/shutdown scripts, you can use the “logger” *nix command to send events to the syslog server.
How are logs aggregated in your environment?
References
Comments
Rsyslog doesn't use UDP so there is no problem of losing logs. Even more, rsyslog buffers on disk when an upstream rsyslogd is down. All in all, rsyslog is the better syslog.