Reliable forwarding with rsyslog
Once you have more than 1 server (and maybe even with just 1) you need a central location to store your logs. Often this will be using a tool specifically designed for storing logs so you can easily search them and make them available to developers without needing access to systems. Whether this is via a massive central syslog server or a remote service like Papertrail, Loggly or Splunk, logging is important for many reasons. All our servers log to local files via rsyslog but we also have a remote action defined to send our logs to Papertrail using certificate authenticated TCP/TLS encryption.
Last week we encountered an unusual issue where rsyslog stopped logging at the same time on every single one of our servers within one of our data centres. This is over 70 nodes on Ubuntu 10.04 LTS and 12.04 LTS (therefore running 2 different rsyslog versions) and having varying use cases from extremely quiet MongoDB arbiters to verbose website monitoring control servers which schedule thousands of checks to our customer URLs across nodes in many geographic regions.
rsyslog was still running as a process; it had simply stopped logging, both remotely and to local files. We found a few discussions of this problem from 2012, 2011 and 2009, but none of them entirely covered our case. The common thread, however, was a connectivity issue causing problems with the queuing. Although not definitive, the fact that every server in just one of our data centres saw this problem pointed to a network issue which may have caused rsyslog to hang for all actions, including the disk based logging that does not depend on the network at all.
After discussing the issue with the Papertrail support guys, we decided to enable reliable forwarding to combat this. With it, rsyslog queues log lines in memory, and then on disk, if the remote server cannot be reached, and sends them when connectivity returns. This is necessary because syslog over TCP is not entirely reliable on its own. To enable it, add the following lines to your rsyslog.conf above the action line:
$ActionResumeRetryCount -1 # retry forever to log to papertrail
$ActionQueueType LinkedList # use asynchronous processing
$ActionQueueFileName /var/log/emergency_syslog # set file name, also enables disk mode
$ActionQueueMaxDiskSpace 500M # cap the total disk space used by the queue
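As a rough sketch of what "above the action line" means in practice: in rsyslog's legacy syntax, $Action... directives apply only to the next action defined, so they need to sit directly before the forwarding rule. The host and port below are placeholders (assumptions for illustration, not real Papertrail values), and the TLS settings are left out for brevity.

```
# Local file action, defined before the queue directives, so it does not pick them up.
*.* /var/log/syslog

# Queue directives for the next action only
# (plus the disk limit line from the example above).
$ActionResumeRetryCount -1
$ActionQueueType LinkedList
$ActionQueueFileName /var/log/emergency_syslog

# Hypothetical remote destination: @@ forwards over TCP, a single @ would use UDP.
*.* @@logs.example.com:514
```

Any action defined after the forwarding rule goes back to the defaults, so a second remote destination would need its own set of queue directives and its own queue file name.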
However, it is important to limit the disk space used:
Disk assisted queues are special in that they do not have any size limit. They enqueue an unlimited amount of elements. To prevent running out of space, disk and disk-assisted queues can be size-limited via the “$QueueMaxDiskSpace” configuration parameter. If it is not set, the limit is only available free space (and reaching this limit is currently not very gracefully handled, so avoid running into it!). If a limit is set, the queue can not grow larger than it.
This is what the final config line in the example above does, and the value should be adjusted based on how much disk space you have. This should give us more resilience against network issues, prevent loss of log lines and also stop the disk from suddenly filling up on all servers during a network problem.