It seems everywhere I go to work, I face the same operational problems. Once again, I must find a way to centralize logs and provide different levels of access to said logs. Sadly, the syslog protocol is getting quite aged, and it’s just not enough anymore. It works well if you have only a few machines, and only need to provide access to sysadmins. But when developers and other types of users are thrown into the mix, you need a more granular system.
Also, support for the syslog protocol varies greatly from daemon to daemon. One major culprit for me has always been Apache (and web servers in general), because out of the box it only supports syslog for error logs. For access logs, you can use different techniques, but no matter which one you use, you end up with the same problem: if you have more than one vhost on the machine, all their logs end up in the same syslog facility. You can obviously filter them after that, but that’s way more work than, say, email logs.
If you work for a company that has a lot of budget, you may consider getting Splunk. It’s a very good commercial product that also has a free version. But last I checked, it was priced at US$7,000 per half gig of logs indexed per day. When you have web servers generating several gigs of logs each per day, that could end up being very expensive. That money could be used to buy hardware to deploy more open source software, which I tend to prefer.
So after a week or so of investigation, testing and benchmarking, here are my findings. The architecture of the final setup is not settled yet, but it will more or less look like this.
NOTE: Keep in mind, this post won’t go into details about clustering and scaling. But all chosen products can achieve that. You should be able to come up with that on your own easily. I will concentrate on the tools and the workflow.
1. Systems
In order to avoid changing the syslog daemon on each server, I decided to keep either sysklogd or rsyslog intact, as it’s installed by default. And with a centralized configuration management system, it’s simple to add a new entry to send your logs to another machine. Something like this:
*.* @somehost.domain.tld
That will take care of most of the systems’ logs and send them to a central machine. But what about Apache? As far as error logs are concerned, it’s simple. You need to reconfigure Apache to send them to syslog. On RedHat-based systems, you want to edit /etc/httpd/conf/httpd.conf and on Debian-based systems (I might be wrong, I don’t have one handy with Apache installed) you’d have to modify /etc/apache2/apache2.conf or something like that.
ErrorLog syslog:local2
I use local2 as an example, but you could pick any facility. In any case, I recommend using one of the local* facilities, as later on, it will allow you to create filters and alerts based on that.
Now, as I was mentioning earlier, the issue with web servers is that you can have more than one vhost on a machine, and syslog was never designed with that in mind. So you will inevitably end up with all logs for a machine in the same facility. Apache supports piping logs to an external program. I found that the simplest was to use the logger tool. So either system-wide or per vhost, you can add something similar to your configuration:
CustomLog "|/bin/logger -p local2.info" combined
During benchmarks, that obviously added some overhead, but not much: maybe 5% more. You will have to plan your site’s capacity with that in mind. I think it’s well worth it for the operational advantages.
As far as Java applications are concerned, you can easily configure them to send to syslog using log4j.
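To make the "-p local2.info" argument less magical, here is a small sketch of how a facility and a severity are combined into the priority number that travels at the front of every syslog message. The numeric codes are the standard syslog ones; the function name is my own and only there for illustration.

```python
# Standard syslog facility codes for the local* facilities, and severity codes.
FACILITIES = {f"local{i}": 16 + i for i in range(8)}
SEVERITIES = {
    "emerg": 0, "alert": 1, "crit": 2, "err": 3,
    "warning": 4, "notice": 5, "info": 6, "debug": 7,
}


def priority(selector: str) -> int:
    """Turn a 'facility.severity' selector into the syslog priority value."""
    facility, severity = selector.split(".")
    # The priority is the facility code times 8, plus the severity code.
    return FACILITIES[facility] * 8 + SEVERITIES[severity]


access_log_priority = priority("local2.info")  # 18 * 8 + 6 = 150
```

So an access log line sent with local2.info goes over the wire prefixed with <150>, and that prefix is what lets the receiving end filter on the facility later on.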
2. Tailing log files (warning)
In the past, I sometimes would use tools that would tail log files and then forward them to a central syslog machine. That’s fine if you just need to archive on a file system somewhere. But there’s one important thing to know: what you get in a log file is just a string. It’s not a standard, properly formatted syslog message. Rsyslog is able to log real syslog messages to your files, but I don’t recommend it, as they’re harder to read. So if you want the webUI at the end of the proposed chain to work properly, you don’t want to tail log files. It’s better to use something like logger if possible.
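The difference is easy to see with a few lines of Python. This is only a sketch: the sample lines and the host name web01 are made up, and the function is hypothetical. A message received over the wire starts with a priority in angle brackets, while the same line read back from a file usually doesn’t, so the facility and severity are lost.

```python
import re

# A wire-format syslog message starts with "<PRI>", where PRI is 1 to 3 digits.
PRI_RE = re.compile(r"^<(\d{1,3})>(.*)$", re.DOTALL)


def split_priority(raw: str):
    """Return (facility, severity, rest), or None when there is no priority."""
    match = PRI_RE.match(raw)
    if match is None:
        return None
    facility, severity = divmod(int(match.group(1)), 8)
    return facility, severity, match.group(2)


# Hypothetical samples: the same event as sent by logger and as found in a file.
wire = "<150>Oct 11 22:14:15 web01 logger: GET /index.html 200"
from_file = "Oct 11 22:14:15 web01 logger: GET /index.html 200"

wire_fields = split_priority(wire)        # facility 18 (local2), severity 6 (info)
file_fields = split_priority(from_file)   # None: nothing left to filter on
```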
3. Logstash
Now we leave the legacy world and enter the present time of logging. Logstash is the Swiss Army knife of the logging world. It’s a very well designed application that can be used in either agent or server mode. In agent mode, you can configure different types of inputs and outputs. It supports a wide range of protocols like file, syslog, amqp, etc.
So here, I decided to use it with a syslog input to receive logs from our machines. It’s also easily load-balanceable with a layer 3 load-balancer. It then sends the logs to an exchange in RabbitMQ.
4. RabbitMQ
Now, this part is optional. You could send the logs straight from Logstash to Graylog2, but I prefer to have a middleman to do a bit of queuing. Also, once the messages enter the exchange in an AMQP server, you can route them to more than one queue, for different types of processing. In order to do that though, you need to use a fanout exchange.
Why RabbitMQ? Well, it’s written in Erlang and it’s very fast. During my benchmarks, it was processing between 4000 and 5000 messages per second during the peaks. Also, it’s easily clusterable in an elastic kind of way. And all operations can be done while the cluster is live. I also recommend you install the management plugins, as that will provide a very nice webUI to manage your stack. Often, UIs of the sort are limited, but in this case, everything that can be done on the CLI is doable with the webUI as well. And it’s very well designed and pleasant to the eye.
So at this point, my messages are entering an exchange named ‘syslog’ that routes messages to two different queues: graylog and elasticsearch.
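If the fanout idea is new to you, here is a toy in-memory model of it. To be clear, this is not a RabbitMQ client and none of these class or method names come from a real library; it only mimics the behaviour: every bound queue gets its own copy of each message, and consuming from one queue leaves the other untouched.

```python
from collections import deque


class FanoutExchange:
    """Toy stand-in for an AMQP fanout exchange (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.queues = {}

    def bind(self, queue_name):
        self.queues[queue_name] = deque()

    def publish(self, message):
        # A fanout exchange ignores routing keys: every bound queue gets a copy.
        for queue in self.queues.values():
            queue.append(message)

    def consume(self, queue_name):
        # Consuming removes the message from that one queue only.
        queue = self.queues[queue_name]
        return queue.popleft() if queue else None


exchange = FanoutExchange("syslog")
exchange.bind("graylog")
exchange.bind("elasticsearch")
exchange.publish("<150>web01 logger: GET / 200")

first = exchange.consume("graylog")
remaining = {name: len(queue) for name, queue in exchange.queues.items()}
```

After the consume call, the graylog queue is empty while the elasticsearch queue still holds its copy, which is exactly why each downstream system gets its own queue.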
NOTE: As I write this, version 2.5.1 has just been released. At this point in time, queues cannot be replicated in your cluster. So if you lose the node where the queue was created, you lose the queue. That said, you can query your queue from any node in the cluster. Support for replicated queues should be available soon. You could use DRBD though to cluster only one node. That would give you high-availability at the queue level.
5. Logstash (again)
Now, we’re almost ready to give access to our logs to our different users. At this step, logs are ready to be sent to Graylog2. So we will use another Logstash instance with an AMQP input that will read messages from our ‘graylog’ queue and forward them to Graylog2 using a GELF output. That’s the preferred protocol for importing messages into Graylog2. I won’t provide an example configuration for Logstash, as it’s really easy and straightforward to configure.
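To give an idea of what that conversion involves, here is a rough Python sketch of turning a syslog message into a GELF-style payload. This is not how Logstash does it internally. The field names follow the GELF format as I understand it (version, host and short_message being the required ones), the "1.0" version string is an assumption, and the sample message and host name are made up.

```python
import json
import re

# Matches "<PRI>Mon DD HH:MM:SS host message", the classic BSD syslog layout.
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<timestamp>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) (?P<message>.*)$"
)


def syslog_to_gelf(raw: str) -> dict:
    """Build a GELF-style dict from a wire-format syslog message."""
    match = SYSLOG_RE.match(raw)
    if match is None:
        raise ValueError("not a wire-format syslog message")
    facility, severity = divmod(int(match.group("pri")), 8)
    return {
        "version": "1.0",  # assumed GELF version string
        "host": match.group("host"),
        "short_message": match.group("message"),
        "level": severity,
        # Name the local* facilities, fall back to the number for the others.
        "facility": f"local{facility - 16}" if 16 <= facility <= 23 else str(facility),
    }


gelf = syslog_to_gelf("<150>Oct 11 22:14:15 web01 logger: GET /index.html 200")
payload = json.dumps(gelf)  # what would be handed to the GELF transport
```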
6. Graylog2
This is where all the magic will happen. Graylog2 has two components: a daemon that receives logs, processes them, and inserts them in a capped collection in MongoDB, and a web interface to view them. Now, take a few minutes to go read about capped collections, as it’s important to understand them well. Basically, a capped collection works like a FIFO. So in order to take advantage of the speed inherent to a FIFO, you want to make sure that your capped collection fits in RAM. MongoDB will allow you to create a capped collection larger than that, but you get major performance degradation when your data reaches the amount of available RAM you have.
So with that taken into consideration, on my test machine, I created a roughly 5GB capped collection. With that, I was able to store more than 10 million messages in it. It’s important to know that Graylog2 is not meant to be used for archiving. Where it excels is in real-time (or close to it) view of your logs. Also, you can set up different alarms based on facilities, hosts and regex. They will then email you to alert you. Very cool. It allows you to be more proactive, and to detect issues a traditional monitoring system can’t find.
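If you want a feel for the FIFO behaviour without installing MongoDB, a bounded deque in Python behaves in a similar way. Keep in mind this is only an analogy: a real capped collection is bounded by its size in bytes (with an optional document count), while this sketch is bounded by a number of items.

```python
from collections import deque

# A deque with maxlen silently drops the oldest entry once it is full.
capped = deque(maxlen=3)

for n in range(1, 6):
    capped.append(f"message {n}")

kept = list(capped)  # only the three most recent messages are left
```

Messages 1 and 2 are gone without any explicit delete, which is also why you can’t count on a capped collection for archiving.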
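Conceptually, such an alarm is just a filter on the facility and host plus a regex on the message text. Here is a minimal sketch of that idea in Python. The function, the message fields and the sample data are hypothetical and are not Graylog2’s actual data model; MaxClients is just an example of something you might watch for in Apache logs.

```python
import re


def alarm_matches(message, pattern, facility=None, host=None):
    """Return True when a message passes the facility/host filters and the regex."""
    if facility is not None and message.get("facility") != facility:
        return False
    if host is not None and message.get("host") != host:
        return False
    return re.search(pattern, message.get("short_message", "")) is not None


# Hypothetical messages, shaped like the fields used earlier in this post.
hit = {
    "facility": "local2",
    "host": "web01",
    "short_message": "server reached MaxClients setting, consider raising it",
}
miss = {
    "facility": "local2",
    "host": "web01",
    "short_message": "GET /index.html 200",
}

should_alert = alarm_matches(hit, r"MaxClients", facility="local2")
should_stay_quiet = not alarm_matches(miss, r"MaxClients", facility="local2")
```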
7. Elasticsearch
Remember I mentioned two different queues? The reason is simple. Once a message is consumed in RabbitMQ, it’s not available anymore. It’s deleted. So you need more than one queue if you want to feed different systems. While Graylog2 is great for short-term analysis and real-time debugging, you can’t count on it for archiving. Enter Elasticsearch: a clusterable full-text indexer/search engine. It’s based on the Lucene project from the Apache Software Foundation. Its main goal is to be a very simple to use and configure, elastic search engine. And from my short tests with it, it lives up to that. It discovers new nodes using multicast. So basically, you power up a new node, and the cluster detects it, rebalances itself and voilà.
That’s where I plan to store my logs for long-term archiving. Logstash (is there anything it can’t do?), when run in server mode, provides a web interface to search them. You would again use an AMQP input and an Elasticsearch output to send them to Elasticsearch, then run another instance of Logstash in web mode to provide the webUI.
So that’s it. That’s a home-made Splunk-like system. Obviously, it’s more work to deploy, but it’s much cheaper, more flexible and open source. It will grow as needed by your infrastructure. You can use it to aggregate logs from servers, applications and networking equipment easily. And it provides granular access to your logs through Graylog2.

From graylog.conf:
# AMQP
amqp_enabled = false
amqp_subscribed_queues = somequeue1:gelf,somequeue2:gelf,somequeue3:syslog
amqp_host = localhost
amqp_port = 5672
amqp_username = guest
amqp_password = guest
amqp_virtualhost = /
So, you can skip step 5 and ask Graylog2 to listen directly to RabbitMQ.
You can’t actually. Not with the setup I described anyway. The reason is simple: the logs are sent in syslog format. If you connect directly to RabbitMQ from Graylog2, you will get one of the following errors, depending on whether you selected syslog or gelf as the queue type.
Error 1:
Could not handle GELF client: Missing GELF message parameters. version, short_message and host are required.
Error 2:
Could not handle GELF client: null
You need step 5 to act as a middleman to convert messages from syslog to gelf.
Maybe you can consider a nice alternative: Enterprise Log Search and Archive (ELSA).
I tried it a bit and it looks real cool. I will just need to dig deeper.
Interesting. I’ll have to look into that one.
Is graylog that much faster at making logs available/searchable than elasticsearch? IOW, couldn’t ES be both the archive and the near real time store?
Thx
Mark
Elasticsearch doesn’t have its own web interface as far as I can tell. Haven’t had much time yet to play with it, compared to the other tools. You need Logstash to access the logs stored in it. It’s like a search engine: if you don’t query it, you won’t get results. As opposed to Graylog2, where when you log in, you’re presented with your last logs. If you leave your browser open, you’ll be notified when you have new logs to view. Also, it supports the creation of various users that can have access to different logs based on streams. Streams are a subset of your incoming logs matching specific rules. For example, let’s say that you’re sending all your Apache logs to local2. You might want to create a stream that would only analyze that type of log and look for a regex matching ‘MaxClients’. So if it finds that, it would trigger an alarm and email you.
That’s something that’s not possible if you use only ElasticSearch.
Nice post, thanks – have been playing with these 2 myself and come to similar conclusions – logstash seems a lot more versatile, but I had some issues consuming syslog with it (centos syslog has never been anywhere near an RFC as far as I can make out).
On the sending to 2 queues issue: I think logstash can consume from an AMQP topic exchange, and I just (like 2 minutes ago) wrote some support to have graylog2 do the same thing ( https://github.com/rasputnik/graylog2-server/commits/develop ). So you should be able to just send your messages once (to the exchange) and have the 2 of them pick up the ones they care about.
Actually, it’s not really Centos but Sysklogd that’s the culprit here. In Centos 6.0, the default logger is rsyslog. It allows you to log the real syslog message to files. It looks very different than the normal output we’re used to. It’s understandable why they don’t normally log the real message though, it’s much harder to read, and unless you want to centralize your logs and use something like Graylog2, it’s totally useless.
I wrote my own syslog daemon that listens on UDP 514 and /dev/log, sends everything to the graylog server and keeps a copy in /var/log/somewhere.
This simply replaced rsyslog and syslogd,
and I keep klogd only to redirect kernel logs to syslog.
It’s based on gevent and written in Python, so it’s very fast. I kicked out logstash as the agent on non-graylog2 servers.
Because it’s based on Python logging, it’s easy for me to add a new output (such as AMQP) without hacking the code, just by changing the configuration file.
You should publish the code on GitHub :)