1 files changed, 281 insertions, 0 deletions
diff --git a/source/whitepapers/syslog_parsing.rst b/source/whitepapers/syslog_parsing.rst
new file mode 100644
index 0000000..3e3f2de
--- /dev/null
+++ b/source/whitepapers/syslog_parsing.rst
@@ -0,0 +1,281 @@
+syslog parsing in rsyslog
+=========================
+
+*Written by* `Rainer Gerhards <https://rainer.gerhards.net/>`_,
+`Großrinderfeld <https://www.rainer-gerhards.de/grossrinderfeld/>`_
+*(2008-09-23)*
+
+**We regularly receive messages asking why**
+`rsyslog <http://www.rsyslog.com>`_ **parses this or that message
+incorrectly.** Of course, it turns out that rsyslog does the right
+thing, but the message sender does not. And also of course, this is not
+even of the slightest help to the end user experiencing the problem ;).
+So I thought I write this paper. It describes the problem source and
+shows potential solutions (aha!).
+
+Syslog Standardization
+----------------------
+
+The syslog protocol has not been standardized until relatively
+recently. The first document "smelling" a bit like a standard is
+:rfc:`3164`, which dates back to August
+2001. The problem is that this document is no real standard. It has
+assigned "informational" status by the `IETF <http://www.ietf.org>`_
+which means it provides some hopefully useful information but does not
+demand anything. It is impossible to "comply" to an informational
+document. This, of course, doesn't stop marketing guys from telling they
+comply to RFC3164 and it also does not stop some techs to tell you "this
+and that does not comply to RFC3164, so it is <anybody else but them>'s
+fault".
+
+Then, there is :rfc:`3195`, which
+is a real standard. In it's section 3 it makes (a somewhat questionable)
+reference to (informational) RFC 3164 which may be interpreted in a way
+that RFC3195 standardizes the format layed out in RFC 3164 by virtue of
+referencing them. So RFC3195 seems to extend its standardization domain
+to the concepts layed out in RFC 3164 (which is why I tend to find that
+reference questionable). In that sense, RFC3195 standardizes the format
+informationally described in RFC3164, Section 4. But it demands it only
+for the scope of RFC3195, which is syslog over BEEP - and NOT syslog
+over UDP. So one may argue whether or not the RFC3164 format could be
+considered a standard for any non-BEEP (including UDP) syslog, too. In
+the strict view I tend to have, it does not. Refering to the RFC3195
+context usually does not help, because there are virtually no RFC3195
+implementations available (at this time, I would consider this RFC a
+failure).
+
+Now let's for a short moment assume that RFC3195 would somehow be able
+to demand RFC3164 format for non-BEEP syslog. So we could use RFC3164
+format as a standard. But does that really help? Let's cite RFC 3164,
+right at the beginning of section 4 (actually, this is the first
+sentence):
+
+    ::
+
+           The payload of any IP packet that has a UDP destination port of 514
+           MUST be treated as a syslog message. 
+
+
+        Think a bit about it: this means that whatever is send to port 514 must be considered
+        a valid syslog message. No format at all is demanded. So if "this is junk" is sent to
+        UDP port 514 - voila, we have a valid message (interestingly, it is no longer a syslog
+        message if it is sent to port 515 ;)). You may now argue that I am overdoing. So let's
+        cite RFC 3164, Section 5.4, Example 2:
+
+
+          Example 2
+
+                Use the BFG!
+
+           While this is a valid message, it has extraordinarily little useful
+           information.
+
+As you can see, RFC3164 explicitly states that no format at all is
+required.
+
+Now a side-note is due: all of this does not mean that the RFC3164
+authors did not know what they were doing. No, right the contrary is
+true: RFC3164 mission is to describe what has been seen in practice as
+syslog messages and the conclusion is quite right that there is no
+common understanding on the message format. This is also the reason why
+RFC3164 is an informational document: it provides useful information,
+but does not precisely specify anything.
+
+After all of this bashing, I now have to admit that RFC3164 has some
+format recommendations layed out in section 4. The format described has
+quite some value in it and implementors recently try to follow it. This
+format is usually meant when someone tells you that a software is
+"RFC3164 compliant" or expects "RFC3164 compliant messages". I also have
+to admit that rsyslog also uses this format and, in the sense outlined
+here, expects messages received to be "RFC3164 compliant" (knowingly
+that such a beast does not exist - I am simply lying here ;)).
+
+Please note that there is some relief of the situation in reach. There
+is a new normative syslog RFC series upcoming, and it specifies a
+standard message format. At the time of this writing, the main documents
+are sitting in the RFC editor queue waiting for a transport mapping to
+be completed. I personally expect them to be assigned RFC numbers in
+2009.
+
+Update: the numbers are now assigned and the base RFC is :rfc:`5424`.
+
+Practical Format Requirements
+-----------------------------
+
+From a practical point of view, the message format expected (and
+generated by default in legacy mode) is:
+
+::
+
+    <PRI>TIMESTAMP SP HOST SP TAG MSG(Freetext)
+
+SP is the ASCII "space" character and the definition of the rest of the
+fields can be taken from RFC3164. Please note that there also is a lot
+of confusion on what syntax and semantics the TAG actually has. This
+format is called "legacy syslog" because it is not well specified (as
+you know by now) and has been "inherited from the real world".
+
+Rsyslog offers two parsers: one for the upcoming RFC series and one for
+legacy format. We concentrate on the later. That parser applies some
+logic to detect missing hostnames, is able to handle various ways the
+TIMESTAMP is typically malformed. In short it applies a lot of guesswork
+in trying to figure out what a message really means. I am sure the
+guessing algorithm can be improved, and I am always trying that when I
+see new malformed messages (and there is an ample set of them...).
+However, this finds its limits where it is not possible to differentiate
+between two entities which could be either. For example, look at this
+message:
+
+::
+
+    <144>Tue Sep 23 11:40:01 taghost sample message
+
+Does it contain a hostname? Maybe. The value "taghost" is a valid
+hostname. Of course, it is also a valid tag. If it is a hostname, the
+tag's value is "sample" and the msg value is "message". Or is the
+hostname missing, the tag is "taghost" and msg is "sample message"? As a
+human, I tend to say the later interpretation is correct. But that's
+hard to tell the message parser (and, no, I do not intend to apply
+artificial intelligence just to guess what the hostname value is...).
+
+One approach is to configure the parser so that it never expects
+hostnames. This becomes problematic if you receive messages from
+multiple devices. Over time, I may implement parser conditionals, but
+this is not yet available and I am not really sure if it is needed
+complexity...
+
+Things like this, happen. Even more scary formats happen in practice.
+Even from mainstream vendors. For example, I was just asked about this
+message (which, btw, finally made me write this article here):
+
+::
+
+    "<130> [ERROR] iapp_socket_task.c 399: iappSocketTask: iappRecvPkt returned error"
+
+If you compare it with the format RFC3164 "suggests", you'll quickly
+notice that the message is "a bit" malformed. Actually, even my human
+intelligence is not sufficient to guess if there is a TAG or not (is
+"[ERROR]" a tag or part of the message). I may not be the smartest guy,
+but don't expect me to program a parser that is smarter than me.
+
+To the best of my knowledge, these vendor's device's syslog format can
+be configured, so it would probably be a good idea to include a
+(sufficiently well-formed) timestamp, the sending hostname and (maybe?)
+a tag to make this message well parseable. I will also once again take
+this sample and see if we can apply some guesswork. For example, "[" can
+not be part of a well-formed TIMESTAMP, so logic can conclude there is
+not TIMESTAMP. Also, "[" can not be used inside a valid hostname, so
+logic can conclude that the message contains no hostname. Even if I
+implement this logic (which I will probably do), this is a partial
+solution: it is impossible to guess if there is a tag or not
+(honestly!). And, even worse, it is a solution only for those set of
+messages that can be handled by the logic described. Now consider this
+hypothetical message:
+
+::
+
+    "<130> [ERROR] host.example.net 2008-09-23 11-40-22 PST iapp_socket_task.c 399: iappSocketTask: iappRecvPkt returned error"
+
+Obviously, it requires additional guesswork. If we iterate over all the
+cases, we can very quickly see that it is impossible to guess everything
+correct. In the example above we can not even surely tell if PST should
+be a timezone or some other message property.
+
+A potential solution is to generate a parser-table based parser, but
+this requires considerable effort and also has quite some runtime
+overhead. I try to avoid this for now (but I may do it, especially if
+someone sponsors this work ;)). Side-note: if you want to be a bit
+scared about potential formats, you may want to have a look at my paper
+"`On the Nature of Syslog
+Data <http://www.monitorware.com/en/workinprogress/nature-of-syslog-data.php>`_\ ".
+
+Work-Around
+-----------
+
+**The number one work-around is to configure your devices so that they
+emit (sufficiently) well-formed messages.** You should by now know what
+these look like.
+
+If that cure is not available, there are some things you can do in
+rsyslog to handle the situation. First of all, be sure to read about
+:doc:`rsyslog.conf format <../configuration/basic_structure>` and the
+:doc:`property replacer <../configuration/property_replacer>` specifically. You need to
+understand that everything is configured in rsyslog. And that the
+message is parsed into properties. There are also properties available
+which do not stem back directly to parsing. Most importantly, %fromhost%
+property holds the name of the system rsyslog received the message from.
+In non-relay cases, this can be used instead of hostname. In relay
+cases, there is no cure other than to either fix the original sender or
+at least one of the relays in front of the rsyslog instance in question.
+Similarly, you can use %timegenerated% instead of %timereported%.
+Timegenerated is the time the message hit rsyslog for the first time.
+For non-relayed, locally connected peers, Timegenerated should be a very
+close approximation of the actual time a message was formed at the
+sender (depending, of course, on potential internal queueing inside the
+sender). Also, you may use the %rawmsg% property together with the
+several extraction modes the property replacer supports. Rawmsg contains
+the message as it is received from the remote peer. In a sense, you can
+implement a post-parser with this method.
+
+To use these properties, you need to define your own templates and
+assign them. Details can be found in the above-quoted documentation.
+Just let's do a quick example. Let's say you have the horrible message
+shown above and can not fix the sending device for some good reason. In
+rsyslog.conf, you used to say:
+
+::
+
+    *.* /var/log/somefile
+
+Of course, things do not work out well with that ill-formed message. So
+you decide to dump the rawmsg to the file and pull the remote host and
+time of message generation from rsyslog's internal properties (which,
+btw, is clever, because otherwise there is no indication of these two
+properties...). So you need to define a template for that and make sure
+the template is used with your file logging action. This is how it may
+look:
+
+::
+
+    $template, MalfromedMsgFormater,"%timegenerated% %fromhost% %rawmsg:::drop-last-lf%\n"
+    *.* /var/log/somefile;MalformedMsgFormatter
+
+This will make your log much nicer, but not look perfect. Experiment a
+bit with the available properties and replacer extraction options to
+fine-tune it to your needs.
+
+The Ultimate Solution...
+------------------------
+
+Is available with rsyslog 5.3.4 and above. Here, we can define so-called
+custom parsers. These are plugin modules, written in C and adapted to a
+specific message format need. The big plus of custom parsers is that
+they offer excellent performance and unlimited possibilities - far
+better than any work-around could do. Custom parsers can be `bound to
+specific rule sets <rsconf1_rulesetparser.html>`_ (and thus listening)
+ports with relative ease. The only con is that they must be written.
+However, if you are lucky, a parser for your device may already exist.
+If not, you can opt to write it yourself, what is not too hard if you
+know some C. Alternatively, Adiscon can program one for you as part of
+the `rsyslog professional services
+offering <http://www.rsyslog.com/professional-services>`_. In any case,
+you should seriously consider custom parsers as an alternative if you
+can not reconfigure your device to send decent message format.
+
+Wrap-Up
+-------
+
+Syslog message format is not sufficiently standardized. There exists a
+weak "standard" format, which is used by a good number of
+implementations. However, there exist many others, including mainstream
+vendor implementations, which have a (sometimes horribly) different
+format. Rsyslog tries to deal with anomalies but can not guess right in
+all instances. If possible, the sender should be configured to submit
+well-formed messages. If that is not possible, you can work around these
+issues with rsyslog's property replacer and template system. Or you can
+use a suitable message parser or write one for your needs.
+
+I hope this is a useful guide. You may also have a look at the `rsyslog
+troubleshooting guide <troubleshoot.html>`_ for further help and places
+where to ask questions.
+