diff options
author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-15 16:27:18 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-15 16:27:18 +0000 |
commit | f7f20c3f5e0be02585741f5f54d198689ccd7866 (patch) | |
tree | 190d5e080f6cbcc40560b0ceaccfd883cb3faa01 /source/concepts/messageparser.rst | |
parent | Initial commit. (diff) | |
download | rsyslog-doc-f7f20c3f5e0be02585741f5f54d198689ccd7866.tar.xz rsyslog-doc-f7f20c3f5e0be02585741f5f54d198689ccd7866.zip |
Adding upstream version 8.2402.0+dfsg.upstream/8.2402.0+dfsg
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'source/concepts/messageparser.rst')
-rw-r--r-- | source/concepts/messageparser.rst | 304 |
1 files changed, 304 insertions, 0 deletions
diff --git a/source/concepts/messageparser.rst b/source/concepts/messageparser.rst new file mode 100644 index 0000000..f9886be --- /dev/null +++ b/source/concepts/messageparser.rst @@ -0,0 +1,304 @@ +Message parsers in rsyslog +========================== + +Written by `Rainer Gerhards <http://www.gerhards.net/rainer>`_ +(2009-11-06) + +Intro +----- + +Message parsers are a feature of rsyslog 5.3.4 and above. In this +article, I describe what message parsers are, what they can do and how +they relate to the relevant standards. I will also describe what you can +not do with time. Finally, I give some advice on implementing your own +custom parser. + +What are message parsers? +------------------------- + +Well, the quick answer is that message parsers are the component of +rsyslog that parses the syslog message after it is being received. Prior +to rsyslog 5.3.4, message parsers where built in into the rsyslog core +itself and could not be modified (other than by modifying the rsyslog +code). + +In 5.3.4, we changed that: message parsers are now loadable modules +(just like input and output modules). That means that new message +parsers can be added without modifying the rsyslog core, even without +contributing something back to the project. + +But that doesn't answer what a message parser really is. What does it +mean to "parse a message" and, maybe more importantly, what is a +message? To answer these questions correctly, we need to dig down into +the relevant standards. `RFC5424 <http://tools.ietf.org/html/rfc5424>`_ +specifies a layered architecture for the syslog protocol: + +.. figure:: rfc5424layers.png + :align: center + :alt: RFC5424 syslog protocol layers + + RFC5424 syslog protocol layers + +For us important is the distinction between the syslog transport and the +upper layers. The transport layer specifies how a stream of messages is +assembled at the sender side and how this stream of messages is +disassembled into the individual messages at the receiver side. In +networking terminology, this is called "framing". The core idea is that +each message is put into a so-called "frame", which then is transmitted +over the communications link. + +The framing used is depending on the protocol. For example, in UDP the +"frame"-equivalent is a packet that is being sent (this also means that +no two messages can travel within a single UDP packet). In "plain tcp +syslog", the industry standard, LF is used as a frame delimiter (which +also means that no multi-line message can properly be transmitted, a +"design" flaw in plain tcp syslog). In +`RFC5425 <http://tools.ietf.org/html/rfc5425>`_ there is a header in +front of each frame that contains the size of the message. With this +framing, any message content can properly be transferred. + +And now comes the important part: **message parsers do NOT operate at +the transport layer**, they operate, as their name implies, on messages. +So we can not use message parsers to change the underlying framing. For +example, if a sender splits (for whatever reason) a single message into +two and encapsulates these into two frames, there is no way a message +parser could undo that. + +A typical example may be a multi-line message: let's assume some +originator has generated a message for the format "A\\nB" (where \\n +means LF). If that message is being transmitted via plain tcp syslog, +the frame delimiter is LF. So the sender will delimit the frame with LF, +but otherwise send the message unmodified onto the wire (because that is +how things are -unfortunately- done in plain tcp syslog...). So wire +will see "A\\nB\\n". When this arrives at the receiver, the transport +layer will undo the framing. When it sees the LF after A, it thinks it +finds a valid frame delimiter (in fact, this is the correct view!). So +the receive will extract one complete message A and one complete message +B, not knowing that they once were both part of a large multi-line +message. These two messages are then passed to the upper layers, where +the message parsers receive them and extract information. However, the +message parsers never know (or even have a chance to see) that A and B +belonged together. Even further, in rsyslog there is no guarantee that A +will be parsed before B - concurrent operations may cause the reverse +order (and do so very validly). + +The important lesson is: **message parsers can not be used to fix a +broken framing**. You need a full protocol implementation to do that, +what is the domain of input and output modules. + +I have now told you what you can not do with message parsers. But what +they are good for? Thankfully, broken framing is not the primary problem +of the syslog world. A wealth of different formats is. Unfortunately, +many real-world implementations violate the relevant standards in one +way or another. That makes it often very hard to extract meaningful +information from a message or to process messages from different sources +by the same rules. In my article `syslog parsing in +rsyslog <syslog_parsing.html>`_ I have elaborated on all the real-world +evil that you can usually see. So I won't repeat that here. But in +short, the real problem is not the framing, but how to make malformed +messages well-looking. + +**This is what message parsers permit you to do: take a (well-known) +malformed message, parse it according to its semantics and generate +perfectly valid internal message representations from it.** So as long +as messages are consistently in the same wrong format (and they usually +are!), a message parser can look at that format, parse it, and make the +message processable just like it were well formed in the first place. +Plus, one can abuse the interface to do some other "interesting" tricks, +but that would take us to far. + +While this functionality may not sound exciting, it actually solves a +very big issue (that you only really understand if you have managed a +system with various different syslog sources). Note that we were often +able to process malformed messages in the past with the help of the +property replacer and regular expressions. While this is nice, it has a +performance hit. A message parser is a C code, compiled to native +language, and thus typically much faster than any regular expression +based method (depending, of course, on the quality of the +implementation...). + +How are message parsers used? +----------------------------- + +In a simplified view, rsyslog + +#. first receives messages (via the input module), +#. *then parses them (at the message level!)* and +#. then processes them (operating on the internal message + representation). + +Message parsers are utilized in the second step (written in italics). +Thus, they take the raw message (NOT frame!) received from the remote +system and create the internal structure out of it that the other parts +of rsyslog need in order to perform their processing. Parsing is vital, +because an unparsed message can not be processed in the third stage, the +actual application-level processing (like forwarding or writing to +files). + +Parser Chains and how they Operate +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Rsyslog chains parsers together to provide flexibility. A **parser +chain** contains all parsers that can potentially be used to parse a +message. It is assumed that there is some way a parser can detect if the +message it is being presented is supported by it. If so, the parser will +tell the rsyslog engine and parse the message. The rsyslog engine now +calls each parser inside the chain (in sequence!) until the first parser +is able to parse the message. After one parser has been found, the +message is considered parsed and no others parsers are called on that +message. + +Side-note: this method implies there are some "not-so-dirty" tricks +available to modify the message by a parser module that declares itself +as "unable to parse" but still does some message modification. This was +not a primary design goal, but may be utilized, and the interface +probably extended, to support generic filter modules. These would need +to go to the root of the parser chain. As mentioned, the current system +already supports this. + +The position inside the parser chain can be thought of as a priority: +parser sitting earlier in the chain take precedence over those sitting +later in it. So more specific parser should go earlier in the chain. A +good example of how this works is the default parser set provided by +rsyslog: rsyslog.rfc5424 and rsyslog.rfc3164, each one parses according +to the rfc that has named it. RFC5424 was designed to be distinguishable +from RFC3164 message by the sequence "1 " immediately after the +so-called PRI-part (don't worry about these words, it is sufficient if +you understand there is a well-defined sequence used to identify RFC5424 +messages). In contrary, RFC3164 actually permits everything as a valid +message. Thus the RFC3164 parser will always parse a message, sometimes +with quite unexpected outcome (there is a lot of guesswork involved in +that parser, which unfortunately is unavoidable due to existing +technology limits). So the default parser chain is to try the RFC5424 +parser first and after it the RFC3164 parser. If we have a +5424-formatted message, that parser will identify and parse it and the +rsyslog engine will stop processing. But if we receive a legacy syslog +message, the RFC5424 will detect that it can not parse it, return this +status to the engine which then calls the next parser inside the chain. +That usually happens to be the RFC3164 parser, which will always process +the message. But there could also be any other parser inside the chain, +and then each one would be called unless one that is able to parse can +be found. + +If we reversed the parser order, RFC5424 messages would incorrectly +parsed. Why? Because the RFC3164 parser will always parse every message, +so if it were asked first, it would parse (and misinterpret) the +5424-formatted message, return it did so and the rsyslog engine would +never call the 5424 parser. So order of sequence is very important. + +What happens if no parser in the chain could parse a message? Well, then +we could not obtain the in-memory representation that is needed to +further process the message. In that case, rsyslog has no other choice +than to discard the message. If it does so, it will emit a warning +message, but only in the first 1,000 incidents. This limit is a safety +measure against message-loops, which otherwise could quickly result from +a parser chain misconfiguration. **If you do not tolerate loss of +unparsable messages, you must ensure that each message can be parsed.** +You can easily achieve this by always using the "rsyslog-rfc3164" parser +as the *last* parser inside parser chains. That may result in invalid +parsing, but you will have a chance to see the invalid message (in debug +mode, a warning message will be written to the debug log each time a +message is dropped due to inability to parse it). + +Where are parser chains used? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +We now know what parser chains are and how they operate. The question is +now how many parser chains can be active and how it is decided which +parser chain is used on which message. This is controlled via +:doc:`rsyslog's rulesets <multi_ruleset>`. In short, multiple rulesets can be +defined and there always exist at least one ruleset. +A parser chain is bound to a +specific ruleset. This is done by virtue of defining parsers via the +:doc:`$RulesetParser <../configuration/ruleset/rsconf1_rulesetparser>` +configuration directive +(for specifics, see there). If no such directive is specified, the +default parser chain is used. As of this writing, the default parser +chain always consists of "rsyslog.rfc5424", "rsyslog.rfc3164", in that +order. As soon as a parser is configured, the default list is cleared +and the new parser is added to the end of the (initially empty) +ruleset's parser chain. + +The important point to know is that parser chains are defined on a +per-ruleset basis. + +Can I use different parser chains for different devices? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The correct answer is: generally yes, but it depends. First of all, +remember that input modules (and specific listeners) may be bound to +specific rulesets. As parser chains "reside" in rulesets, binding to a +ruleset also binds to the parser chain that is bound to that ruleset. As +a number one prerequisite, the input module must support binding to +different rulesets. Not all do, but their number is growing. For +example, the important `imudp <imudp.html>`_ and `imtcp <imtcp.html>`_ +input modules support that functionality. Those that do not (for example +`im3195 <im3195>`_) can only utilize the default ruleset and thus the +parser chain defined in that ruleset. + +If you do not know if the input module in question supports ruleset +binding, check its documentation page. Those that support it have the +required directives. + +Note that it is currently under evaluation if rsyslog will support +binding parser chains to specific inputs directly, without depending on +the ruleset. There are some concerns that this may not be necessary but +adds considerable complexity to the configuration. So this may or may +not be possible in the future. In any case, if we decide to add it, +input modules need to support it, so this functionality would require +some time to implement. + +The cookbook recipe for using different parsers for different devices +is given as an actual in-depth example in the +`$RulesetParser` configuration directive +doc page. In short, it is accomplished by defining specific rulesets for +the required parser chains, defining different listener ports for each +of the devices with different format and binding these listeners to the +correct ruleset (and thus parser chains). Using that approach, a variety +of different message formats can be supported via a single rsyslog +instance. + +Which message parsers are available +----------------------------------- + +As of this writing, there exist only two message parsers, one for +RFC5424 format and one for legacy syslog (loosely described in +`RFC3164 <http://tools.ietf.org/html/rfc3164>`_). These parsers are +built-in and must not be explicitly loaded. However, message parsers can +be added with relative ease by anyone knowing to code in C. Then, they +can be loaded via $ModLoad just like any other loadable module. It is +expected that the rsyslog project will be contributed additional message +parsers over time, so that at some point there hopefully is a rich +choice of them (I intend to add a browsable repository as soon as new +parsers pop up). + +How to write a message parser? +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +As a prerequisite, you need to know the exact format that the device is +sending. Then, you need moderate C coding skills, and a little bit of +rsyslog internals. I guess the rsyslog specific part should not be that +hard, as almost all information can be gained from the existing parsers. +They are rather simple in structure and can be found under the "./tools" +directory. They are named pmrfc3164.c and pmrfc5424.c. You need to +follow the usual loadable module guidelines. It is my expectation that +writing a parser should typically not take longer than a single day, +with maybe a day more to get acquainted with rsyslog. Of course, I am +not sure if the number is actually right. + +If you can not program or have no time to do it, Adiscon can also write +a message parser for you as part of the `rsyslog professional services +offering <http://www.rsyslog.com/professional-services>`_. + +Conclusion +---------- + +Malformed syslog messages are a pain and unfortunately often seen in +practice. Message parsers provide a fast and efficient solution for this +problem. Different parsers can be defined for different devices, and +they all convert message information into rsyslog's well-defined +internal format. Message parsers were first introduced in rsyslog 5.3.4 +and also offer some interesting ideas that may be explored in the future +- up to full message normalization capabilities. It is strongly +recommended that anyone with a heterogeneous environment take a look at +message parser capabilities. |