Diffstat (limited to 'doc/README.xml-output')
-rw-r--r-- | doc/README.xml-output | 253 |
1 files changed, 253 insertions, 0 deletions
diff --git a/doc/README.xml-output b/doc/README.xml-output
new file mode 100644
index 00000000..1ab5f3f9
--- /dev/null
+++ b/doc/README.xml-output
@@ -0,0 +1,253 @@
+Protocol Dissection in XML Format
+=================================
+Copyright (c) 2003 by Gilbert Ramirez <gram@alumni.rice.edu>
+
+Wireshark has the ability to export its protocol dissection in an
+XML format; tshark has similar functionality by using the "-Tpdml"
+option.
+
+The XML that Wireshark produces follows the Packet Details Markup
+Language (PDML) specified by the group at the Politecnico Di Torino
+working on Analyzer. The specification was found at:
+
+http://analyzer.polito.it/30alpha/docs/dissectors/PDMLSpec.htm
+
+That URL is not working anymore, but a copy can be found at the Internet
+Archive:
+
+https://web.archive.org/web/20050305174853/http://analyzer.polito.it/30alpha/docs/dissectors/PDMLSpec.htm
+
+This is similar to the NetPDL language specification:
+
+http://www.nbee.org/doku.php?id=netpdl:index
+
+The domain registration there has also expired, but an Internet Archive
+copy is also available at:
+
+https://web.archive.org/web/20160305211810/http://nbee.org/doku.php?id=netpdl:index
+
+A related XML format, the Packet Summary Markup Language (PSML), is
+also defined by the Analyzer group to provide packet summary information.
+The PSML format is not documented in a publicly-available HTML document,
+but its format is simple. Wireshark can export this format too, and
+tshark can produce it with the "-Tpsml" option.
+
+PDML
+====
+The PDML that Wireshark produces is known not to be loadable into Analyzer.
+It causes Analyzer to crash. As such, the PDML that Wireshark produces
+is labeled with a version number of "0", which means that the PDML does
+not fully follow the PDML spec. Furthermore, a creator attribute in the
+"<pdml>" tag gives the version number of wireshark/tshark that produced the
+PDML.
+
+In that way, as the PDML produced by Wireshark matures, but still does not
+meet the PDML spec, scripts can make intelligent decisions about how to
+best parse the PDML, based on the "creator" attribute.
+
+A PDML file is delimited by a "<pdml>" tag.
+A PDML file contains multiple packets, denoted by the "<packet>" tag.
+A packet will contain multiple protocols, denoted by the "<proto>" tag.
+A protocol might contain one or more fields, denoted by the "<field>" tag.
+
+A pseudo-protocol named "geninfo" is produced, as is required by the PDML
+spec, and exported as the first protocol after the opening "<packet>" tag.
+Its information comes from wireshark's "frame" protocol, which serves
+a similar purpose of storing packet meta-data. Both "geninfo" and
+"frame" protocols are provided in the PDML output.
+
+The "<pdml>" tag
+================
+Example:
+    <pdml version="0" creator="wireshark/0.9.17">
+
+The creator is "wireshark" version 0.9.17. The name refers to the "wireshark"
+engine: it will always say "wireshark", not "tshark".
+
+
+The "<proto>" tag
+=================
+"<proto>" tags can have the following attributes:
+
+    name - the display filter name for the protocol
+    showname - the label used to describe this protocol in the protocol
+        tree. This is usually the descriptive name of the protocol,
+        but it can be modified by dissectors to include more data
+        (tcp can do this)
+    pos - the starting offset within the packet data where this
+        protocol starts
+    size - the number of octets in the packet data that this protocol
+        covers.
+
+The "<field>" tag
+=================
+"<field>" tags can have the following attributes:
+
+    name - the display filter name for the field
+    showname - the label used to describe this field in the protocol
+        tree. This is usually the descriptive name of the field,
+        followed by some representation of the value.
+    pos - the starting offset within the packet data where this
+        field starts
+    size - the number of octets in the packet data that this field
+        covers.
+    value - the actual packet data, in hex, that this field covers
+    show - the representation of the packet data ('value') as it would
+        appear in a display filter.
+
+
+Deviations from the PDML standard
+=================================
+Various dissectors parse packets in a way that does not fit all the assumptions
+in the PDML specification. In some cases Wireshark adjusts the output to match
+the spec more closely, but exceptions exist.
+
+Some dissectors place text into the protocol tree without using
+a field with a field-name. Those appear in PDML as "<field>" tags with no
+'name' attribute, but with a 'show' attribute giving that text.
+
+Some dissectors place field items at the top level instead of inside a
+protocol. In these cases, in the PDML output the field items are placed
+inside a fake "<proto>" element named "fake-field-wrapper" in order to
+maximize compliance.
+
+Many dissectors label the undissected payload of a protocol as belonging
+to a "data" protocol, and the "data" protocol often resides inside
+the last protocol dissected. In the PDML, the "data" protocol becomes
+a "data" field, placed exactly where the "data" protocol is in Wireshark's
+protocol tree. So, if Wireshark would normally show:
+
++-- Frame
+|
++-- Ethernet
+|
++-- IP
+|
++-- TCP
+|
++-- HTTP
+    |
+    +-- Data
+
+In PDML, the "Data" protocol would become another field under HTTP:
+
+<packet>
+    <proto name="frame">
+        ...
+    </proto>
+
+    <proto name="eth">
+        ...
+    </proto>
+
+    <proto name="ip">
+        ...
+    </proto>
+
+    <proto name="tcp">
+        ...
+    </proto>
+
+    <proto name="http">
+        ...
+        <field name="data" value="........."/>
+    </proto>
+</packet>
+
+In cases where the "data" protocol appears at the top level, it is
+still converted to a field, and placed inside the "fake-field-wrapper"
+protocol, just as any other top level field.
+
+Similarly, expert info items in Wireshark belong to an internal protocol
+named "_ws.expert", which is likewise converted into a "<field>" element
+of that name.
+
+Some dissectors also place subdissected protocols in a subtree instead of
+at the top level. Unlike with the "data" protocol, the PDML output does
+_not_ change these protocols to fields, but rather outputs them as "<proto>"
+elements. This results in well-formed XML that does, however, violate the
+PDML spec, as "<proto>" elements should only appear as direct children of
+"<packet>" elements, with only "<field>" elements nested therein.
+
+Note that the "<packet>" tag may have nonstandard color attributes, "foreground" and "background".
+
+
+tools/WiresharkXML.py
+=====================
+This is a Python module which provides some infrastructure for
+Python developers who wish to parse PDML. It is designed to read
+a PDML file and call a user's callback function every time a packet
+is constructed from the protocols and fields for a single packet.
+
+The Python user should import the module, define a callback function
+which accepts one argument, and call the parse_fh function:
+
+------------------------------------------------------------
+import WiresharkXML
+
+def my_callback(packet):
+    pass  # do something with the packet
+
+# If the PDML is stored in a file, you can:
+fh = open(xml_filename)
+WiresharkXML.parse_fh(fh, my_callback)
+
+# or, if the PDML is contained within a string, you can:
+WiresharkXML.parse_string(my_string, my_callback)
+
+# Now that the script has the packet data, do something.
+------------------------------------------------------------
+
+The object that is passed to the callback function is a
+WiresharkXML.Packet object, which corresponds to a single packet.
+WiresharkXML provides 3 classes, each of which corresponds to a PDML tag:
+
+    Packet - "<packet>" tag
+    Protocol - "<proto>" tag
+    Field - "<field>" tag
+
+Each of these classes has accessors which will return the defined attributes:
+
+    get_name()
+    get_showname()
+    get_pos()
+    get_size()
+    get_value()
+    get_show()
+
+Protocols and fields can contain other fields. Thus, the Protocol and
+Field classes have a "children" member, which is a simple list of the
+Field objects, if any, that are contained. The "children" list can be
+directly accessed by code using the object. The "children" list will be
+empty if this Protocol or Field contains no Fields.
+
+Furthermore, the Packet class is a sub-class of the PacketList class.
+The PacketList class provides methods to look for protocols and fields.
+The term "item" is used when the item being looked for can be
+a protocol or a field:
+
+    item_exists(name) - checks if an item exists in the PacketList
+    get_items(name) - returns a PacketList of all matching items
+
+
+General Notes
+=============
+Generally, parsing XML is slow. If you're writing a script to parse
+the PDML output of tshark, pass a read filter with "-R" to tshark to
+reduce as much as possible the number of packets coming out of tshark.
+The less your script has to process, the faster it will be.
+
+tools/msnchat
+=============
+tools/msnchat is a sample Python program that uses WiresharkXML to parse
+PDML. Given one or more capture files, it runs tshark on each of them,
+providing a read filter to reduce tshark's output. It finds MSN Chat
+conversations in the capture file and produces nice HTML showing the
+conversations.
+It has only been tested with capture files containing
+non-simultaneous chat sessions, but was written to more-or-less handle any
+number of simultaneous chat sessions.
+
+pdml2html.xsl
+=============
+pdml2html.xsl is an XSLT file to convert PDML files into HTML.
+See https://gitlab.com/wireshark/wireshark/-/wikis/PDML for more details.
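
As an illustration of the PDML layout this README describes ("<pdml>" containing
"<packet>", "<proto>" and "<field>" elements), the following is a minimal sketch
that walks a PDML string using only Python's standard xml.etree.ElementTree
module. It does not use tools/WiresharkXML.py. The SAMPLE_PDML text and the
summarize() helper are hypothetical, hand-written for this example; real tshark
output carries many more protocols, fields and attributes.

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written sample (an assumption for illustration only).
SAMPLE_PDML = """\
<pdml version="0" creator="wireshark/0.9.17">
  <packet>
    <proto name="geninfo">
      <field name="num" show="1"/>
    </proto>
    <proto name="http">
      <field show="GET / HTTP/1.1"/>
      <field name="data" value="474554202f20"/>
    </proto>
  </packet>
</pdml>
"""


def summarize(pdml_text):
    """Return (creator, packets).

    Each packet is a list of (protocol name, [field names]) tuples.
    Fields that have no 'name' attribute are reported as "".
    """
    root = ET.fromstring(pdml_text)
    creator = root.get("creator")
    packets = []
    for packet in root.iter("packet"):
        protos = []
        # iter() visits descendants too, so a <proto> nested in a subtree
        # is listed as well; its fields then also appear under its parent.
        for proto in packet.iter("proto"):
            names = [field.get("name", "") for field in proto.iter("field")]
            protos.append((proto.get("name"), names))
        packets.append(protos)
    return creator, packets


if __name__ == "__main__":
    creator, packets = summarize(SAMPLE_PDML)
    print("creator:", creator)
    for number, protos in enumerate(packets, start=1):
        for proto_name, field_names in protos:
            print(number, proto_name, field_names)
```

A script could branch on the returned creator string before deciding how to
interpret the rest of the file. For large captures, loading the whole document
with ET.fromstring() may be too slow or memory-hungry; an incremental parser
such as ET.iterparse() is one possible alternative.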