Protocol Dissection in XML Format ================================= Copyright (c) 2003 by Gilbert Ramirez <gram@alumni.rice.edu> Wireshark has the ability to export its protocol dissection in an XML format, tshark has similar functionality by using the "-Tpdml" option. The XML that Wireshark produces follows the Packet Details Markup Language (PDML) specified by the group at the Politecnico Di Torino working on Analyzer. The specification was found at: http://analyzer.polito.it/30alpha/docs/dissectors/PDMLSpec.htm That URL is not working anymore, but a copy can be found at the Internet Archive: https://web.archive.org/web/20050305174853/http://analyzer.polito.it/30alpha/docs/dissectors/PDMLSpec.htm This is similar to the NetPDL language specification: http://www.nbee.org/doku.php?id=netpdl:index The domain registration there has also expired, but an Internet Archive copy is also available at: https://web.archive.org/web/20160305211810/http://nbee.org/doku.php?id=netpdl:index A related XML format, the Packet Summary Markup Language (PSML), is also defined by the Analyzer group to provide packet summary information. The PSML format is not documented in a publicly-available HTML document, but its format is simple. Wireshark can export this format too, and tshark can produce it with the "-Tpsml" option. PDML ==== The PDML that Wireshark produces is known not to be loadable into Analyzer. It causes Analyzer to crash. As such, the PDML that Wireshark produces is labeled with a version number of "0", which means that the PDML does not fully follow the PDML spec. Furthermore, a creator attribute in the "<pdml>" tag gives the version number of wireshark/tshark that produced the PDML. In that way, as the PDML produced by Wireshark matures, but still does not meet the PDML spec, scripts can make intelligent decisions about how to best parse the PDML, based on the "creator" attribute. A PDML file is delimited by a "<pdml>" tag. A PDML file contains multiple packets, denoted by the "<packet>" tag. A packet will contain multiple protocols, denoted by the "<proto>" tag. A protocol might contain one or more fields, denoted by the "<field>" tag. A pseudo-protocol named "geninfo" is produced, as is required by the PDML spec, and exported as the first protocol after the opening "<packet>" tag. Its information comes from wireshark's "frame" protocol, which serves the similar purpose of storing packet meta-data. Both "geninfo" and "frame" protocols are provided in the PDML output. The "<pdml>" tag ================ Example: <pdml version="0" creator="wireshark/0.9.17"> The creator is "wireshark" (i.e., the "wireshark" engine. It will always say "wireshark", not "tshark") version 0.9.17. The "<proto>" tag ================= "<proto>" tags can have the following attributes: name - the display filter name for the protocol showname - the label used to describe this protocol in the protocol tree. This is usually the descriptive name of the protocol, but it can be modified by dissectors to include more data (tcp can do this) pos - the starting offset within the packet data where this protocol starts size - the number of octets in the packet data that this protocol covers. The "<field>" tag ================= "<field>" tags can have the following attributes: name - the display filter name for the field showname - the label used to describe this field in the protocol tree. This is usually the descriptive name of the protocol, followed by some representation of the value. pos - the starting offset within the packet data where this field starts size - the number of octets in the packet data that this field covers. value - the actual packet data, in hex, that this field covers show - the representation of the packet data ('value') as it would appear in a display filter. Deviations from the PDML standard ================================= Various dissectors parse packets in a way that does not fit all the assumptions in the PDML specification. In some cases Wireshark adjusts the output to match the spec more closely, but exceptions exist. Some dissectors sometimes place text into the protocol tree, without using a field with a field-name. Those appear in PDML as "<field>" tags with no 'name' attribute, but with a 'show' attribute giving that text. Some dissectors place field items at the top level instead of inside a protocol. In these cases, in the PDML output the field items are placed inside a fake "<proto>" element named "fake-field-wrapper" in order to maximize compliance. Many dissectors label the undissected payload of a protocol as belonging to a "data" protocol, and the "data" protocol often resides inside that last protocol dissected. In the PDML, the "data" protocol becomes a "data" field, placed exactly where the "data" protocol is in Wireshark's protocol tree. So, if Wireshark would normally show: +-- Frame | +-- Ethernet | +-- IP | +-- TCP | +-- HTTP | +-- Data In PDML, the "Data" protocol would become another field under HTTP: <packet> <proto name="frame"> ... </proto> <proto name="eth"> ... </proto> <proto name="ip"> ... </proto> <proto name="tcp"> ... </proto> <proto name="http"> ... <field name="data" value="........."/> </proto> </packet> In cases where the "data" protocol appears at the top level, it is still converted to a field, and placed inside the "fake-field-wrapper" protocol, just as any other top level field. Similarly, expert info items in Wireshark belong to an internal protocol named "_ws.expert", which is likewise converted into a "<field>" element of that name. Some dissectors also place subdissected protocols in a subtree instead of at the top level. Unlike with the "data" protocol, the PDML output does _not_ change these protocols to fields, but rather outputs them as "<proto>" elements. This results in well-formed XML that does, however, violate the PDML spec, as "<proto>" elements should only appear as direct children of "<packet>" elements, with only "<field>" elements nested therein. Note that packet tag may have nonstandard color attributes, "foreground" and "background" tools/WiresharkXML.py ==================== This is a python module which provides some infrastructure for Python developers who wish to parse PDML. It is designed to read a PDML file and call a user's callback function every time a packet is constructed from the protocols and fields for a single packet. The python user should import the module, define a callback function which accepts one argument, and call the parse_fh function: ------------------------------------------------------------ import WiresharkXML def my_callback(packet): # do something # If the PDML is stored in a file, you can: fh = open(xml_filename) WiresharkXML.parse_fh(fh, my_callback) # or, if the PDML is contained within a string, you can: WiresharkXML.parse_string(my_string, my_callback) # Now that the script has the packet data, do something. ------------------------------------------------------------ The object that is passed to the callback function is an WiresharkXML.Packet object, which corresponds to a single packet. WiresharkXML Provides 3 classes, each of which corresponds to a PDML tag: Packet - "<packet>" tag Protocol - "<proto>" tag Field - "<field>" tag Each of these classes has accessors which will return the defined attributes: get_name() get_showname() get_pos() get_size() get_value() get_show() Protocols and fields can contain other fields. Thus, the Protocol and Field class have a "children" member, which is a simple list of the Field objects, if any, that are contained. The "children" list can be directly accessed by code using the object. The "children" list will be empty if this Protocol or Field contains no Fields. Furthermore, the Packet class is a sub-class of the PacketList class. The PacketList class provides methods to look for protocols and fields. The term "item" is used when the item being looked for can be a protocol or a field: item_exists(name) - checks if an item exists in the PacketList get_items(name) - returns a PacketList of all matching items General Notes ============= Generally, parsing XML is slow. If you're writing a script to parse the PDML output of tshark, pass a read filter with "-R" to tshark to try to reduce as much as possible the number of packets coming out of tshark. The less your script has to process, the faster it will be. tools/msnchat ============= tools/msnchat is a sample Python program that uses WiresharkXML to parse PDML. Given one or more capture files, it runs tshark on each of them, providing a read filter to reduce tshark's output. It finds MSN Chat conversations in the capture file and produces nice HTML showing the conversations. It has only been tested with capture files containing non-simultaneous chat sessions, but was written to more-or-less handle any number of simultaneous chat sessions. pdml2html.xsl ============= pdml2html.xsl is a XSLT file to convert PDML files into HTML. See https://gitlab.com/wireshark/wireshark/-/wikis/PDML for more details.