summaryrefslogtreecommitdiffstats
path: root/doc/README.xml-output
diff options
context:
space:
mode:
Diffstat (limited to 'doc/README.xml-output')
-rw-r--r--doc/README.xml-output253
1 files changed, 253 insertions, 0 deletions
diff --git a/doc/README.xml-output b/doc/README.xml-output
new file mode 100644
index 00000000..1ab5f3f9
--- /dev/null
+++ b/doc/README.xml-output
@@ -0,0 +1,253 @@
+Protocol Dissection in XML Format
+=================================
+Copyright (c) 2003 by Gilbert Ramirez <gram@alumni.rice.edu>
+
+Wireshark has the ability to export its protocol dissection in an
+XML format, tshark has similar functionality by using the "-Tpdml"
+option.
+
+The XML that Wireshark produces follows the Packet Details Markup
+Language (PDML) specified by the group at the Politecnico Di Torino
+working on Analyzer. The specification was found at:
+
+http://analyzer.polito.it/30alpha/docs/dissectors/PDMLSpec.htm
+
+That URL is not working anymore, but a copy can be found at the Internet
+Archive:
+
+https://web.archive.org/web/20050305174853/http://analyzer.polito.it/30alpha/docs/dissectors/PDMLSpec.htm
+
+This is similar to the NetPDL language specification:
+
+http://www.nbee.org/doku.php?id=netpdl:index
+
+The domain registration there has also expired, but an Internet Archive
+copy is also available at:
+
+https://web.archive.org/web/20160305211810/http://nbee.org/doku.php?id=netpdl:index
+
+A related XML format, the Packet Summary Markup Language (PSML), is
+also defined by the Analyzer group to provide packet summary information.
+The PSML format is not documented in a publicly-available HTML document,
+but its format is simple. Wireshark can export this format too, and
+tshark can produce it with the "-Tpsml" option.
+
+PDML
+====
+The PDML that Wireshark produces is known not to be loadable into Analyzer.
+It causes Analyzer to crash. As such, the PDML that Wireshark produces
+is labeled with a version number of "0", which means that the PDML does
+not fully follow the PDML spec. Furthermore, a creator attribute in the
+"<pdml>" tag gives the version number of wireshark/tshark that produced the
+PDML.
+
+In that way, as the PDML produced by Wireshark matures, but still does not
+meet the PDML spec, scripts can make intelligent decisions about how to
+best parse the PDML, based on the "creator" attribute.
+
+A PDML file is delimited by a "<pdml>" tag.
+A PDML file contains multiple packets, denoted by the "<packet>" tag.
+A packet will contain multiple protocols, denoted by the "<proto>" tag.
+A protocol might contain one or more fields, denoted by the "<field>" tag.
+
+A pseudo-protocol named "geninfo" is produced, as is required by the PDML
+spec, and exported as the first protocol after the opening "<packet>" tag.
+Its information comes from wireshark's "frame" protocol, which serves
+the similar purpose of storing packet meta-data. Both "geninfo" and
+"frame" protocols are provided in the PDML output.
+
+The "<pdml>" tag
+================
+Example:
+ <pdml version="0" creator="wireshark/0.9.17">
+
+The creator is "wireshark" (i.e., the "wireshark" engine. It will always say
+"wireshark", not "tshark") version 0.9.17.
+
+
+The "<proto>" tag
+=================
+"<proto>" tags can have the following attributes:
+
+ name - the display filter name for the protocol
+ showname - the label used to describe this protocol in the protocol
+ tree. This is usually the descriptive name of the protocol,
+ but it can be modified by dissectors to include more data
+ (tcp can do this)
+ pos - the starting offset within the packet data where this
+ protocol starts
+ size - the number of octets in the packet data that this protocol
+ covers.
+
+The "<field>" tag
+=================
+"<field>" tags can have the following attributes:
+
+ name - the display filter name for the field
+ showname - the label used to describe this field in the protocol
+ tree. This is usually the descriptive name of the protocol,
+ followed by some representation of the value.
+ pos - the starting offset within the packet data where this
+ field starts
+ size - the number of octets in the packet data that this field
+ covers.
+ value - the actual packet data, in hex, that this field covers
+ show - the representation of the packet data ('value') as it would
+ appear in a display filter.
+
+
+Deviations from the PDML standard
+=================================
+Various dissectors parse packets in a way that does not fit all the assumptions
+in the PDML specification. In some cases Wireshark adjusts the output to match
+the spec more closely, but exceptions exist.
+
+Some dissectors sometimes place text into the protocol tree, without using
+a field with a field-name. Those appear in PDML as "<field>" tags with no
+'name' attribute, but with a 'show' attribute giving that text.
+
+Some dissectors place field items at the top level instead of inside a
+protocol. In these cases, in the PDML output the field items are placed
+inside a fake "<proto>" element named "fake-field-wrapper" in order to
+maximize compliance.
+
+Many dissectors label the undissected payload of a protocol as belonging
+to a "data" protocol, and the "data" protocol often resides inside
+that last protocol dissected. In the PDML, the "data" protocol becomes
+a "data" field, placed exactly where the "data" protocol is in Wireshark's
+protocol tree. So, if Wireshark would normally show:
+
++-- Frame
+|
++-- Ethernet
+|
++-- IP
+|
++-- TCP
+|
++-- HTTP
+ |
+ +-- Data
+
+In PDML, the "Data" protocol would become another field under HTTP:
+
+<packet>
+ <proto name="frame">
+ ...
+ </proto>
+
+ <proto name="eth">
+ ...
+ </proto>
+
+ <proto name="ip">
+ ...
+ </proto>
+
+ <proto name="tcp">
+ ...
+ </proto>
+
+ <proto name="http">
+ ...
+ <field name="data" value="........."/>
+ </proto>
+</packet>
+
+In cases where the "data" protocol appears at the top level, it is
+still converted to a field, and placed inside the "fake-field-wrapper"
+protocol, just as any other top level field.
+
+Similarly, expert info items in Wireshark belong to an internal protocol
+named "_ws.expert", which is likewise converted into a "<field>" element
+of that name.
+
+Some dissectors also place subdissected protocols in a subtree instead of
+at the top level. Unlike with the "data" protocol, the PDML output does
+_not_ change these protocols to fields, but rather outputs them as "<proto>"
+elements. This results in well-formed XML that does, however, violate the
+PDML spec, as "<proto>" elements should only appear as direct children of
+"<packet>" elements, with only "<field>" elements nested therein.
+
+Note that packet tag may have nonstandard color attributes, "foreground" and "background"
+
+
+tools/WiresharkXML.py
+====================
+This is a python module which provides some infrastructure for
+Python developers who wish to parse PDML. It is designed to read
+a PDML file and call a user's callback function every time a packet
+is constructed from the protocols and fields for a single packet.
+
+The python user should import the module, define a callback function
+which accepts one argument, and call the parse_fh function:
+
+------------------------------------------------------------
+import WiresharkXML
+
+def my_callback(packet):
+ # do something
+
+# If the PDML is stored in a file, you can:
+fh = open(xml_filename)
+WiresharkXML.parse_fh(fh, my_callback)
+
+# or, if the PDML is contained within a string, you can:
+WiresharkXML.parse_string(my_string, my_callback)
+
+# Now that the script has the packet data, do something.
+------------------------------------------------------------
+
+The object that is passed to the callback function is an
+WiresharkXML.Packet object, which corresponds to a single packet.
+WiresharkXML Provides 3 classes, each of which corresponds to a PDML tag:
+
+ Packet - "<packet>" tag
+ Protocol - "<proto>" tag
+ Field - "<field>" tag
+
+Each of these classes has accessors which will return the defined attributes:
+
+ get_name()
+ get_showname()
+ get_pos()
+ get_size()
+ get_value()
+ get_show()
+
+Protocols and fields can contain other fields. Thus, the Protocol and
+Field class have a "children" member, which is a simple list of the
+Field objects, if any, that are contained. The "children" list can be
+directly accessed by code using the object. The "children" list will be
+empty if this Protocol or Field contains no Fields.
+
+Furthermore, the Packet class is a sub-class of the PacketList class.
+The PacketList class provides methods to look for protocols and fields.
+The term "item" is used when the item being looked for can be
+a protocol or a field:
+
+ item_exists(name) - checks if an item exists in the PacketList
+ get_items(name) - returns a PacketList of all matching items
+
+
+General Notes
+=============
+Generally, parsing XML is slow. If you're writing a script to parse
+the PDML output of tshark, pass a read filter with "-R" to tshark to
+try to reduce as much as possible the number of packets coming out of tshark.
+The less your script has to process, the faster it will be.
+
+tools/msnchat
+=============
+tools/msnchat is a sample Python program that uses WiresharkXML to parse
+PDML. Given one or more capture files, it runs tshark on each of them,
+providing a read filter to reduce tshark's output. It finds MSN Chat
+conversations in the capture file and produces nice HTML showing the
+conversations. It has only been tested with capture files containing
+non-simultaneous chat sessions, but was written to more-or-less handle any
+number of simultaneous chat sessions.
+
+pdml2html.xsl
+=============
+pdml2html.xsl is a XSLT file to convert PDML files into HTML.
+See https://gitlab.com/wireshark/wireshark/-/wikis/PDML for more details.