diff options
Diffstat (limited to 'doc/README.developer')
-rw-r--r-- | doc/README.developer | 995 |
1 files changed, 995 insertions, 0 deletions
diff --git a/doc/README.developer b/doc/README.developer new file mode 100644 index 00000000..6d6ede97 --- /dev/null +++ b/doc/README.developer @@ -0,0 +1,995 @@ +This file is a HOWTO for Wireshark developers. It describes general development +and coding practices for contributing to Wireshark no matter which part of +Wireshark you want to work on. + +To learn how to write a dissector, read this first, then read the file +README.dissector. + +This file is compiled to give in depth information on Wireshark. +It is by no means all inclusive and complete. Please feel free to discuss on +the developer mailing list or upload merge requests to gitlab. + +0. Prerequisites. + +Before starting to develop a new dissector, a "running" Wireshark build +environment is required - there's no such thing as a standalone "dissector +build toolkit". + +How to setup such an environment is platform dependent; detailed +information about these steps can be found in the "Developer's Guide" +(available from: https://www.wireshark.org) and in the INSTALL and +README.md files of the sources root dir. + +0.1. General README files. + +You'll find additional information in the following README files: + +- doc/README.capture - the capture engine internals +- doc/README.design - Wireshark software design - incomplete +- doc/README.developer - this file +- doc/README.dissector - How to dissect a packet +- doc/README.display_filter - Display Filter Engine +- doc/README.idl2wrs - CORBA IDL converter +- doc/README.regression - regression testing of WS and TS +- doc/README.stats_tree - a tree statistics counting specific packets +- doc/README.tapping - "tap" a dissector to get protocol specific events +- doc/README.vagrant - how to create a development VM using vagrant +- doc/README.wslua - working with LUA +- doc/README.xml-output - how to work with the PDML exported output +- wiretap/README.developer - how to add additional capture file types to + Wiretap + +0.2. Dissector related README files. + +You'll find additional dissector related information in the file +README.dissector as well as the following README files: + +- doc/README.heuristic - what are heuristic dissectors and how to write + them +- doc/README.plugins - how to "pluginize" a dissector +- doc/README.request_response_tracking - how to track req./resp. times and such +- doc/README.wmem - how to obtain "memory leak free" memory + +0.3 Contributors + +James Coe <jammer[AT]cin.net> +Gilbert Ramirez <gram[AT]alumni.rice.edu> +Jeff Foster <jfoste[AT]woodward.com> +Olivier Abad <oabad[AT]cybercable.fr> +Laurent Deniel <laurent.deniel[AT]free.fr> +Gerald Combs <gerald[AT]wireshark.org> +Guy Harris <guy[AT]alum.mit.edu> +Ulf Lamping <ulf.lamping[AT]web.de> + +1. Portability. + +Wireshark runs on many platforms, and can be compiled with a number of +different compilers; here are some rules for writing code that will work +on multiple platforms. + +Building Wireshark requires a compiler that supports C11. This includes +reasonably recent versions of GCC and clang. Microsoft Visual Studio supports +C11 from Visual Studio 2019 version 16.8 and later. Support requires an updated +Universal C Runtime (UCRT) and Windows SDK version to work properly with the +conforming preprocessor. The minimum SDK version is 10.0.20348.0 (version 2104). + +The C11 has some optional parts that are not a requirement to build Wireshark. +In particular the following optional C11 features must NOT be used: + - Variable length arrays + - Bounds-checking interfaces (Annex K) + +We don't allow them because their value is questionable and requiring them +would exclude a lot of compilers and runtimes that we wish to support. + +Don't initialize global or static variables (variables with static +storage duration) in their declaration with non-constant values. This is not +permitted in C. E.g., if "i" is a static or global +variable, don't declare "i" as + + uint32_t i = somearray[2]; + +outside a function, or as + + static uint32_t i = somearray[2]; + +inside or outside a function, declare it as just + + uint32_t i; + +or + + static uint32_t i; + +and later, in code, initialize it with + + i = somearray[2]; + +instead. Initializations of variables with automatic storage duration - +i.e., local variables - with non-constant values is permitted, so, +within a function + + uint32_t i = somearray[2]; + +is allowed. + +Don't use zero-length arrays as structure members, use flexible array members +instead. + +Don't use "uchar", "u_char", "ushort", "u_short", "uint", "u_int", +"ulong", "u_long" or "boolean"; they aren't defined on all platforms. + +GLib typedefs have historically been used extensively throughout the +codebase (gchar, guint8, gint16, etc). We are moving towards the fixed +width integers provided in C since C99. These are defined in <stdint.h>, +which is included in <wireshark.h>. You should choose stdint types when +possible, but realise that until we can fully migrate our APIs, in many +situations the GLib types still make sense. + +If you want an 8-bit unsigned quantity, use "uint8_t"; if you want an +8-bit character value with the 8th bit not interpreted as a sign bit, +use "unsigned char"; if you want a 16-bit unsigned quantity, use "uint16_t"; +if you want a 32-bit unsigned quantity, use "uint32_t"; and if you want +an "int-sized" unsigned quantity, use "unsigned"; if you want a boolean, +use "bool" (defined in <stdbool.h>). You don't need to explicitly include +these headers; they are included in <wireshark.h>. Use that instead. + +To print fixed width integers you must use the macros provided in <inttypes.h>. + + uint32_t var; + printf("var = " PRIu32 "\n", var); + +Don't use "long" to mean "signed 32-bit integer", and don't use +"unsigned long" to mean "unsigned 32-bit integer"; "long"s are 64 bits +long on many platforms. Use "int32_t" for signed 32-bit integers and use +"uint32_t" for unsigned 32-bit integers. + +Don't use "long" to mean "signed 64-bit integer" and don't use "unsigned +long" to mean "unsigned 64-bit integer"; "long"s are 32 bits long on +many other platforms. Don't use "long long" or "unsigned long long", +either, as not all platforms support them; use "int64_t" or "uint64_t", +which will be defined as the appropriate types for 64-bit signed and +unsigned integers. + +On LLP64 data model systems (notably 64-bit Windows), "int" and "long" +are 32 bits while "size_t" and "ptrdiff_t" are 64 bits. This means that +the following will generate a compiler warning: + + int i; + i = strlen("hello, sailor"); /* Compiler warning */ + +Normally, you'd just make "i" a size_t. However, many GLib and Wireshark +functions won't accept a size_t on LLP64: + + size_t i; + char greeting[] = "hello, sailor"; + unsigned byte_after_greet; + + i = strlen(greeting); + byte_after_greet = tvb_get_guint8(tvb, i); /* Compiler warning */ + +Try to use the appropriate data type when you can. When you can't, you +will have to cast to a compatible data type, e.g. + + size_t i; + char greeting[] = "hello, sailor"; + uint8_t byte_after_greet; + + i = strlen(greeting); + byte_after_greet = tvb_get_guint8(tvb, (int) i); /* OK */ + +or + + int i; + char greeting[] = "hello, sailor"; + uint8_t byte_after_greet; + + i = (int) strlen(greeting); + byte_after_greet = tvb_get_guint8(tvb, i); /* OK */ + +See http://www.unix.org/version2/whatsnew/lp64_wp.html for more +information on the sizes of common types in different data models. + +A lot of legacy code still uses GLib types and I/O replacement API. These +should be gradually transitioned to use the standard interfaces provided in +C11. Sometimes it may be necessary to use an unsavory cast or two or abuse +a macro to bridge the two codebases during the transition. Such is life, +use your judgement and do the best possible under the circumstances. + +Avoid GLib synonyms like gchar and gint and especially don't +use gpointer and gconstpointer, unless you are writing GLib callbacks +and trying to match their signature exactly. These just obscure the +code and gconstpointer in particular is just semantically weird and poor style. + +When printing or displaying the values of 64-bit integral data types, +don't use "%lld", "%llu", "%llx", or "%llo" - not all platforms +support "%ll" for printing 64-bit integral data types. Instead use +the macros in <inttypes.h>, for example: + + proto_tree_add_uint64_format_value(tree, hf_uint64, tvb, offset, len, + val, "%" PRIu64, val); + +For GLib routines, and only those, you can choose whichever format style +you prefer: + + uint64_t val = UINT64_C(1); + char *str1 = g_string_printf("%" G_GUINT64_FORMAT, val); + char *str2 = g_string_printf("%" PRIu64, val); + +These format macros will be the same modulo any GLib bugs. + +When specifying an integral constant that doesn't fit in 32 bits, don't +use "LL" at the end of the constant - not all compilers use "LL" for +that. Instead, put the constant in a call to the "INT64_C()" or "UINT64_C()" +macro, e.g. + + INT64_C(-11644473600), UINT64_C(11644473600) + +rather than + + -11644473600LL, 11644473600ULL + +Don't assume that you can scan through a va_list initialized by va_start +more than once without closing it with va_end and re-initializing it with +va_start. This applies even if you're not scanning through it yourself, +but are calling a routine that scans through it, such as vfprintf() or +one of the routines in Wireshark that takes a format and a va_list as an +argument. You must do + + va_start(ap, format); + call_routine1(xxx, format, ap); + va_end(ap); + va_start(ap, format); + call_routine2(xxx, format, ap); + va_end(ap); + +rather than + + va_start(ap, format); + call_routine1(xxx, format, ap); + call_routine2(xxx, format, ap); + va_end(ap); + +Don't use a label without a statement following it. For example, +something such as + + if (...) { + + ... + + done: + } + +will not work with all compilers - you have to do + + if (...) { + + ... + + done: + ; + } + +with some statement, even if it's a null statement, after the label. +Preferably don't do it at all. + +Don't use "bzero()", "bcopy()", or "bcmp()"; instead, use the ANSI C +routines + + "memset()" (with zero as the second argument, so that it sets + all the bytes to zero); + + "memcpy()" or "memmove()" (note that the first and second + arguments to "memcpy()" are in the reverse order to the + arguments to "bcopy()"; note also that "bcopy()" is typically + guaranteed to work on overlapping memory regions, while + "memcpy()" isn't, so if you may be copying from one region to a + region that overlaps it, use "memmove()", not "memcpy()" - but + "memcpy()" might be faster as a result of not guaranteeing + correct operation on overlapping memory regions); + + and "memcmp()" (note that "memcmp()" returns 0, 1, or -1, doing + an ordered comparison, rather than just returning 0 for "equal" + and 1 for "not equal", as "bcmp()" does). + +Not all platforms necessarily have "bzero()"/"bcopy()"/"bcmp()", and +those that do might not declare them in the header file on which they're +declared on your platform. + +Don't use "index()" or "rindex()"; instead, use the ANSI C equivalents, +"strchr()" and "strrchr()". Not all platforms necessarily have +"index()" or "rindex()", and those that do might not declare them in the +header file on which they're declared on your platform. + +Don't use "tvb_get_ptr()". If you must use it, keep in mind that the pointer +returned by a call to "tvb_get_ptr()" is not guaranteed to be aligned on any +particular byte boundary; this means that you cannot safely cast it to any +data type other than a pointer to "char", "unsigned char", "guint8", or other +one-byte data types. Casting a pointer returned by tvb_get_ptr() into any +multi-byte data type or structure may cause crashes on some platforms (even +if it does not crash on x86-based PCs). Even if such mis-aligned accesses +don't crash on your platform they will be slower than properly aligned +accesses would be. Furthermore, the data in a packet is not necessarily in +the byte order of the machine on which Wireshark is running. Use the tvbuff +routines to extract individual items from the packet, or, better yet, use +"proto_tree_add_item()" and let it extract the items for you. + +Don't use structures that overlay packet data, or into which you copy +packet data; the C programming language does not guarantee any +particular alignment of fields within a structure, and even the +extensions that try to guarantee that are compiler-specific and not +necessarily supported by all compilers used to build Wireshark. Using +bitfields in those structures is even worse; the order of bitfields +is not guaranteed. + +Don't use "ntohs()", "ntohl()", "htons()", or "htonl()"; the header +files required to define or declare them differ between platforms, and +you might be able to get away with not including the appropriate header +file on your platform but that might not work on other platforms. +Instead, use "g_ntohs()", "g_ntohl()", "g_htons()", and "g_htonl()"; +those are declared by <glib.h>, and you'll need to include that anyway, +as Wireshark header files that all dissectors must include use stuff from +<glib.h>. + +Don't fetch a little-endian value using "tvb_get_ntohs() or +"tvb_get_ntohl()" and then using "g_ntohs()", "g_htons()", "g_ntohl()", +or "g_htonl()" on the resulting value - the g_ routines in question +convert between network byte order (big-endian) and *host* byte order, +not *little-endian* byte order; not all machines on which Wireshark runs +are little-endian, even though PCs are. Fetch those values using +"tvb_get_letohs()" and "tvb_get_letohl()". + +Do not use "open()", "rename()", "mkdir()", "stat()", "unlink()", "remove()", +"fopen()", "freopen()" directly. Instead use "ws_open()", "ws_rename()", +"ws_mkdir()", "ws_stat()", "ws_unlink()", "ws_remove()", "ws_fopen()", +"ws_freopen()": these wrapper functions change the path and file name from +UTF-8 to UTF-16 on Windows allowing the functions to work correctly when the +path or file name contain non-ASCII characters. + +Also, use ws_read(), ws_write(), ws_lseek(), ws_dup(), ws_fstat(), and +ws_fdopen(), rather than read(), write(), lseek(), dup(), fstat(), and +fdopen() on descriptors returned by ws_open(). + +Those functions are declared in <wsutil/file_util.h>; include that +header in any code that uses any of those routines. + +When opening a file with "ws_fopen()", "ws_freopen()", or "ws_fdopen()", if +the file contains ASCII text, use "r", "w", "a", and so on as the open mode +- but if it contains binary data, use "rb", "wb", and so on. On +Windows, if a file is opened in a text mode, writing a byte with the +value of octal 12 (newline) to the file causes two bytes, one with the +value octal 15 (carriage return) and one with the value octal 12, to be +written to the file, and causes bytes with the value octal 15 to be +discarded when reading the file (to translate between C's UNIX-style +lines that end with newline and Windows' DEC-style lines that end with +carriage return/line feed). + +In addition, that also means that when opening or creating a binary +file, you must use "ws_open()" (with O_CREAT and possibly O_TRUNC if the +file is to be created if it doesn't exist), and OR in the O_BINARY flag, +even on UN*X - O_BINARY is defined by <wsutil/file_util.h> as 0 on UN*X. + +Do not include <unistd.h>, <fcntl.h>, or <io.h> to declare any of the +routines listed as replaced by routines in <wsutil/file_util.h>; +instead, just include <wsutil/file_util.h>. + +If you need the declarations of other functions defined by <unistd.h>, +don't include it without protecting it with + + #ifdef HAVE_UNISTD_H + + ... + + #endif + +Don't use forward declarations of static arrays without a specified size +in a fashion such as this: + + static const value_string foo_vals[]; + + ... + + static const value_string foo_vals[] = { + { 0, "Red" }, + { 1, "Green" }, + { 2, "Blue" }, + { 0, NULL } + }; + +as some compilers will reject the first of those statements. Instead, +initialize the array at the point at which it's first declared, so that +the size is known. + +For #define names and enum member names, prefix the names with a tag so +as to avoid collisions with other names - this might be more of an issue +on Windows, as it appears to #define names such as DELETE and +OPTIONAL. + +Don't use the "positional parameters" extension that many UNIX printf's +implement, e.g.: + + snprintf(add_string, 30, " - (%1$d) (0x%1$04x)", value); + +as not all UNIX printf's implement it, and Windows printf doesn't appear +to implement it. Use something like + + snprintf(add_string, 30, " - (%d) (0x%04x)", value, value); + +instead. + +Don't use + + case N ... M: + +as that's not supported by all compilers. + +Prefer the C99 output functions from <stdio.h> instead of their GLib +replacements (note that positional format parameters are not part of C99). +In the past we used to recommend using g_snprintf() and g_vsnprintf() +instead but since Visual Studio 2015 native C99 implementations are +available on all platforms we support. These are optimized better than +the gnulib (GLib) implementation and on hot codepaths that can be a +noticeable difference in execution speed. + +tmpnam() -> mkstemp() +tmpnam is insecure and should not be used any more. Wireshark brings its +own mkstemp implementation for use on platforms that lack mkstemp. +Note: mkstemp does not accept NULL as a parameter. + +Wireshark requires minimum versions of each of the libraries it uses, in +particular GLib 2.54.0 and Qt 5.12.0 or newer. If you require a mechanism +that is available only in a newer version of a library then use its +version detection macros, e.g. "#if GLIB_CHECK_VERSION(...)" and "#if +QT_VERSION_CHECK(...)" to conditionally compile code using that +mechanism. + +When different code must be used on UN*X and Win32, use a #if or #ifdef +that tests _WIN32, not WIN32. Try to write code portably whenever +possible, however; note that there are some routines in Wireshark with +platform-dependent implementations and platform-independent APIs, such +as the routines in epan/filesystem.c, allowing the code that calls it to +be written portably without #ifdefs. + +We support building on Windows using MinGW-w64 (experimental) so be mindful +of the difference between an #ifdef on _WIN32 and _MSC_VER. The first tests +if the platform is some version of Windows and also applies to MinGW. The +latter tests if the toolchain is Microsoft Visual Studio. Sometimes you need +one or the other, depending on whether the condition applies to all Windows +compilers or only Microsoft's compiler. Use #ifdef __MINGW32__ to test for +a MinGW toolchain, including MinGW-w64. The same concern applies to CMake +code. Depending on the particular situation you may need to use if(WIN32) or +if(MSVC) or if(MINGW). + +Wireshark uses Libgcrypt as general-purpose crypto library. Some Wireshark +specific extensions are defined in wsutil/wsgcrypt.h. You might want to +include that file instead. + +2. String handling + +Do not use functions such as strcat() or strcpy(). +A lot of work has been done to remove the existing calls to these functions and +we do not want any new callers of these functions. + +Instead use snprintf() since that function will if used correctly prevent +buffer overflows for large strings. + +Be sure that all pointers passed to %s specifiers in format strings are non- +NULL. Some implementations will automatically replace NULL pointers with the +string "(NULL)", but most will not. + +When using a buffer to create a string, do not use a buffer stored on the stack. +I.e. do not use a buffer declared as + + char buffer[1024]; + +instead allocate a buffer dynamically using the string-specific or plain wmem +routines (see README.wmem) such as + + wmem_strbuf_t *strbuf; + strbuf = wmem_strbuf_new(pinfo->pool, ""); + wmem_strbuf_append_printf(strbuf, ... + +or + + char *buffer=NULL; + ... + #define MAX_BUFFER 1024 + buffer=wmem_alloc(pinfo->pool, MAX_BUFFER); + buffer[0]='\0'; + ... + snprintf(buffer, MAX_BUFFER, ... + +This avoids the stack from being corrupted in case there is a bug in your code +that accidentally writes beyond the end of the buffer. + + +If you write a routine that will create and return a pointer to a filled in +string and if that buffer will not be further processed or appended to after +the routine returns (except being added to the proto tree), +do not preallocate the buffer to fill in and pass as a parameter instead +pass a pointer to a pointer to the function and return a pointer to a +wmem-allocated buffer that will be automatically freed. (see README.wmem) + +I.e. do not write code such as + static void + foo_to_str(char *string, ... ){ + <fill in string> + } + ... + char buffer[1024]; + ... + foo_to_str(buffer, ... + proto_tree_add_string(... buffer ... + +instead write the code as + static void + foo_to_str(char **buffer, ... + #define MAX_BUFFER x + *buffer=wmem_alloc(pinfo->pool, MAX_BUFFER); + <fill in *buffer> + } + ... + char *buffer; + ... + foo_to_str(&buffer, ... + proto_tree_add_string(... *buffer ... + +Use wmem_ allocated buffers. They are very fast and nice. These buffers are all +automatically free()d when the dissection of the current packet ends so you +don't have to worry about free()ing them explicitly in order to not leak memory. +Please read README.wmem. + +Source files can use UTF-8 encoding, but characters outside the ASCII +range should be used sparingly. It should be safe to use non-ASCII +characters in comments and strings, but some compilers (such as GCC +versions prior to 10) may not support extended identifiers very well. +There is also no guarantee that a developer's text editor will interpret +the characters the way you intend them to be interpreted. + +The majority of Wireshark encodes strings as UTF-8. The main exception +is the code that uses the Qt API, which uses UTF-16. Console output is +UTF-8, but as with the source code extended characters should be used +sparingly since some consoles (most notably Windows' cmd.exe) have +limited support for UTF-8. + +3. Robustness. + +Wireshark is not guaranteed to read only network traces that contain correctly- +formed packets. Wireshark is commonly used to track down networking +problems, and the problems might be due to a buggy protocol implementation +sending out bad packets. + +Therefore, code does not only have to be able to handle +correctly-formed packets without, for example, crashing or looping +infinitely, they also have to be able to handle *incorrectly*-formed +packets without crashing or looping infinitely. + +Here are some suggestions for making code more robust in the face +of incorrectly-formed packets: + +Do *NOT* use "ws_assert()" or "ws_assert_not_reached()" with input data in dissectors. +*NO* value in a packet's data should be considered "wrong" in the sense +that it's a problem with the dissector if found; if it cannot do +anything else with a particular value from a packet's data, the +dissector should put into the protocol tree an indication that the +value is invalid, and should return. The "expert" mechanism should be +used for that purpose. + +Use assertions to catch logic errors in your program. A failed assertion +indicates a bug in the code. Use ws_assert() instead of g_assert() to +test a logic condition. Note that ws_assert() can be removed at compile +time. Therefore assertions should not have any side-effects, +otherwise the program may behave inconsistently. + +Use ws_assert_not_reached() instead of g_assert_not_reached() for +unreachable error conditions. For example if (and only if) you know +'myvar' can only have the values 1 and 2 do: + switch(myvar) { + case 1: + (...) + break; + case 2: + (...) + break; + default: + ws_assert_not_reached(); + break; + } + +For dissectors use DISSECTOR_ASSERT() and DISSECTOR_ASSERT_NOT_REACHED() +instead, with the same caveats as above. + +You should continue to use g_assert_true(), g_assert_cmpstr(), etc for +"test code", such as unit testing. These assertions are always active. +See the GLib Testing API documentation for the details on each of those +functions. + +If there is a case where you are checking not for an invalid data item +in the packet, but for a bug in the dissector (for example, an +assumption being made at a particular point in the code about the +internal state of the dissector), use the DISSECTOR_ASSERT macro for +that purpose; this will put into the protocol tree an indication that +the dissector has a bug in it, and will not crash the application. + +If you are allocating a chunk of memory to contain data from a packet, +or to contain information derived from data in a packet, and the size of +the chunk of memory is derived from a size field in the packet, make +sure all the data is present in the packet before allocating the buffer. +Doing so means that: + + 1) Wireshark won't leak that chunk of memory if an attempt to + fetch data not present in the packet throws an exception. + +and + + 2) it won't crash trying to allocate an absurdly-large chunk of + memory if the size field has a bogus large value. + +If you're fetching into such a chunk of memory a sequence of bytes from +the buffer, and the sequence has a specified size, you can use +"tvb_memdup()", which will check whether the entire sequence is present +before allocating a buffer for it. + +Otherwise, you can check whether the data is present by using +"tvb_ensure_bytes_exist()" although this frequently is not needed: the +TVB-accessor routines can handle requests to read data beyond the end of +the TVB (by throwing an exception which will either mark the frame as +truncated--not all the data was captured--or as malformed). + +If you're fetching a string only to add it to the tree, you should +generally be using "proto_tree_add_item()" instead. If you also need +the string, you can use the variant "proto_tree_add_item_ret_string()" +or "proto_tree_add_item_ret_string_and_length()" forms. + +If you must fetch it from the tvbuff, and the string has a specified +size and known encoding, you can use "tvb_get_string_enc()" for most +encodings, which will check whether the entire string is present before +allocating a buffer for the string, will put a trailing '\0' at the end +of the buffer, and will also check for invalid characters in the supplied +encoding and convert the string to UTF-8. The "tvb_get_*_string()" set of +functions is available as well, and must be used for some encodings, +primarily non byte aligned ones. If the string has a known encoding and +is null terminated, the "stringz" variants can be used. (Note that these +functions are called with memory allocators, and if called with a NULL +allocator you are required to free the string when finished with it.) + +If the string has a known encoding but requires token parsing or other +text manipulation to determine the offset and size, do so by calling +tvb_*() functions on the tvbuff that perform bounds checking if possible. +Only extract the bytes into a newly allocated buffer to extract a string +if absolutely necessary. If you do so, then you *must* ensure that the +string is valid UTF-8 when passing it to a libwireshark API function +such as proto_tree_add_string(). (Cf. 7.5: Unicode and string encoding +best practices.) + +Conversion to UTF-8 can produce a string with a length longer than +that of the string in the original packet data; this includes strings +encoded in ASCII or UTF-8 itself if they have invalid character sequences +that are replaced with the 3 byte UTF-8 REPLACEMENT CHARACTER. Truncating +a valid UTF-8 string to an arbitrary number of bytes does not guarantee +that the result is a valid UTF-8 string, because a multibyte character +might span the boundary. + +Note also that you should only fetch string data into a fixed-length +buffer if the code ensures that no more bytes than will fit into the +buffer are fetched ("the protocol ensures" isn't good enough, as +protocol specifications can't ensure only packets that conform to the +specification will be transmitted or that only packets for the protocol +in question will be interpreted as packets for that protocol by +Wireshark). + +If you have gotten a pointer using "tvb_get_ptr()" (which you should not +have: you should seriously consider a better alternative to this function), +you must make sure that you do not refer to any data past the length passed +as the last argument to "tvb_get_ptr()"; while the various "tvb_get" +routines perform bounds checking and throw an exception if you refer to data +not available in the tvbuff, direct references through a pointer gotten from +"tvb_get_ptr()" do not do any bounds checking. + +If you have a loop that dissects a sequence of items, each of which has +a length field, with the offset in the tvbuff advanced by the length of +the item, then, if the length field is the total length of the item, and +thus can be zero, you *MUST* check for a zero-length item and abort the +loop if you see one. Otherwise, a zero-length item could cause the +dissector to loop infinitely. You should also check that the offset, +after having the length added to it, is greater than the offset before +the length was added to it, if the length field is greater than 24 bits +long, so that, if the length value is *very* large and adding it to the +offset causes an overflow, that overflow is detected. + +If you have a + + for (i = {start}; i < {end}; i++) + +loop, make sure that the type of the loop index variable is large enough +to hold the maximum {end} value plus 1; otherwise, the loop index +variable can overflow before it ever reaches its maximum value. In +particular, be very careful when using int8_t, uint8_t, int16_t, or uint16_t +(or the deprecated Glib synonyms gint8, guint8, gint16, or guint16) +variables as loop indices; you almost always want to use an "int"/"gint" +or "unsigned"/"guint" as the loop index rather than a shorter type. + +If you are fetching a length field from the buffer, corresponding to the +length of a portion of the packet, and subtracting from that length a +value corresponding to the length of, for example, a header in the +packet portion in question, *ALWAYS* check that the value of the length +field is greater than or equal to the length you're subtracting from it, +and report an error in the packet and stop dissecting the packet if it's +less than the length you're subtracting from it. Otherwise, the +resulting length value will be negative, which will either cause errors +in the dissector or routines called by the dissector, or, if the value +is interpreted as an unsigned integer, will cause the value to be +interpreted as a very large positive value. + +Any tvbuff offset that is added to as processing is done on a packet +should be stored in a 32-bit variable, such as an "int"; if you store it +in an 8-bit or 16-bit variable, you run the risk of the variable +overflowing. + +sprintf() -> snprintf() +Prevent yourself from using the sprintf() function, as it does not test the +length of the given output buffer and might be writing into unintended memory +areas. This function is one of the main causes of security problems like buffer +exploits and many other bugs that are very hard to find. It's much better to +use the snprintf() function declared by <stdio.h> instead. + +You should test your dissector against incorrectly-formed packets. This +can be done using the randpkt and editcap utilities that come with the +Wireshark distribution. Testing using randpkt can be done by generating +output at the same layer as your protocol, and forcing Wireshark/TShark +to decode it as your protocol, e.g. if your protocol sits on top of UDP: + + randpkt -c 50000 -t dns randpkt.pcap + tshark -nVr randpkt.pcap -d udp.port==53,<myproto> + +Testing using editcap can be done using preexisting capture files and the +"-E" flag, which introduces errors in a capture file. E.g.: + + editcap -E 0.03 infile.pcap outfile.pcap + tshark -nVr outfile.pcap + +tools/fuzz-test.sh is available to help automate these tests. + +4. Name convention. + +Wireshark uses the underscore_convention rather than the InterCapConvention for +function names, so new code should probably use underscores rather than +intercaps for functions and variable names. This is especially important if you +are writing code that will be called from outside your code. We are just +trying to keep things consistent for other developers. + +C symbols exported from libraries shipped with Wireshark should start with a +prefix that helps avoiding name collision with public symbols from other shared +libraries. The current suggested prefixes for newly added symbols are +ws_, wslua_, wmem_ and wtap_. + +5. White space convention. + +Most of the C and C++ files in Wireshark use 4-space or 2-space indentation. +When creating new files you are you are strongly encouraged to use 4-space +indentation for source code in order to ensure consistency between files. + +Please avoid using tab expansions different from 8 column widths, as not all +text editors in use by the developers support this. For a detailed discussion +of tabs, spaces, and indentation, see + + http://www.jwz.org/doc/tabs-vs-spaces.html + +We use EditorConfig (http://editorconfig.org) files to provide formatting +hints. Most editors and IDEs support EditorConfig, either directly or via +a plugin. If yours requires a plugin we encourage you to install it. Our +default EditorConfig indentation style for C and C++ files is 4 spaces. + +Many files also have a short comment (modelines) on the indentation logic at +the end of the file. This was required in the past but has been superseded by +EditorConfig. See + + https://www.wireshark.org/tools/modelines.html + +for more information. + +Please do not leave trailing whitespace (spaces/tabs) on lines. + +Quite a bit of our source code has varying indentation styles. When editing an +existing file, try following the existing indentation logic. If you wish to +convert a file to 4 space indentation, please do so in its own commit and be +sure to remove its .editorconfig entry so that the default setting takes +effect. + +6. Compiler warnings + +You should write code that is free of compiler warnings. Such warnings will +often indicate questionable code and sometimes even real bugs, so it's best +to avoid warnings at all. + +The compiler flags in the Makefiles are set to "treat warnings as errors", +so your code won't even compile when warnings occur. + +7. General observations about architecture + +7.1 The global header "wireshark.h" + +You should include the global header <wireshark.h> in your code. However +there are some things to keep in mind when using it and especially +if you are considering modifying it. + +** wireshark.h needs to be minimal: for efficiency reasons, to reduce the +error surface and because every time this header changes everything must be +rebuilt. Consider carefully if another header/module should be included +globally with every project file and exported as public header. + +** No configuration: configuration is specific to the build environment +and target machine. wireshark.h must not depend on that. + +** Only wireshark system headers allowed: plugins use this header and +cannot depend on any header (even indirectly) that is not installed on the +target system. + +** Only global definitions allowed: for example it is acceptable to include +'wsutil' headers in wireshark.h because every component of Wireshark is allowed +to depend on wsutil. wiretap is not acceptable because we cannot introduce +dependencies on wiretap globally (and wireshark.h must be usable everywhere). + +7.2 Best practices using headers + +C files can be categorized in three types: source files, private headers and +public headers. + +A module "foobar" can have only a private header, only a public header, or +both. If it's only one it is named "foobar.h" in both cases. If it is both they +are named "foobar-int.h" and "foobar.h" respectively. + +In general the order of #include's for a C module source files (foobar.c), +assuming foobar implements any kind of interface should be: + + #include "config.h" + #define WS_LOG_DOMAIN "mydomain" + #include "foobar-int.h" + + followed by <system headers> + followed by <wireshark public headers> + followed by <wireshark private headers> + +For header files (private and public) config.h must NOT be included. A public +header file (foobar.h) looks like this: + + #ifndef __FOOBAR_H__ + #define __FOOBAR_H__ + #include <wireshark.h> + followed by <system headers> + followed by <wireshark public headers> + + #ifdef __cplusplus + extern "C" { + #endif + (declarations) + #ifdef __cplusplus + } + #endif + #endif /* FOOBAR_H */ + +A private header (foobar-int.h) is the public header plus the declarations +with private scope: + + #ifndef __FOOBAR_INT_H__ + #define __FOOBAR_INT_H__ + #include "foobar.h" + followed by <system headers> + followed by <wireshark public headers> + followed by <wireshark private headers> + (etc.) + +Again if there are only public or private declarations the name foobar-int.h +is not used. The macro symbol WS_LOG_DOMAIN can be defined in source files or +private headers as long as it comes before wireshark.h. + +7.3 Wireshark internal and external API policy + +Wireshark has several APIs. We need to distinguish between internal +Wireshark library APIs and external Wireshark APIs. Wireshark the project is +composed of many different programs and these executable binaries use a number +of internal libraries to share code efficiently. These internal shared +libraries need to be installed on the system to run the programs (wireshark, +tshark, etc). + +A library's public API includes the symbols exported by the DSO (wsutil, +libwireshark, etc). The internal API is made available in the shared libraries +and exists to support the goals of the project. It is public from the point +of view of Wireshark programs (client users of the internal API). The +external API exists to support plugins (client users of the external API) +and is a loosely defined subset of the internal API plus any infrastructure +required to support a plugin system. Note that these two uses of shared +libraries coexist with a lot of overlap, but are nevertheless distinct. + +The internal (public) API is not considered to be stable and will regularly +change as a normal part of development to support new features, remove cruft, +and whatever else is necessary to make the project sustainable and ease the +burden on developers. There is less freedom to change something that could +break a lot of plugins but this is also acceptable (with cause). + +The plugin ABI policy is to be compatible only between micro releases (also +called patch releases). That means we try to make it unnecessary to recompile +plugins with each micro release (on a best-effort basis). For major.minor +releases it is explicitly required to recompile plugins. There is no stable +ABI contract of any kind in that case. + +Keep in mind that APIs can exist in different scopes and levels of abstraction. +Don't get stuck thinking the words public/private have a very specific +meaning, like being decorated or not with WS_DLL_PUBLIC, although that is a +big part of it usually. + +Also the Wireshark developers have historically tried to keep the Lua API +very stable and provide strong backward-compatibility guarantees. Under this +policy moving from Lua 5.2 is unlikely to happen in the foreseeable future. + +7.4 libwireshark is not a single monolithic entity + +One day we might conceivably wish to load dissectors on demand and do other +more sophisticated kinds of unit test. Plus other scenarios not immediately +obvious. For this to be possible it is important that the code in epan/ does +not depend on code in epan/dissectors, i.e it is possible to compile epan +without linking with dissector code. It helps to view dissectors as clients +of an API provided by epan (libwireshark being constituted by two distinct +components "epan" and "dissectors" bundled together, plus other bits and +pieces). The reverse is not* true; epan should not be the client of an API +provided by dissectors. + +The main way this separation of concerns is achieved is by using runtime +registration interfaces in epan for dissectors, preferences, etc. that are +dynamic and do not have any dissector routines hard coded. Naturally this +is also an essential component of a plugin system (libwireshark has plugins +for taps, dissectors and an experimental interface to augment dissection with +new extension languages). + +7.5 Unicode and string encoding best practices + +Wireshark strings are always encoded in UTF-8 internally, regardless of the +platform where it is running. The C datatype used is "pointer to char" and this +is assumed to point to a valid UTF-8 string. Sometimes older code uses char to +point to opaque byte strings but this archaic usage should be avoided. A better +data type for that is uint8_t. + +Every untrusted string needs to be validated for correct and error-free UTF-8 +encoding, or converted from the source encoding to UTF-8. This should be done +at the periphery of the code. This means converting input during dissection or +when reading input generally. To reiterate: all the Wireshark APIs expect to +receive valid UTF-8 strings. These include proto_tree_add_string(), +proto_item_append_text() and col_append_fstr() just to name a few. + +If a dissector uses standard API functions to handle strings, such as +proto_tree_add_item() with an FT_STRING header field type, the API will +transparently handle the conversion from the source encoding to UTF-8 and +nothing else needs to be done to ensure valid string input. + +If your dissector does text manipulation, token parsing and such and generally +extracts text strings from the TVBuff or tries to do line oriented input from +TVBuffs it *must* make sure it passes only valid UTF-8 to libwireshark APIs. +This should be done using tvb_get_string_enc() to extract a string from a TVbuff +or get_utf_8_string() to validate a string after it has been constructed. + +The Qt API uses UTF-16 for its QString class; when converting between a +QString and a pointer to char, functions that convert to or from UTF-8 +encoded pointers to char (or QByteArrays) such as toUtf8() should be used, +not toLocal8Bit() or toLatin1(). + +8. Miscellaneous notes + +Each commit in your branch corresponds to a different VCSVERSION string +automatically defined in the header 'vcs_version.h' during the build. If you happen +to find it convenient to disable this feature it can be done using: + + touch .git/wireshark-disable-versioning + +i.e., the file 'wireshark-disable-versioning' must exist in the git repo dir. + +/* + * Editor modelines - https://www.wireshark.org/tools/modelines.html + * + * Local variables: + * c-basic-offset: 4 + * tab-width: 8 + * indent-tabs-mode: nil + * End: + * + * vi: set shiftwidth=4 tabstop=8 expandtab: + * :indentSize=4:tabSize=8:noTabs=true: + */ |