diff options
Diffstat (limited to '')
-rw-r--r-- | doc/magic.man | 816 |
1 files changed, 816 insertions, 0 deletions
diff --git a/doc/magic.man b/doc/magic.man new file mode 100644 index 0000000..754d745 --- /dev/null +++ b/doc/magic.man @@ -0,0 +1,816 @@ +.\" $File: magic.man,v 1.101 2022/10/09 18:51:04 christos Exp $ +.Dd October 9, 2022 +.Dt MAGIC __FSECTION__ +.Os +.\" install as magic.4 on USG, magic.5 on V7, Berkeley and Linux systems. +.Sh NAME +.Nm magic +.Nd file command's magic pattern file +.Sh DESCRIPTION +This manual page documents the format of magic files as +used by the +.Xr file __CSECTION__ +command, version __VERSION__. +The +.Xr file __CSECTION__ +command identifies the type of a file using, +among other tests, +a test for whether the file contains certain +.Dq "magic patterns" . +The database of these +.Dq "magic patterns" +is usually located in a binary file in +.Pa __MAGIC__.mgc +or a directory of source text magic pattern fragment files in +.Pa __MAGIC__ . +The database specifies what patterns are to be tested for, what message or +MIME type to print if a particular pattern is found, +and additional information to extract from the file. +.Pp +The format of the source fragment files that are used to build this database +is as follows: +Each line of a fragment file specifies a test to be performed. +A test compares the data starting at a particular offset +in the file with a byte value, a string or a numeric value. +If the test succeeds, a message is printed. +The line consists of the following fields: +.Bl -tag -width ".Dv message" +.It Dv offset +A number specifying the offset (in bytes) into the file of the data +which is to be tested. +This offset can be a negative number if it is: +.Bl -bullet -compact +.It +The first direct offset of the magic entry (at continuation level 0), +in which case it is interpreted an offset from end end of the file +going backwards. +This works only when a file descriptor to the file is available and it +is a regular file. +.It +A continuation offset relative to the end of the last up-level field +.Dv ( \*[Am] ) . +.El +.It Dv type +The type of the data to be tested. +The possible values are: +.Bl -tag -width ".Dv lestring16" +.It Dv byte +A one-byte value. +.It Dv short +A two-byte value in this machine's native byte order. +.It Dv long +A four-byte value in this machine's native byte order. +.It Dv quad +An eight-byte value in this machine's native byte order. +.It Dv float +A 32-bit single precision IEEE floating point number in this machine's native byte order. +.It Dv double +A 64-bit double precision IEEE floating point number in this machine's native byte order. +.It Dv string +A string of bytes. +The string type specification can be optionally followed by a /<width> +option and optionally followed by a set of flags /[bCcftTtWw]*. +The width limits the number of characters to be copied. +Zero means all characters. +The following flags are supported: +.Bl -tag -width B -compact -offset XXXX +.It b +Force binary file test. +.It C +Use upper case insensitive matching: upper case +characters in the magic match both lower and upper case characters in the +target, whereas lower case characters in the magic only match upper case +characters in the target. +.It c +Use lower case insensitive matching: lower case +characters in the magic match both lower and upper case characters in the +target, whereas upper case characters in the magic only match upper case +characters in the target. +To do a complete case insensitive match, specify both +.Dq c +and +.Dq C . +.It f +Require that the matched string is a full word, not a partial word match. +.It T +Trim the string, i.e. leading and trailing whitespace +.It t +Force text file test. +.It W +Compact whitespace in the target, which must +contain at least one whitespace character. +If the magic has +.Dv n +consecutive blanks, the target needs at least +.Dv n +consecutive blanks to match. +.It w +Treat every blank in the magic as an optional blank. +is deleted before the string is printed. +.El +.It Dv pstring +A Pascal-style string where the first byte/short/int is interpreted as the +unsigned length. +The length defaults to byte and can be specified as a modifier. +The following modifiers are supported: +.Bl -tag -width B -compact -offset XXXX +.It B +A byte length (default). +.It H +A 2 byte big endian length. +.It h +A 2 byte little endian length. +.It L +A 4 byte big endian length. +.It l +A 4 byte little endian length. +.It J +The length includes itself in its count. +.El +The string is not NUL terminated. +.Dq J +is used rather than the more +valuable +.Dq I +because this type of length is a feature of the JPEG +format. +.It Dv date +A four-byte value interpreted as a UNIX date. +.It Dv qdate +An eight-byte value interpreted as a UNIX date. +.It Dv ldate +A four-byte value interpreted as a UNIX-style date, but interpreted as +local time rather than UTC. +.It Dv qldate +An eight-byte value interpreted as a UNIX-style date, but interpreted as +local time rather than UTC. +.It Dv qwdate +An eight-byte value interpreted as a Windows-style date. +.It Dv beid3 +A 32-bit ID3 length in big-endian byte order. +.It Dv beshort +A two-byte value in big-endian byte order. +.It Dv belong +A four-byte value in big-endian byte order. +.It Dv bequad +An eight-byte value in big-endian byte order. +.It Dv befloat +A 32-bit single precision IEEE floating point number in big-endian byte order. +.It Dv bedouble +A 64-bit double precision IEEE floating point number in big-endian byte order. +.It Dv bedate +A four-byte value in big-endian byte order, +interpreted as a Unix date. +.It Dv beqdate +An eight-byte value in big-endian byte order, +interpreted as a Unix date. +.It Dv beldate +A four-byte value in big-endian byte order, +interpreted as a UNIX-style date, but interpreted as local time rather +than UTC. +.It Dv beqldate +An eight-byte value in big-endian byte order, +interpreted as a UNIX-style date, but interpreted as local time rather +than UTC. +.It Dv beqwdate +An eight-byte value in big-endian byte order, +interpreted as a Windows-style date. +.It Dv bestring16 +A two-byte unicode (UCS16) string in big-endian byte order. +.It Dv leid3 +A 32-bit ID3 length in little-endian byte order. +.It Dv leshort +A two-byte value in little-endian byte order. +.It Dv lelong +A four-byte value in little-endian byte order. +.It Dv lequad +An eight-byte value in little-endian byte order. +.It Dv lefloat +A 32-bit single precision IEEE floating point number in little-endian byte order. +.It Dv ledouble +A 64-bit double precision IEEE floating point number in little-endian byte order. +.It Dv ledate +A four-byte value in little-endian byte order, +interpreted as a UNIX date. +.It Dv leqdate +An eight-byte value in little-endian byte order, +interpreted as a UNIX date. +.It Dv leldate +A four-byte value in little-endian byte order, +interpreted as a UNIX-style date, but interpreted as local time rather +than UTC. +.It Dv leqldate +An eight-byte value in little-endian byte order, +interpreted as a UNIX-style date, but interpreted as local time rather +than UTC. +.It Dv leqwdate +An eight-byte value in little-endian byte order, +interpreted as a Windows-style date. +.It Dv lestring16 +A two-byte unicode (UCS16) string in little-endian byte order. +.It Dv melong +A four-byte value in middle-endian (PDP-11) byte order. +.It Dv medate +A four-byte value in middle-endian (PDP-11) byte order, +interpreted as a UNIX date. +.It Dv meldate +A four-byte value in middle-endian (PDP-11) byte order, +interpreted as a UNIX-style date, but interpreted as local time rather +than UTC. +.It Dv indirect +Starting at the given offset, consult the magic database again. +The offset of the +.Dv indirect +magic is by default absolute in the file, but one can specify +.Dv /r +to indicate that the offset is relative from the beginning of the entry. +.It Dv name +Define a +.Dq named +magic instance that can be called from another +.Dv use +magic entry, like a subroutine call. +Named instance direct magic offsets are relative to the offset of the +previous matched entry, but indirect offsets are relative to the beginning +of the file as usual. +Named magic entries always match. +.It Dv use +Recursively call the named magic starting from the current offset. +If the name of the referenced begins with a +.Dv ^ +then the endianness of the magic is switched; if the magic mentioned +.Dv leshort +for example, +it is treated as +.Dv beshort +and vice versa. +This is useful to avoid duplicating the rules for different endianness. +.It Dv regex +A regular expression match in extended POSIX regular expression syntax +(like egrep). +Regular expressions can take exponential time to process, and their +performance is hard to predict, so their use is discouraged. +When used in production environments, their performance +should be carefully checked. +The size of the string to search should also be limited by specifying +.Dv /<length> , +to avoid performance issues scanning long files. +The type specification can also be optionally followed by +.Dv /[c][s][l] . +The +.Dq c +flag makes the match case insensitive, while the +.Dq s +flag update the offset to the start offset of the match, rather than the end. +The +.Dq l +modifier, changes the limit of length to mean number of lines instead of a +byte count. +Lines are delimited by the platforms native line delimiter. +When a line count is specified, an implicit byte count also computed assuming +each line is 80 characters long. +If neither a byte or line count is specified, the search is limited automatically +to 8KiB. +.Dv ^ +and +.Dv $ +match the beginning and end of individual lines, respectively, +not beginning and end of file. +.It Dv search +A literal string search starting at the given offset. +The same modifier flags can be used as for string patterns. +The search expression must contain the range in the form +.Dv /number, +that is the number of positions at which the match will be +attempted, starting from the start offset. +This is suitable for +searching larger binary expressions with variable offsets, using +.Dv \e +escapes for special characters. +The order of modifier and number is not relevant. +.It Dv default +This is intended to be used with the test +.Em x +(which is always true) and it has no type. +It matches when no other test at that continuation level has matched before. +Clearing that matched tests for a continuation level, can be done using the +.Dv clear +test. +.It Dv clear +This test is always true and clears the match flag for that continuation level. +It is intended to be used with the +.Dv default +test. +.It Dv der +Parse the file as a DER Certificate file. +The test field is used as a der type that needs to be matched. +The DER types are: +.Dv eoc , +.Dv bool , +.Dv int , +.Dv bit_str , +.Dv octet_str , +.Dv null , +.Dv obj_id , +.Dv obj_desc , +.Dv ext , +.Dv real , +.Dv enum , +.Dv embed , +.Dv utf8_str , +.Dv rel_oid , +.Dv time , +.Dv res2 , +.Dv seq , +.Dv set , +.Dv num_str , +.Dv prt_str , +.Dv t61_str , +.Dv vid_str , +.Dv ia5_str , +.Dv utc_time , +.Dv gen_time , +.Dv gr_str , +.Dv vis_str , +.Dv gen_str , +.Dv univ_str , +.Dv char_str , +.Dv bmp_str , +.Dv date , +.Dv tod , +.Dv datetime , +.Dv duration , +.Dv oid-iri , +.Dv rel-oid-iri . +These types can be followed by an optional numeric size, which indicates +the field width in bytes. +.It Dv guid +A Globally Unique Identifier, parsed and printed as +XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX. +It's format is a string. +.It Dv offset +This is a quad value indicating the current offset of the file. +It can be used to determine the size of the file or the magic buffer. +For example the magic entries: +.Bd -literal -offset indent +-0 offset x this file is %lld bytes +-0 offset <=100 must be more than 100 \e + bytes and is only %lld +.Ed +.It Dv octal +A string representing an octal number. +.El +.El +.Pp +For compatibility with the Single +.Ux +Standard, the type specifiers +.Dv dC +and +.Dv d1 +are equivalent to +.Dv byte , +the type specifiers +.Dv uC +and +.Dv u1 +are equivalent to +.Dv ubyte , +the type specifiers +.Dv dS +and +.Dv d2 +are equivalent to +.Dv short , +the type specifiers +.Dv uS +and +.Dv u2 +are equivalent to +.Dv ushort , +the type specifiers +.Dv dI , +.Dv dL , +and +.Dv d4 +are equivalent to +.Dv long , +the type specifiers +.Dv uI , +.Dv uL , +and +.Dv u4 +are equivalent to +.Dv ulong , +the type specifier +.Dv d8 +is equivalent to +.Dv quad , +the type specifier +.Dv u8 +is equivalent to +.Dv uquad , +and the type specifier +.Dv s +is equivalent to +.Dv string . +In addition, the type specifier +.Dv dQ +is equivalent to +.Dv quad +and the type specifier +.Dv uQ +is equivalent to +.Dv uquad . +.Pp +Each top-level magic pattern (see below for an explanation of levels) +is classified as text or binary according to the types used. +Types +.Dq regex +and +.Dq search +are classified as text tests, unless non-printable characters are used +in the pattern. +All other tests are classified as binary. +A top-level +pattern is considered to be a test text when all its patterns are text +patterns; otherwise, it is considered to be a binary pattern. +When +matching a file, binary patterns are tried first; if no match is +found, and the file looks like text, then its encoding is determined +and the text patterns are tried. +.Pp +The numeric types may optionally be followed by +.Dv \*[Am] +and a numeric value, +to specify that the value is to be AND'ed with the +numeric value before any comparisons are done. +Prepending a +.Dv u +to the type indicates that ordered comparisons should be unsigned. +.It Dv test +The value to be compared with the value from the file. +If the type is +numeric, this value +is specified in C form; if it is a string, it is specified as a C string +with the usual escapes permitted (e.g. \en for new-line). +.Pp +Numeric values +may be preceded by a character indicating the operation to be performed. +It may be +.Dv = , +to specify that the value from the file must equal the specified value, +.Dv \*[Lt] , +to specify that the value from the file must be less than the specified +value, +.Dv \*[Gt] , +to specify that the value from the file must be greater than the specified +value, +.Dv \*[Am] , +to specify that the value from the file must have set all of the bits +that are set in the specified value, +.Dv ^ , +to specify that the value from the file must have clear any of the bits +that are set in the specified value, or +.Dv ~ , +the value specified after is negated before tested. +.Dv x , +to specify that any value will match. +If the character is omitted, it is assumed to be +.Dv = . +Operators +.Dv \*[Am] , +.Dv ^ , +and +.Dv ~ +don't work with floats and doubles. +The operator +.Dv !\& +specifies that the line matches if the test does +.Em not +succeed. +.Pp +Numeric values are specified in C form; e.g. +.Dv 13 +is decimal, +.Dv 013 +is octal, and +.Dv 0x13 +is hexadecimal. +.Pp +Numeric operations are not performed on date types, instead the numeric +value is interpreted as an offset. +.Pp +For string values, the string from the +file must match the specified string. +The operators +.Dv = , +.Dv \*[Lt] +and +.Dv \*[Gt] +(but not +.Dv \*[Am] ) +can be applied to strings. +The length used for matching is that of the string argument +in the magic file. +This means that a line can match any non-empty string (usually used to +then print the string), with +.Em \*[Gt]\e0 +(because all non-empty strings are greater than the empty string). +.Pp +Dates are treated as numerical values in the respective internal +representation. +.Pp +The special test +.Em x +always evaluates to true. +.It Dv message +The message to be printed if the comparison succeeds. +If the string contains a +.Xr printf 3 +format specification, the value from the file (with any specified masking +performed) is printed using the message as the format string. +If the string begins with +.Dq \eb , +the message printed is the remainder of the string with no whitespace +added before it: multiple matches are normally separated by a single +space. +.El +.Pp +An APPLE 4+4 character APPLE creator and type can be specified as: +.Bd -literal -offset indent +!:apple CREATYPE +.Ed +.Pp +A MIME type is given on a separate line, which must be the next +non-blank or comment line after the magic line that identifies the +file type, and has the following format: +.Bd -literal -offset indent +!:mime MIMETYPE +.Ed +.Pp +i.e. the literal string +.Dq !:mime +followed by the MIME type. +.Pp +An optional strength can be supplied on a separate line which refers to +the current magic description using the following format: +.Bd -literal -offset indent +!:strength OP VALUE +.Ed +.Pp +The operand +.Dv OP +can be: +.Dv + , +.Dv - , +.Dv * , +or +.Dv / +and +.Dv VALUE +is a constant between 0 and 255. +This constant is applied using the specified operand +to the currently computed default magic strength. +.Pp +Some file formats contain additional information which is to be printed +along with the file type or need additional tests to determine the true +file type. +These additional tests are introduced by one or more +.Em \*[Gt] +characters preceding the offset. +The number of +.Em \*[Gt] +on the line indicates the level of the test; a line with no +.Em \*[Gt] +at the beginning is considered to be at level 0. +Tests are arranged in a tree-like hierarchy: +if the test on a line at level +.Em n +succeeds, all following tests at level +.Em n+1 +are performed, and the messages printed if the tests succeed, until a line +with level +.Em n +(or less) appears. +For more complex files, one can use empty messages to get just the +"if/then" effect, in the following way: +.Bd -literal -offset indent +0 string MZ +\*[Gt]0x18 leshort \*[Lt]0x40 MS-DOS executable +\*[Gt]0x18 leshort \*[Gt]0x3f extended PC executable (e.g., MS Windows) +.Ed +.Pp +Offsets do not need to be constant, but can also be read from the file +being examined. +If the first character following the last +.Em \*[Gt] +is a +.Em \&( +then the string after the parenthesis is interpreted as an indirect offset. +That means that the number after the parenthesis is used as an offset in +the file. +The value at that offset is read, and is used again as an offset +in the file. +Indirect offsets are of the form: +.Em (( x [[.,][bBcCeEfFgGhHiIlmsSqQ]][+\-][ y ]) . +The value of +.Em x +is used as an offset in the file. +A byte, id3 length, short or long is read at that offset depending on the +.Em [bBcCeEfFgGhHiIlmsSqQ] +type specifier. +The value is treated as signed if +.Dq , +is specified or unsigned if +.Dq . +is specified. +The capitalized types interpret the number as a big endian +value, whereas the small letter versions interpret the number as a little +endian value; +the +.Em m +type interprets the number as a middle endian (PDP-11) value. +To that number the value of +.Em y +is added and the result is used as an offset in the file. +The default type if one is not specified is long. +The following types are recognized: +.Bl -column -offset indent "Type" "Half/Short" "Little" "Size" +.It Sy Type Sy Mnemonic Sy Endian Sy Size +.It bcBc Byte/Char N/A 1 +.It efg Double Little 8 +.It EFG Double Big 8 +.It hs Half/Short Little 2 +.It HS Half/Short Big 2 +.It i ID3 Little 4 +.It I ID3 Big 4 +.It m Middle Middle 4 +.It o Octal Textual Variable +.It q Quad Little 8 +.It Q Quad Big 8 +.El +.Pp +That way variable length structures can be examined: +.Bd -literal -offset indent +# MS Windows executables are also valid MS-DOS executables +0 string MZ +\*[Gt]0x18 leshort \*[Lt]0x40 MZ executable (MS-DOS) +# skip the whole block below if it is not an extended executable +\*[Gt]0x18 leshort \*[Gt]0x3f +\*[Gt]\*[Gt](0x3c.l) string PE\e0\e0 PE executable (MS-Windows) +\*[Gt]\*[Gt](0x3c.l) string LX\e0\e0 LX executable (OS/2) +.Ed +.Pp +This strategy of examining has a drawback: you must make sure that you +eventually print something, or users may get empty output (such as when +there is neither PE\e0\e0 nor LE\e0\e0 in the above example). +.Pp +If this indirect offset cannot be used directly, simple calculations are +possible: appending +.Em [+-*/%\*[Am]|^]number +inside parentheses allows one to modify +the value read from the file before it is used as an offset: +.Bd -literal -offset indent +# MS Windows executables are also valid MS-DOS executables +0 string MZ +# sometimes, the value at 0x18 is less that 0x40 but there's still an +# extended executable, simply appended to the file +\*[Gt]0x18 leshort \*[Lt]0x40 +\*[Gt]\*[Gt](4.s*512) leshort 0x014c COFF executable (MS-DOS, DJGPP) +\*[Gt]\*[Gt](4.s*512) leshort !0x014c MZ executable (MS-DOS) +.Ed +.Pp +Sometimes you do not know the exact offset as this depends on the length or +position (when indirection was used before) of preceding fields. +You can specify an offset relative to the end of the last up-level +field using +.Sq \*[Am] +as a prefix to the offset: +.Bd -literal -offset indent +0 string MZ +\*[Gt]0x18 leshort \*[Gt]0x3f +\*[Gt]\*[Gt](0x3c.l) string PE\e0\e0 PE executable (MS-Windows) +# immediately following the PE signature is the CPU type +\*[Gt]\*[Gt]\*[Gt]\*[Am]0 leshort 0x14c for Intel 80386 +\*[Gt]\*[Gt]\*[Gt]\*[Am]0 leshort 0x184 for DEC Alpha +.Ed +.Pp +Indirect and relative offsets can be combined: +.Bd -literal -offset indent +0 string MZ +\*[Gt]0x18 leshort \*[Lt]0x40 +\*[Gt]\*[Gt](4.s*512) leshort !0x014c MZ executable (MS-DOS) +# if it's not COFF, go back 512 bytes and add the offset taken +# from byte 2/3, which is yet another way of finding the start +# of the extended executable +\*[Gt]\*[Gt]\*[Gt]\*[Am](2.s-514) string LE LE executable (MS Windows VxD driver) +.Ed +.Pp +Or the other way around: +.Bd -literal -offset indent +0 string MZ +\*[Gt]0x18 leshort \*[Gt]0x3f +\*[Gt]\*[Gt](0x3c.l) string LE\e0\e0 LE executable (MS-Windows) +# at offset 0x80 (-4, since relative offsets start at the end +# of the up-level match) inside the LE header, we find the absolute +# offset to the code area, where we look for a specific signature +\*[Gt]\*[Gt]\*[Gt](\*[Am]0x7c.l+0x26) string UPX \eb, UPX compressed +.Ed +.Pp +Or even both! +.Bd -literal -offset indent +0 string MZ +\*[Gt]0x18 leshort \*[Gt]0x3f +\*[Gt]\*[Gt](0x3c.l) string LE\e0\e0 LE executable (MS-Windows) +# at offset 0x58 inside the LE header, we find the relative offset +# to a data area where we look for a specific signature +\*[Gt]\*[Gt]\*[Gt]\*[Am](\*[Am]0x54.l-3) string UNACE \eb, ACE self-extracting archive +.Ed +.Pp +If you have to deal with offset/length pairs in your file, even the +second value in a parenthesized expression can be taken from the file itself, +using another set of parentheses. +Note that this additional indirect offset is always relative to the +start of the main indirect offset. +.Bd -literal -offset indent +0 string MZ +\*[Gt]0x18 leshort \*[Gt]0x3f +\*[Gt]\*[Gt](0x3c.l) string PE\e0\e0 PE executable (MS-Windows) +# search for the PE section called ".idata"... +\*[Gt]\*[Gt]\*[Gt]\*[Am]0xf4 search/0x140 .idata +# ...and go to the end of it, calculated from start+length; +# these are located 14 and 10 bytes after the section name +\*[Gt]\*[Gt]\*[Gt]\*[Gt](\*[Am]0xe.l+(-4)) string PK\e3\e4 \eb, ZIP self-extracting archive +.Ed +.Pp +If you have a list of known values at a particular continuation level, +and you want to provide a switch-like default case: +.Bd -literal -offset indent +# clear that continuation level match +\*[Gt]18 clear +\*[Gt]18 lelong 1 one +\*[Gt]18 lelong 2 two +\*[Gt]18 default x +# print default match +\*[Gt]\*[Gt]18 lelong x unmatched 0x%x +.Ed +.Sh SEE ALSO +.Xr file __CSECTION__ +\- the command that reads this file. +.Sh BUGS +The formats +.Dv long , +.Dv belong , +.Dv lelong , +.Dv melong , +.Dv short , +.Dv beshort , +and +.Dv leshort +do not depend on the length of the C data types +.Dv short +and +.Dv long +on the platform, even though the Single +.Ux +Specification implies that they do. However, as OS X Mountain Lion has +passed the Single +.Ux +Specification validation suite, and supplies a version of +.Xr file __CSECTION__ +in which they do not depend on the sizes of the C data types and that is +built for a 64-bit environment in which +.Dv long +is 8 bytes rather than 4 bytes, presumably the validation suite does not +test whether, for example +.Dv long +refers to an item with the same size as the C data type +.Dv long . +There should probably be +.Dv type +names +.Dv int8 , +.Dv uint8 , +.Dv int16 , +.Dv uint16 , +.Dv int32 , +.Dv uint32 , +.Dv int64 , +and +.Dv uint64 , +and specified-byte-order variants of them, +to make it clearer that those types have specified widths. +.\" +.\" From: guy@sun.uucp (Guy Harris) +.\" Newsgroups: net.bugs.usg +.\" Subject: /etc/magic's format isn't well documented +.\" Message-ID: <2752@sun.uucp> +.\" Date: 3 Sep 85 08:19:07 GMT +.\" Organization: Sun Microsystems, Inc. +.\" Lines: 136 +.\" +.\" Here's a manual page for the format accepted by the "file" made by adding +.\" the changes I posted to the S5R2 version. +.\" +.\" Modified for Ian Darwin's version of the file command. |