From efe11df4b26426c3db4f96592e899b1f005a878e Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Sun, 7 May 2017 17:50:01 +0200 Subject: Merging upstream version 1.9. Signed-off-by: Daniel Baumann --- doc/clzip.info | 1029 +++++++++++++++++++++++++++++++++++++++++++++++++++++--- 1 file changed, 977 insertions(+), 52 deletions(-) (limited to 'doc/clzip.info') diff --git a/doc/clzip.info b/doc/clzip.info index c590473..f6e483f 100644 --- a/doc/clzip.info +++ b/doc/clzip.info @@ -11,21 +11,24 @@ File: clzip.info, Node: Top, Next: Introduction, Up: (dir) Clzip Manual ************ -This manual is for Clzip (version 1.8, 13 May 2016). +This manual is for Clzip (version 1.9, 13 April 2017). * Menu: * Introduction:: Purpose and features of clzip * Invoking clzip:: Command line interface +* Quality assurance:: Design, development and testing of lzip * File format:: Detailed format of the compressed file * Algorithm:: How clzip compresses the data +* Stream format:: Format of the LZMA stream in lzip files * Trailing data:: Extra data appended to the file * Examples:: A small tutorial with examples * Problems:: Reporting bugs +* Reference source code:: Source code illustrating stream format * Concept index:: Index of concepts - Copyright (C) 2010-2016 Antonio Diaz Diaz. + Copyright (C) 2010-2017 Antonio Diaz Diaz. This manual is free documentation: you have unlimited permission to copy, distribute and modify it. @@ -36,15 +39,16 @@ File: clzip.info, Node: Introduction, Next: Invoking clzip, Prev: Top, Up: T 1 Introduction ************** -Clzip is a lossless data compressor with a user interface similar to the -one of gzip or bzip2. Clzip is about as fast as gzip, compresses most -files more than bzip2, and is better than both from a data recovery -perspective. +Clzip is a C language version of lzip, fully compatible with lzip-1.4 or +newer. As clzip is written in C, it may be easier to integrate in +applications like package managers, embedded devices, or systems lacking +a C++ compiler. - Clzip uses the lzip file format; the files produced by clzip are -fully compatible with lzip-1.4 or newer, and can be rescued with -lziprecover. Clzip is in fact a C language version of lzip, intended -for embedded devices or systems lacking a C++ compiler. + Lzip is a lossless data compressor with a user interface similar to +the one of gzip or bzip2. Lzip can compress about as fast as gzip +(lzip -0), or compress most files more than bzip2 (lzip -9). +Decompression speed is intermediate between gzip and bzip2. Lzip is +better than gzip and bzip2 from a data recovery perspective. The lzip file format is designed for data sharing and long-term archiving, taking into account both data integrity and decoder @@ -58,11 +62,11 @@ availability: (lziprecover)Data safety. * The lzip format is as simple as possible (but not simpler). The - lzip manual provides the code of a simple decompressor along with - a detailed explanation of how it works, so that with the only help - of the lzip manual it would be possible for a digital - archaeologist to extract the data from a lzip file long after - quantum computers eventually render LZMA obsolete. + lzip manual provides the source code of a simple decompressor + along with a detailed explanation of how it works, so that with + the only help of the lzip manual it would be possible for a + digital archaeologist to extract the data from a lzip file long + after quantum computers eventually render LZMA obsolete. * Additionally the lzip reference implementation is copylefted, which guarantees that it will remain free forever. @@ -128,9 +132,9 @@ two or more compressed files. The result is the concatenation of the corresponding uncompressed files. Integrity testing of concatenated compressed files is also supported. - Clzip can produce multimember files and safely recover, with -lziprecover, the undamaged members in case of file damage. Clzip can -also split the compressed output in volumes of a given size, even when + Clzip can produce multimember files, and lziprecover can safely +recover the undamaged members in case of file damage. Clzip can also +split the compressed output in volumes of a given size, even when reading from standard input. This allows the direct creation of multivolume compressed tar archives. @@ -138,8 +142,12 @@ multivolume compressed tar archives. automatically creating multimember output. The members so created are large, about 2 PiB each. + LANGUAGE NOTE: Uncompressed = not compressed = plain data; it may +never have been compressed. Decompressed is used to refer to data which +have undergone the process of decompression. +  -File: clzip.info, Node: Invoking clzip, Next: File format, Prev: Introduction, Up: Top +File: clzip.info, Node: Invoking clzip, Next: Quality assurance, Prev: Introduction, Up: Top 2 Invoking clzip **************** @@ -205,6 +213,21 @@ command line. Keep (don't delete) input files during compression or decompression. +'-l' +'--list' + Print the uncompressed size, compressed size and percentage saved + of the specified file(s). Trailing data are ignored. The values + produced are correct even for multimember files. If more than one + file is given, a final line containing the cumulative sizes is + printed. With '-v', the dictionary size, the number of members in + the file, and the amount of trailing data (if any) are also + printed. With '-vv', the positions and sizes of each member in + multimember files are also printed. '-lq' can be used to verify + quickly (without decompressing) the structural integrity of the + specified files. (Use '--test' to verify the data integrity). + '-alq' additionally verifies that none of the specified files + contain trailing data. + '-m BYTES' '--match-length=BYTES' Set the match length limit in bytes. After a match this long is @@ -254,8 +277,9 @@ command line. Check integrity of the specified file(s), but don't decompress them. This really performs a trial decompression and throws away the result. Use it together with '-v' to see information about - the file(s). If a file fails the test, clzip continues checking - the rest of the files. + the file(s). If a file fails the test, does not exist, can't be + opened, or is a terminal, clzip continues checking the rest of the + files. '-v' '--verbose' @@ -265,7 +289,8 @@ command line. When decompressing or testing, further -v's (up to 4) increase the verbosity level, showing status, compression ratio, dictionary size, trailer contents (CRC, data size, member size), and up to 6 - bytes of trailing data (if any). + bytes of trailing data (if any) both in hexadecimal and as a + string of printable ASCII characters. '-0 .. -9' Set the compression parameters (dictionary size and match length @@ -317,9 +342,186 @@ invalid input file, 3 for an internal consistency error (eg, bug) which caused clzip to panic.  -File: clzip.info, Node: File format, Next: Algorithm, Prev: Invoking clzip, Up: Top +File: clzip.info, Node: Quality assurance, Next: File format, Prev: Invoking clzip, Up: Top + +3 Design, development and testing of lzip +***************************************** + +There are two ways of constructing a software design: One way is to make +it so simple that there are obviously no deficiencies and the other way +is to make it so complicated that there are no obvious deficiencies. The +first method is far more difficult. +-- C.A.R. Hoare + + Lzip has been designed, written and tested with great care to be the +standard general-purpose compressor for unix-like systems. This chapter +describes the lessons learned from previous compressors (gzip and +bzip2), and their application to the design of lzip. + + +3.1 Format design +================= + +When gzip was designed in 1992, computers and operating systems were +much less capable than they are today. Gzip tried to work around some of +those limitations, like 8.3 file names, with additional fields in its +file format. + + Today those limitations have mostly disappeared, and the format of +gzip has proved to be unnecessarily complicated. It includes fields +that were never used, others that have lost their usefulness, and +finally others that have become too limited. + + Bzip2 was designed 5 years later, and its format is simpler than the +one of gzip. + + Probably the worst defect of the gzip format from the point of view +of data safety is the variable size of its header. If the byte at +offset 3 (flags) of a gzip member gets corrupted, it may become very +difficult to recover the data, even if the compressed blocks are +intact, because it can't be known with certainty where the compressed +blocks begin. + + By contrast, the header of a lzip member has a fixed length of 6. The +LZMA stream in a lzip member always starts at offset 6, making it +trivial to recover the data even if the whole header becomes corrupt. + + Bzip2 also provides a header of fixed length and marks the begin and +end of each compressed block with six magic bytes, making it possible to +find the compressed blocks even in case of file damage. But bzip2 does +not store the size of each compressed block, as lzip does. + + Lzip provides better data recovery capabilities than any other +gzip-like compressor because its format has been designed from the +beginning to be simple and safe. It also helps that the LZMA data +stream as used by lzip is extraordinarily safe. It provides embedded +error detection. Any distance larger than the dictionary size acts as a +forbidden symbol, allowing the decompressor to detect the approximate +position of errors, and leaving very little work for the check sequence +(CRC and data sizes) in the detection of errors. Lzip is usually able +to detect all posible bit-flips in the compressed data without +resorting to the check sequence. It would be very difficult to write an +automatic recovery tool like lziprecover for the gzip format. And, as +far as I know, it has never been written. + + Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the +decompressed data because it provides more accurate error detection than +CRC64 up to a compressed size of about 16 GiB, a size larger than that +of most files. In the case of lzip, the additional detection capability +of the decompressor reduces the probability of undetected errors more +than a million times, making CRC32 more accurate than CRC64 up to about +20 PiB of compressed size. + + The lzip format is designed for long-term archiving. Therefore it +excludes any unneeded features that may interfere with the future +extraction of the uncompressed data. + + +3.1.1 Gzip format (mis)features not present in lzip +--------------------------------------------------- + +'Multiple algorithms' + Gzip provides a CM (Compression Method) field that has never been + used because it is a bad idea to begin with. New compression + methods may require additional fields, making it impossible to + implement new methods and, at the same time, keep the same format. + This field does not solve the problem of format proliferation; it + just makes the problem less obvious. + +'Optional fields in header' + Unless special precautions are taken, optional fields are + generally a bad idea because they produce a header of variable + size. The gzip header has 2 fields that, in addition to being + optional, are zero-terminated. This means that if any byte inside + the field gets zeroed, or if the terminating zero gets altered, + gzip won't be able to find neither the header CRC nor the + compressed blocks. + +'Optional CRC for the header' + Using an optional checksum for the header is not only a bad idea, + it is an error; it may prevent the extraction of perfectly good + data. For example, if the checksum is used and the bit enabling it + is reset by a bit-flip, the header will appear to be intact (in + spite of being corrupt) while the compressed blocks will appear to + be totally unrecoverable (in spite of being intact). Very + misleading indeed. + + +3.1.2 Lzip format improvements over gzip and bzip2 +-------------------------------------------------- + +'64-bit size field' + Probably the most frequently reported shortcoming of the gzip + format is that it only stores the least significant 32 bits of the + uncompressed size. The size of any file larger than 4 GiB gets + truncated. + + Bzip2 does not store the uncompressed size of the file. + + The lzip format provides a 64-bit field for the uncompressed size. + Additionaly, lzip produces multimember output automatically when + the size is too large for a single member, allowing for an + unlimited uncompressed size. + +'Distributed index' + The lzip format provides a distributed index that, among other + things, helps plzip to decompress several times faster than pigz + and helps lziprecover do its job. Neither the gzip format nor the + bzip2 format do provide an index. + + A distributed index is safer and more scalable than a monolithic + index. The monolithic index introduces a single point of failure + in the compressed file and may limit the number of members or the + total uncompressed size. + + +3.2 Quality of implementation +============================= + +'Accurate and robust error detection' + The lzip format provides 3 factor integrity checking and the + decompressors report mismatches in each factor separately. This + way if just one byte in one factor fails but the other two factors + match the data, it probably means that the data are intact and the + corruption just affects the mismatching factor (CRC or data size) + in the check sequence. + +'Multiple implementations' + Just like the lzip format provides 3 factor protection against + undetected data corruption, the development methodology of the lzip + family of compressors provides 3 factor protection against + undetected programming errors. + + Three related but independent compressor implementations, lzip, + clzip and minilzip/lzlib, are developed concurrently. Every stable + release of any of them is subjected to a hundred hours of + intensive testing to verify that it produces identical output to + the other two. This guarantees that all three implement the same + algorithm, and makes it unlikely that any of them may contain + serious undiscovered errors. In fact, no errors have been + discovered in lzip since 2009. + + Additionally, the three implementations have been extensively + tested with unzcrash, valgrind and 'american fuzzy lop' without + finding a single vulnerability or false negative. *Note Unzcrash: + (lziprecover)Unzcrash. + +'Dictionary size' + Lzip automatically uses the smallest possible dictionary size for + each file. In addition to reducing the amount of memory required + for decompression, this feature also minimizes the probability of + being affected by RAM errors during compression. + +'Exit status' + Returning a warning status of 2 is a design flaw of compress that + leaked into the design of gzip. Both bzip2 and lzip are free from + this flaw. + + + +File: clzip.info, Node: File format, Next: Algorithm, Prev: Quality assurance, Up: Top -3 File format +4 File format ************* Perfection is reached, not when there is no longer anything to add, but @@ -371,8 +573,8 @@ additional information before, between, or after them. 'LZMA stream' The LZMA stream, finished by an end of stream marker. Uses default - values for encoder properties. *Note Stream format: (lzip)Stream - format, for a complete description. + values for encoder properties. *Note Stream format::, for a + complete description. 'CRC32 (4 bytes)' CRC of the uncompressed original data. @@ -388,9 +590,9 @@ additional information before, between, or after them.  -File: clzip.info, Node: Algorithm, Next: Trailing data, Prev: File format, Up: Top +File: clzip.info, Node: Algorithm, Next: Stream format, Prev: File format, Up: Top -4 Algorithm +5 Algorithm *********** In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a @@ -454,21 +656,264 @@ range encoding), Igor Pavlov (for putting all the above together in LZMA), and Julian Seward (for bzip2's CLI).  -File: clzip.info, Node: Trailing data, Next: Examples, Prev: Algorithm, Up: Top +File: clzip.info, Node: Stream format, Next: Trailing data, Prev: Algorithm, Up: Top + +6 Format of the LZMA stream in lzip files +***************************************** + +The LZMA algorithm has three parameters, called "special LZMA +properties", to adjust it for some kinds of binary data. These +parameters are; 'literal_context_bits' (with a default value of 3), +'literal_pos_state_bits' (with a default value of 0), and +'pos_state_bits' (with a default value of 2). As a general purpose +compressor, lzip only uses the default values for these parameters. In +particular 'literal_pos_state_bits' has been optimized away and does +not even appear in the code. + + Lzip also finishes the LZMA stream with an "End Of Stream" marker +(the distance-length pair 0xFFFFFFFFU, 2), which in conjunction with the +"member size" field in the member trailer allows the verification of +stream integrity. The LZMA stream in lzip files always has these two +features (default properties and EOS marker) and is referred to in this +document as LZMA-302eos or LZMA-lzip. + + The second stage of LZMA is a range encoder that uses a different +probability model for each type of symbol; distances, lengths, literal +bytes, etc. Range encoding conceptually encodes all the symbols of the +message into one number. Unlike Huffman coding, which assigns to each +symbol a bit-pattern and concatenates all the bit-patterns together, +range encoding can compress one symbol to less than one bit. Therefore +the compressed data produced by a range encoder can't be split in pieces +that could be individually described. + + It seems that the only way of describing the LZMA-302eos stream is +describing the algorithm that decodes it. And given the many details +about the range decoder that need to be described accurately, the source +code of a real decoder seems the only appropriate reference to use. + + What follows is a description of the decoding algorithm for +LZMA-302eos streams using as reference the source code of "lzd", an +educational decompressor for lzip files which can be downloaded from +the lzip download directory. The source code of lzd is included in +appendix A. *Note Reference source code::. + + +6.1 What is coded +================= + +The LZMA stream includes literals, matches and repeated matches (matches +reusing a recently used distance). There are 7 different coding +sequences: + +Bit sequence Name Description +--------------------------------------------------------------------------- +0 + byte literal literal byte +1 + 0 + len + dis match distance-length pair +1 + 1 + 0 + 0 shortrep 1 byte match at latest used distance +1 + 1 + 0 + 1 + len rep0 len bytes match at latest used + distance +1 + 1 + 1 + 0 + len rep1 len bytes match at second latest + used distance +1 + 1 + 1 + 1 + 0 + len rep2 len bytes match at third latest used + distance +1 + 1 + 1 + 1 + 1 + len rep3 len bytes match at fourth latest + used distance + + + In the following tables, multibit sequences are coded in normal +order, from MSB to LSB, except where noted otherwise. + + Lengths (the 'len' in the table above) are coded as follows: + +Bit sequence Description +-------------------------------------------------------------------------- +0 + 3 bits lengths from 2 to 9 +1 + 0 + 3 bits lengths from 10 to 17 +1 + 1 + 8 bits lengths from 18 to 273 + + + The coding of distances is a little more complicated, so I'll begin +explaining a simpler version of the encoding. + + Imagine you need to code a number from 0 to 2^32 - 1, and you want +to do it in a way that produces shorter codes for the smaller numbers. +You may first send the position of the most significant bit that is set +to 1, which you may find by making a bit scan from the left (from the +MSB). A position of 0 means that the number is 0 (no bit is set), 1 +means the LSB is the first bit set (the number is 1), and 32 means the +MSB is set (i.e., the number is >= 0x80000000). Let's call this bit +position a "slot". Then, if slot is > 1, you send the remaining +slot - 1 bits. Let's call these bits "direct_bits" because they are +coded directly by value instead of indirectly by position. + + The inconvenient of this simple method is that it needs 6 bits to +code the slot, but it just uses 33 of the 64 possible values, wasting +almost half of the codes. + + The intelligent trick of LZMA is that it encodes the position of the +most significant bit set, along with the value of the next bit, in the +same 6 bits that would take to encode the position alone. This seems to +need 66 slots (2 * position + next_bit), but for slots 0 and 1 there is +no next bit, so the number of needed slots is 64 (0 to 63). + + The 6 bits representing this "slot number" are then context-coded. If +the distance is >= 4, the remaining bits are coded as follows. +'direct_bits' is the amount of remaining bits (from 0 to 30) needed to +form a complete distance, and is calculated as (slot >> 1) - 1. If a +distance needs 6 or more direct_bits, the last 4 bits are coded +separately. The last piece (all the direct_bits for distances 4 to 127 +or the last 4 bits for distances >= 128) is context-coded in reverse +order (from LSB to MSB). For distances >= 128, the 'direct_bits - 4' +part is coded with fixed 0.5 probability. + +Bit sequence Description +-------------------------------------------------------------------------- +slot distances from 0 to 3 +slot + direct_bits distances from 4 to 127 +slot + (direct_bits - 4) + 4 bits distances from 128 to 2^32 - 1 + + +6.2 The coding contexts +======================= + +These contexts ('Bit_model' in the source), are integers or arrays of +integers representing the probability of the corresponding bit being 0. + + The indices used in these arrays are: + +'state' + A state machine ('State' in the source) with 12 states (0 to 11), + coding the latest 2 to 4 types of sequences processed. The initial + state is 0. + +'pos_state' + Value of the 2 least significant bits of the current position in + the decoded data. + +'literal_state' + Value of the 3 most significant bits of the latest byte decoded. + +'len_state' + Coded value of length (length - 2), with a maximum of 3. The + resulting value is in the range 0 to 3. + + + In the following table, '!literal' is any sequence except a literal +byte. 'rep' is any one of 'rep0', 'rep1', 'rep2' or 'rep3'. The types +of previous sequences corresponding to each state are: + +State Types of previous sequences +-------------------------------------------------------- +0 literal, literal, literal +1 match, literal, literal +2 rep or (!literal, shortrep), literal, literal +3 literal, shortrep, literal, literal +4 match, literal +5 rep or (!literal, shortrep), literal +6 literal, shortrep, literal +7 literal, match +8 literal, rep +9 literal, shortrep +10 !literal, match +11 !literal, (rep or shortrep) + + + The contexts for decoding the type of coding sequence are: + +Name Indices Used when +--------------------------------------------------------------------------- +bm_match state, pos_state sequence start +bm_rep state after sequence 1 +bm_rep0 state after sequence 11 +bm_rep1 state after sequence 111 +bm_rep2 state after sequence 1111 +bm_len state, pos_state after sequence 110 + + + The contexts for decoding distances are: + +Name Indices Used when +--------------------------------------------------------------------------- +bm_dis_slot len_state, bit tree distance start +bm_dis reverse bit tree after slots 4 to 13 +bm_align reverse bit tree for distances >= 128, after + fixed probability bits + + + There are two separate sets of contexts for lengths ('Len_model' in +the source). One for normal matches, the other for repeated matches. The +contexts in each Len_model are (see 'decode_len' in the source): + +Name Indices Used when +--------------------------------------------------------------------------- +choice1 none length start +choice2 none after sequence 1 +bm_low pos_state, bit tree after sequence 0 +bm_mid pos_state, bit tree after sequence 10 +bm_high bit tree after sequence 11 + + + The context array 'bm_literal' is special. In principle it acts as a +normal bit tree context, the one selected by 'literal_state'. But if +the previous decoded byte was not a literal, two other bit tree +contexts are used depending on the value of each bit in 'match_byte' +(the byte at the latest used distance), until a bit is decoded that is +different from its corresponding bit in 'match_byte'. After the first +difference is found, the rest of the byte is decoded using the normal +bit tree context. (See 'decode_matched' in the source). + + +6.3 The range decoder +===================== + +The LZMA stream is consumed one byte at a time by the range decoder. +(See 'normalize' in the source). Every byte consumed produces a +variable number of decoded bits, depending on how well these bits agree +with their context. (See 'decode_bit' in the source). + + The range decoder state consists of two unsigned 32-bit variables; +'range' (representing the most significant part of the range size not +yet decoded), and 'code' (representing the current point within +'range'). 'range' is initialized to (2^32 - 1), and 'code' is +initialized to 0. + + The range encoder produces a first 0 byte that must be ignored by the +range decoder. This is done by shifting 5 bytes in the initialization of +'code' instead of 4. (See the 'Range_decoder' constructor in the +source). + + +6.4 Decoding the LZMA stream +============================ + +After decoding the member header and obtaining the dictionary size, the +range decoder is initialized and then the LZMA decoder enters a loop +(See 'decode_member' in the source) where it invokes the range decoder +with the appropriate contexts to decode the different coding sequences +(matches, repeated matches, and literal bytes), until the "End Of +Stream" marker is decoded. + + +File: clzip.info, Node: Trailing data, Next: Examples, Prev: Stream format, Up: Top -5 Extra data appended to the file +7 Extra data appended to the file ********************************* -Sometimes extra data is found appended to a lzip file after the last +Sometimes extra data are found appended to a lzip file after the last member. Such trailing data may be: * Padding added to make the file size a multiple of some block size, - for example when writing to a tape. - - * Garbage added by some not totally successful copy operation. + for example when writing to a tape. It is safe to append any + amount of padding zero bytes to a lzip file. * Useful data added by the user; a cryptographically secure hash, a - description of file contents, etc. + description of file contents, etc. It is safe to append any amount + of text to a lzip file as long as the text does not begin with the + string "LZIP", and does not contain any zero bytes (null + characters). Nonzero bytes and zero bytes can't be safely mixed in + trailing data. + + * Garbage added by some not totally successful copy operation. * Malicious data added to the file in order to make its total size and hash value (for a chosen hash) coincide with those of another @@ -481,15 +926,19 @@ member. Such trailing data may be: the corruption of the integrity information itself. Therefore it can be considered to be below the noise level. + Trailing data are in no way part of the lzip file format, but tools +reading lzip files are expected to behave as correctly and usefully as +possible in the presence of trailing data. + Trailing data can be safely ignored in most cases. In some cases, -like that of user-added data, it is expected to be ignored. In those +like that of user-added data, they are expected to be ignored. In those cases where a file containing trailing data must be rejected, the option '--trailing-error' can be used. *Note --trailing-error::.  File: clzip.info, Node: Examples, Next: Problems, Prev: Trailing data, Up: Top -6 A small tutorial with examples +8 A small tutorial with examples ******************************** WARNING! Even if clzip is bug-free, other causes may result in a corrupt @@ -530,8 +979,8 @@ Example 5: Compress a whole device in /dev/sdc and send the output to clzip -c /dev/sdc > file.lz -Example 6: The right way of concatenating compressed files. *Note -Trailing data::. +Example 6: The right way of concatenating the decompressed output of two +or more compressed files. *Note Trailing data::. Don't do this cat file1.lz file2.lz file3.lz | clzip -d @@ -569,9 +1018,9 @@ file with a member size of 32 MiB. clzip -b 32MiB -S 650MB big_db  -File: clzip.info, Node: Problems, Next: Concept index, Prev: Examples, Up: Top +File: clzip.info, Node: Problems, Next: Reference source code, Prev: Examples, Up: Top -7 Reporting bugs +9 Reporting bugs **************** There are probably bugs in clzip. There are certainly errors and @@ -584,7 +1033,477 @@ for all eternity, if not longer. by running 'clzip --version'.  -File: clzip.info, Node: Concept index, Prev: Problems, Up: Top +File: clzip.info, Node: Reference source code, Next: Concept index, Prev: Problems, Up: Top + +Appendix A Reference source code +******************************** + +/* Lzd - Educational decompressor for the lzip format + Copyright (C) 2013-2017 Antonio Diaz Diaz. + + This program is free software. Redistribution and use in source and + binary forms, with or without modification, are permitted provided + that the following conditions are met: + + 1. Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. + + 2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. +*/ +/* + Exit status: 0 for a normal exit, 1 for environmental problems + (file not found, invalid flags, I/O errors, etc), 2 to indicate a + corrupt or invalid input file. +*/ + +#include +#include +#include +#include +#include +#include +#include +#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER) +#include +#include +#endif + + +class State + { + int st; + +public: + enum { states = 12 }; + State() : st( 0 ) {} + int operator()() const { return st; } + bool is_char() const { return st < 7; } + + void set_char() + { + static const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 }; + st = next[st]; + } + void set_match() { st = ( st < 7 ) ? 7 : 10; } + void set_rep() { st = ( st < 7 ) ? 8 : 11; } + void set_short_rep() { st = ( st < 7 ) ? 9 : 11; } + }; + + +enum { + min_dictionary_size = 1 << 12, + max_dictionary_size = 1 << 29, + literal_context_bits = 3, + literal_pos_state_bits = 0, // not used + pos_state_bits = 2, + pos_states = 1 << pos_state_bits, + pos_state_mask = pos_states - 1, + + len_states = 4, + dis_slot_bits = 6, + start_dis_model = 4, + end_dis_model = 14, + modeled_distances = 1 << (end_dis_model / 2), // 128 + dis_align_bits = 4, + dis_align_size = 1 << dis_align_bits, + + len_low_bits = 3, + len_mid_bits = 3, + len_high_bits = 8, + len_low_symbols = 1 << len_low_bits, + len_mid_symbols = 1 << len_mid_bits, + len_high_symbols = 1 << len_high_bits, + max_len_symbols = len_low_symbols + len_mid_symbols + len_high_symbols, + + min_match_len = 2, // must be 2 + + bit_model_move_bits = 5, + bit_model_total_bits = 11, + bit_model_total = 1 << bit_model_total_bits }; + +struct Bit_model + { + int probability; + Bit_model() : probability( bit_model_total / 2 ) {} + }; + +struct Len_model + { + Bit_model choice1; + Bit_model choice2; + Bit_model bm_low[pos_states][len_low_symbols]; + Bit_model bm_mid[pos_states][len_mid_symbols]; + Bit_model bm_high[len_high_symbols]; + }; + + +class CRC32 + { + uint32_t data[256]; // Table of CRCs of all 8-bit messages. + +public: + CRC32() + { + for( unsigned n = 0; n < 256; ++n ) + { + unsigned c = n; + for( int k = 0; k < 8; ++k ) + { if( c & 1 ) c = 0xEDB88320U ^ ( c >> 1 ); else c >>= 1; } + data[n] = c; + } + } + + void update_buf( uint32_t & crc, const uint8_t * const buffer, + const int size ) const + { + for( int i = 0; i < size; ++i ) + crc = data[(crc^buffer[i])&0xFF] ^ ( crc >> 8 ); + } + }; + +const CRC32 crc32; + + +typedef uint8_t File_header[6]; // 0-3 magic, 4 version, 5 coded_dict_size + +typedef uint8_t File_trailer[20]; + // 0-3 CRC32 of the uncompressed data + // 4-11 size of the uncompressed data + // 12-19 member size including header and trailer + +class Range_decoder + { + uint32_t code; + uint32_t range; + +public: + Range_decoder() : code( 0 ), range( 0xFFFFFFFFU ) + { + for( int i = 0; i < 5; ++i ) code = (code << 8) | get_byte(); + } + + uint8_t get_byte() { return std::getc( stdin ); } + + unsigned decode( const int num_bits ) + { + unsigned symbol = 0; + for( int i = num_bits; i > 0; --i ) + { + range >>= 1; + symbol <<= 1; + if( code >= range ) { code -= range; symbol |= 1; } + if( range <= 0x00FFFFFFU ) // normalize + { range <<= 8; code = (code << 8) | get_byte(); } + } + return symbol; + } + + unsigned decode_bit( Bit_model & bm ) + { + unsigned symbol; + const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability; + if( code < bound ) + { + range = bound; + bm.probability += (bit_model_total - bm.probability) >> bit_model_move_bits; + symbol = 0; + } + else + { + range -= bound; + code -= bound; + bm.probability -= bm.probability >> bit_model_move_bits; + symbol = 1; + } + if( range <= 0x00FFFFFFU ) // normalize + { range <<= 8; code = (code << 8) | get_byte(); } + return symbol; + } + + unsigned decode_tree( Bit_model bm[], const int num_bits ) + { + unsigned symbol = 1; + for( int i = 0; i < num_bits; ++i ) + symbol = ( symbol << 1 ) | decode_bit( bm[symbol] ); + return symbol - (1 << num_bits); + } + + unsigned decode_tree_reversed( Bit_model bm[], const int num_bits ) + { + unsigned symbol = decode_tree( bm, num_bits ); + unsigned reversed_symbol = 0; + for( int i = 0; i < num_bits; ++i ) + { + reversed_symbol = ( reversed_symbol << 1 ) | ( symbol & 1 ); + symbol >>= 1; + } + return reversed_symbol; + } + + unsigned decode_matched( Bit_model bm[], const unsigned match_byte ) + { + unsigned symbol = 1; + for( int i = 7; i >= 0; --i ) + { + const unsigned match_bit = ( match_byte >> i ) & 1; + const unsigned bit = decode_bit( bm[symbol+(match_bit<<8)+0x100] ); + symbol = ( symbol << 1 ) | bit; + if( match_bit != bit ) + { + while( symbol < 0x100 ) + symbol = ( symbol << 1 ) | decode_bit( bm[symbol] ); + break; + } + } + return symbol & 0xFF; + } + + unsigned decode_len( Len_model & lm, const int pos_state ) + { + if( decode_bit( lm.choice1 ) == 0 ) + return decode_tree( lm.bm_low[pos_state], len_low_bits ); + if( decode_bit( lm.choice2 ) == 0 ) + return len_low_symbols + + decode_tree( lm.bm_mid[pos_state], len_mid_bits ); + return len_low_symbols + len_mid_symbols + + decode_tree( lm.bm_high, len_high_bits ); + } + }; + + +class LZ_decoder + { + unsigned long long partial_data_pos; + Range_decoder rdec; + const unsigned dictionary_size; + uint8_t * const buffer; // output buffer + unsigned pos; // current pos in buffer + unsigned stream_pos; // first byte not yet written to stdout + uint32_t crc_; + bool pos_wrapped; + + void flush_data(); + + uint8_t peek( const unsigned distance ) const + { + if( pos > distance ) return buffer[pos - distance - 1]; + if( pos_wrapped ) return buffer[dictionary_size + pos - distance - 1]; + return 0; // prev_byte of first byte + } + + void put_byte( const uint8_t b ) + { + buffer[pos] = b; + if( ++pos >= dictionary_size ) flush_data(); + } + +public: + explicit LZ_decoder( const unsigned dict_size ) + : + partial_data_pos( 0 ), + dictionary_size( dict_size ), + buffer( new uint8_t[dictionary_size] ), + pos( 0 ), + stream_pos( 0 ), + crc_( 0xFFFFFFFFU ), + pos_wrapped( false ) + {} + + ~LZ_decoder() { delete[] buffer; } + + unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; } + unsigned long long data_position() const { return partial_data_pos + pos; } + + bool decode_member(); + }; + + +void LZ_decoder::flush_data() + { + if( pos > stream_pos ) + { + const unsigned size = pos - stream_pos; + crc32.update_buf( crc_, buffer + stream_pos, size ); + errno = 0; + if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size ) + { std::fprintf( stderr, "Write error: %s\n", std::strerror( errno ) ); + std::exit( 1 ); } + if( pos >= dictionary_size ) + { partial_data_pos += pos; pos = 0; pos_wrapped = true; } + stream_pos = pos; + } + } + + +bool LZ_decoder::decode_member() // Returns false if error + { + Bit_model bm_literal[1<> ( 8 - literal_context_bits ); + Bit_model * const bm = bm_literal[literal_state]; + if( state.is_char() ) + put_byte( rdec.decode_tree( bm, 8 ) ); + else + put_byte( rdec.decode_matched( bm, peek( rep0 ) ) ); + state.set_char(); + } + else // match or repeated match + { + int len; + if( rdec.decode_bit( bm_rep[state()] ) != 0 ) // 2nd bit + { + if( rdec.decode_bit( bm_rep0[state()] ) == 0 ) // 3rd bit + { + if( rdec.decode_bit( bm_len[state()][pos_state] ) == 0 ) // 4th bit + { state.set_short_rep(); put_byte( peek( rep0 ) ); continue; } + } + else + { + unsigned distance; + if( rdec.decode_bit( bm_rep1[state()] ) == 0 ) // 4th bit + distance = rep1; + else + { + if( rdec.decode_bit( bm_rep2[state()] ) == 0 ) // 5th bit + distance = rep2; + else + { distance = rep3; rep3 = rep2; } + rep2 = rep1; + } + rep1 = rep0; + rep0 = distance; + } + state.set_rep(); + len = min_match_len + rdec.decode_len( rep_len_model, pos_state ); + } + else // match + { + rep3 = rep2; rep2 = rep1; rep1 = rep0; + len = min_match_len + rdec.decode_len( match_len_model, pos_state ); + const int len_state = std::min( len - min_match_len, len_states - 1 ); + rep0 = rdec.decode_tree( bm_dis_slot[len_state], dis_slot_bits ); + if( rep0 >= start_dis_model ) + { + const unsigned dis_slot = rep0; + const int direct_bits = ( dis_slot >> 1 ) - 1; + rep0 = ( 2 | ( dis_slot & 1 ) ) << direct_bits; + if( dis_slot < end_dis_model ) + rep0 += rdec.decode_tree_reversed( bm_dis + ( rep0 - dis_slot ), + direct_bits ); + else + { + rep0 += rdec.decode( direct_bits - dis_align_bits ) << dis_align_bits; + rep0 += rdec.decode_tree_reversed( bm_align, dis_align_bits ); + if( rep0 == 0xFFFFFFFFU ) // marker found + { + flush_data(); + return ( len == min_match_len ); // End Of Stream marker + } + } + } + state.set_match(); + if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) ) + { flush_data(); return false; } + } + for( int i = 0; i < len; ++i ) put_byte( peek( rep0 ) ); + } + } + flush_data(); + return false; + } + + +int main( const int argc, const char * const argv[] ) + { + if( argc > 1 ) + { + std::printf( "Lzd %s - Educational decompressor for the lzip format.\n", + PROGVERSION ); + std::printf( "Study the source to learn how a lzip decompressor works.\n" + "See the lzip manual for an explanation of the code.\n" + "It is not safe to use lzd for any real work.\n" + "\nUsage: %s < file.lz > file\n", argv[0] ); + std::printf( "Lzd decompresses from standard input to standard output.\n" + "\nCopyright (C) 2017 Antonio Diaz Diaz.\n" + "This is free software: you are free to change and redistribute it.\n" + "There is NO WARRANTY, to the extent permitted by law.\n" + "Report bugs to lzip-bug@nongnu.org\n" + "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n" ); + return 0; + } + +#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER) + setmode( fileno( stdin ), O_BINARY ); + setmode( fileno( stdout ), O_BINARY ); +#endif + + for( bool first_member = true; ; first_member = false ) + { + File_header header; // verify header + for( int i = 0; i < 6; ++i ) header[i] = std::getc( stdin ); + if( std::feof( stdin ) || std::memcmp( header, "LZIP\x01", 5 ) != 0 ) + { + if( first_member ) + { std::fputs( "Bad magic number (file not in lzip format).\n", stderr ); + return 2; } + break; + } + unsigned dict_size = 1 << ( header[5] & 0x1F ); + dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 ); + if( dict_size < min_dictionary_size || dict_size > max_dictionary_size ) + { std::fputs( "Invalid dictionary size in member header.\n", stderr ); + return 2; } + + LZ_decoder decoder( dict_size ); // decode LZMA stream + if( !decoder.decode_member() ) + { std::fputs( "Data error\n", stderr ); return 2; } + + File_trailer trailer; // verify trailer + for( int i = 0; i < 20; ++i ) trailer[i] = std::getc( stdin ); + unsigned crc = 0; + for( int i = 3; i >= 0; --i ) { crc <<= 8; crc += trailer[i]; } + unsigned long long data_size = 0; + for( int i = 11; i >= 4; --i ) { data_size <<= 8; data_size += trailer[i]; } + if( crc != decoder.crc() || data_size != decoder.data_position() ) + { std::fputs( "CRC error\n", stderr ); return 2; } + } + + if( std::fclose( stdout ) != 0 ) + { std::fprintf( stderr, "Can't close stdout: %s\n", std::strerror( errno ) ); + return 1; } + return 0; + } + + +File: clzip.info, Node: Concept index, Prev: Reference source code, Up: Top Concept index ************* @@ -596,10 +1515,13 @@ Concept index * bugs: Problems. (line 6) * examples: Examples. (line 6) * file format: File format. (line 6) +* format of the LZMA stream: Stream format. (line 6) * getting help: Problems. (line 6) * introduction: Introduction. (line 6) * invoking: Invoking clzip. (line 6) * options: Invoking clzip. (line 6) +* quality assurance: Quality assurance. (line 6) +* reference source code: Reference source code. (line 6) * trailing data: Trailing data. (line 6) * usage: Invoking clzip. (line 6) * version: Invoking clzip. (line 6) @@ -608,16 +1530,19 @@ Concept index  Tag Table: Node: Top210 -Node: Introduction952 -Node: Invoking clzip6164 -Ref: --trailing-error6730 -Node: File format12728 -Node: Algorithm15150 -Node: Trailing data17980 -Node: Examples19355 -Ref: concat-example20537 -Node: Problems21544 -Node: Concept index22070 +Node: Introduction1154 +Node: Invoking clzip6630 +Ref: --trailing-error7202 +Node: Quality assurance14125 +Node: File format22281 +Node: Algorithm24686 +Node: Stream format27516 +Node: Trailing data38257 +Node: Examples40159 +Ref: concat-example41341 +Node: Problems42386 +Node: Reference source code42920 +Node: Concept index57238  End Tag Table -- cgit v1.2.3