Diffstat (limited to 'doc/lzip.info')
-rw-r--r-- | doc/lzip.info | 506
1 file changed, 259 insertions, 247 deletions
diff --git a/doc/lzip.info b/doc/lzip.info index 6854503..f0aa011 100644 --- a/doc/lzip.info +++ b/doc/lzip.info @@ -11,16 +11,16 @@ File: lzip.info, Node: Top, Next: Introduction, Up: (dir) Lzip Manual *********** -This manual is for Lzip (version 1.17-rc2, 25 May 2015). +This manual is for Lzip (version 1.17, 12 July 2015). * Menu: * Introduction:: Purpose and features of lzip -* Algorithm:: How lzip compresses the data * Invoking lzip:: Command line interface +* Quality assurance:: Design, development and testing of lzip * File format:: Detailed format of the compressed file +* Algorithm:: How lzip compresses the data * Stream format:: Format of the LZMA stream in lzip files -* Quality assurance:: Design, development and testing of lzip * Examples:: A small tutorial with examples * Problems:: Reporting bugs * Reference source code:: Source code illustrating stream format @@ -33,7 +33,7 @@ This manual is for Lzip (version 1.17-rc2, 25 May 2015). copy, distribute and modify it. -File: lzip.info, Node: Introduction, Next: Algorithm, Prev: Top, Up: Top +File: lzip.info, Node: Introduction, Next: Invoking lzip, Prev: Top, Up: Top 1 Introduction ************** @@ -51,7 +51,8 @@ availability: recovery means. The lziprecover program can repair bit-flip errors (one of the most common forms of data corruption) in lzip files, and provides data recovery capabilities, including error-checked - merging of damaged copies of a file. + merging of damaged copies of a file. *note Data safety: + (lziprecover)Data safety. * The lzip format is as simple as possible (but not simpler). The lzip manual provides the code of a simple decompressor along with @@ -85,6 +86,11 @@ which makes it safer than compressors returning ambiguous warning values (like gzip) when it is used as a back end for other programs like tar or zutils. + Lzip will automatically use the smallest possible dictionary size for +each file without exceeding the given limit. Keep in mind that the +decompression memory requirement is affected at compression time by the +choice of dictionary size limit. + The amount of memory required for compression is about 1 or 2 times the dictionary size limit (1 if input file size is less than dictionary size limit, else 2) plus 9 times the dictionary size really used. The @@ -92,11 +98,6 @@ option '-0' is special and only requires about 1.5 MiB at most. The amount of memory required for decompression is about 46 kB larger than the dictionary size really used. - Lzip will automatically use the smallest possible dictionary size for -each file without exceeding the given limit. Keep in mind that the -decompression memory requirement is affected at compression time by the -choice of dictionary size limit. - When compressing, lzip replaces every file given in the command line with a compressed version of itself, with the name "original_name.lz". When decompressing, lzip attempts to guess the name for the decompressed @@ -126,8 +127,8 @@ corresponding uncompressed files. Integrity testing of concatenated compressed files is also supported. Lzip can produce multi-member files and safely recover, with -lziprecover, the undamaged members in case of file damage. Lzip can also -split the compressed output in volumes of a given size, even when +lziprecover, the undamaged members in case of file damage. Lzip can +also split the compressed output in volumes of a given size, even when reading from standard input. This allows the direct creation of multivolume compressed tar archives. 
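As an aside to the memory figures quoted above, here is a minimal, self-contained C++ sketch of the same arithmetic. The function names and the example sizes are illustrative only; nothing below comes from the lzip sources.

#include <cstdio>

// Rough estimates based on the figures quoted above. 'dict_limit' is the
// dictionary size limit given at compression time; 'dict_used' is the
// dictionary size really used (never larger than the limit).
unsigned long long compression_memory( const unsigned long long file_size,
                                       const unsigned long long dict_limit,
                                       const unsigned long long dict_used )
  {
  const int times = ( file_size < dict_limit ) ? 1 : 2;  // 1 or 2 times the limit
  return times * dict_limit + 9 * dict_used;             // plus 9 times the size used
  }

unsigned long long decompression_memory( const unsigned long long dict_used )
  { return dict_used + 46 * 1000; }                      // about 46 kB larger

int main()
  {
  const unsigned long long MiB = 1ULL << 20;
  // Example: a 100 MiB file compressed with an 8 MiB dictionary limit.
  std::printf( "compression:   about %llu MiB\n",
               compression_memory( 100 * MiB, 8 * MiB, 8 * MiB ) / MiB );
  std::printf( "decompression: about %llu MiB\n",
               decompression_memory( 8 * MiB ) / MiB );
  return 0;
  }

For a 100 MiB file compressed with an 8 MiB dictionary this prints about 88 MiB for compression and about 8 MiB for decompression.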
@@ -136,75 +137,9 @@ automatically creating multi-member output. The members so created are large, about 2 PiB each. -File: lzip.info, Node: Algorithm, Next: Invoking lzip, Prev: Introduction, Up: Top - -2 Algorithm -*********** - -In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a -concrete algorithm; it is more like "any algorithm using the LZMA coding -scheme". For example, the option '-0' of lzip uses the scheme in almost -the simplest way possible; issuing the longest match it can find, or a -literal byte if it can't find a match. Inversely, a much more elaborated -way of finding coding sequences of minimum size than the one currently -used by lzip could be developed, and the resulting sequence could also -be coded using the LZMA coding scheme. - - Lzip currently implements two variants of the LZMA algorithm; fast -(used by option -0) and normal (used by all other compression levels). - - The high compression of LZMA comes from combining two basic, -well-proven compression ideas: sliding dictionaries (LZ77/78) and -markov models (the thing used by every compression algorithm that uses -a range encoder or similar order-0 entropy coder as its last stage) -with segregation of contexts according to what the bits are used for. - - Lzip is a two stage compressor. The first stage is a Lempel-Ziv -coder, which reduces redundancy by translating chunks of data to their -corresponding distance-length pairs. The second stage is a range encoder -that uses a different probability model for each type of data; -distances, lengths, literal bytes, etc. - - Here is how it works, step by step: - - 1) The member header is written to the output stream. - - 2) The first byte is coded literally, because there are no previous -bytes to which the match finder can refer to. - - 3) The main encoder advances to the next byte in the input data and -calls the match finder. - - 4) The match finder fills an array with the minimum distances before -the current byte where a match of a given length can be found. - - 5) Go back to step 3 until a sequence (formed of pairs, repeated -distances and literal bytes) of minimum price has been formed. Where the -price represents the number of output bits produced. - - 6) The range encoder encodes the sequence produced by the main -encoder and sends the produced bytes to the output stream. - - 7) Go back to step 3 until the input data are finished or until the -member or volume size limits are reached. - - 8) The range encoder is flushed. - - 9) The member trailer is written to the output stream. - - 10) If there are more data to compress, go back to step 1. - - -The ideas embodied in lzip are due to (at least) the following people: -Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for -the definition of Markov chains), G.N.N. Martin (for the definition of -range encoding), Igor Pavlov (for putting all the above together in -LZMA), and Julian Seward (for bzip2's CLI). - - -File: lzip.info, Node: Invoking lzip, Next: File format, Prev: Algorithm, Up: Top +File: lzip.info, Node: Invoking lzip, Next: Quality assurance, Prev: Introduction, Up: Top -3 Invoking lzip +2 Invoking lzip *************** The format for running lzip is: @@ -244,7 +179,7 @@ The format for running lzip is: '-F' '--recompress' - Force recompression of files whose name already has the '.lz' or + Force re-compression of files whose name already has the '.lz' or '.tlz' suffix. 
'-k' @@ -362,7 +297,155 @@ invalid input file, 3 for an internal consistency error (eg, bug) which caused lzip to panic. -File: lzip.info, Node: File format, Next: Stream format, Prev: Invoking lzip, Up: Top +File: lzip.info, Node: Quality assurance, Next: File format, Prev: Invoking lzip, Up: Top + +3 Design, development and testing of lzip +***************************************** + +There are two ways of constructing a software design. One way is to make +it so simple that there are obviously no deficiencies and the other is +to make it so complicated that there are no obvious deficiencies. +-- C.A.R. Hoare + + Lzip has been designed, written and tested with great care to be the +standard general-purpose compressor for unix-like systems. This chapter +describes the lessons learned from previous compressors (gzip and +bzip2), and their application to the design of lzip. + + +3.1 Format design +================= + +When gzip was designed in 1992, computers and operating systems were +much less capable than they are today. Gzip tried to work around some of +those limitations, like 8.3 file names, with additional fields in its +file format. + + Today those limitations have mostly disappeared, and the format of +gzip has proved to be unnecessarily complicated. It includes fields +that were never used, others that have lost its usefulness, and finally +others that have become too limited. + + Bzip2 was designed 5 years later, and its format is simpler than the +one of gzip. + + Probably the worst defect of the gzip format from the point of view +of data safety is the variable size of its header. If the byte at +offset 3 (flags) of a gzip member gets corrupted, it mat become very +difficult to recover the data, even if the compressed blocks are +intact, because it can't be known with certainty where the compressed +blocks begin. + + By contrast, the header of a lzip member has a fixed length of 6. The +lzma stream in a lzip member always starts at offset 6, making it +trivial to recover the data even if the whole header becomes corrupt. + + Bzip2 also provides a header of fixed length and marks the begin and +end of each compressed block with six magic bytes, making it possible to +find the compressed blocks even in case of file damage. But bzip2 does +not store the size of each compressed block, as lzip does. + + Lzip provides better data recovery capabilities than any other +gzip-like compressor because its format has been designed from the +beginning to be simple and safe. It would be very difficult to write an +automatic recovery tool like lziprecover for the gzip format. And, as +far as I know, it has never been writen. + + The lzip format is designed for long-term archiving. Therefore it +excludes any unneeded features that may interfere with the future +extraction of the uncompressed data. + + +3.1.1 Gzip format (mis)features not present in lzip +--------------------------------------------------- + +'Multiple algorithms' + Gzip provides a CM (Compression Method) field that has never been + used because it is a bad idea to begin with. New compression + methods may require additional fields, making it impossible to + implement new methods and, at the same time, keep the same format. + This field does not solve the problem of format proliferation; it + just makes the problem less obvious. + +'Optional fields in header' + Unless special precautions are taken, optional fields are + generally a bad idea because they produce a header of variable + size. 
The gzip header has 2 fields that, in addition to being + optional, are zero-terminated. This means that if any byte inside + the field gets zeroed, or if the terminating zero gets altered, + gzip won't be able to find neither the header CRC nor the + compressed blocks. + +'Optional CRC for the header' + Using an optional checksum for the header is not only a bad idea, + it is an error; it may prevent the extraction of perfectly good + data. For example, if the checksum is used and the bit enabling it + is reset by a bit-flip, the header will appear to be intact (in + spite of being corrupt) while the compressed blocks will appear to + be totally unrecoverable (in spite of being intact). Very + misleading indeed. + + +3.1.2 Lzip format improvements over gzip and bzip2 +-------------------------------------------------- + +'64-bit size field' + Probably the most frequently reported shortcoming of the gzip + format is that it only stores the least significant 32 bits of the + uncompressed size. The size of any file larger than 4 GiB gets + truncated. + + Bzip2 does not store the uncompressed size of the file. + + The lzip format provides a 64-bit field for the uncompressed size. + Additionaly, lzip produces multi-member output automatically when + the size is too large for a single member, allowing for an + unlimited uncompressed size. + +'Distributed index' + The lzip format provides a distributed index that, among other + things, helps plzip to decompress several times faster than pigz + and helps lziprecover do its job. Neither the gzip format nor the + bzip2 format do provide an index. + + A distributed index is safer and more scalable than a monolithic + index. The monolithic index introduces a single point of failure + in the compressed file and may limit the number of members or the + total uncompressed size. + + +3.2 Quality of implementation +============================= + +'Multiple implementations' + Just like the lzip format provides 4 factor protection against + undetected data corruption, the development methodology of the lzip + family of compressors provides 3 factor protection against + undetected programming errors. + + Three related but independent compressor implementations, lzip, + clzip and minilzip/lzlib, are developed concurrently. Every stable + release of any of them is subjected to a hundred hours of + intensive testing to verify that it produces identical output to + the other two. This guarantees that all three implement the same + algorithm, and makes it unlikely that any of them may contain + serious undiscovered errors. In fact, no errors have been + discovered in lzip since 2009. + +'Dictionary size' + Lzip automatically uses the smallest possible dictionary size for + each file. In addition to reducing the amount of memory required + for decompression, this feature also minimizes the probability of + being affected by RAM errors during compression. + +'Exit status' + Returning a warning status of 2 is a design flaw of compress that + leaked into the design of gzip. Both bzip2 and lzip are free from + this flaw. + + + +File: lzip.info, Node: File format, Next: Algorithm, Prev: Quality assurance, Up: Top 4 File format ************* @@ -433,9 +516,75 @@ additional information before, between, or after them. 
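To make the "fixed length of 6" and the 64-bit size field discussed above concrete, here is a small self-contained sketch. The dictionary-size and data-size decoding mirror what lzd does in appendix A; the sample header and trailer bytes are made up for the example.

#include <cstdio>
#include <cstring>
#include <stdint.h>

// Decode the dictionary size from byte 5 of a member header, as lzd does
// in appendix A: a power of 2 reduced by 0 to 7 sixteenths.
unsigned decode_dict_size( const uint8_t coded )
  {
  unsigned dict_size = 1 << ( coded & 0x1F );
  dict_size -= ( dict_size / 16 ) * ( ( coded >> 5 ) & 7 );
  return dict_size;
  }

// The 20-byte member trailer stores CRC32 (bytes 0-3), data size (bytes
// 4-11) and member size (bytes 12-19), all in little-endian order.
unsigned long long trailer_data_size( const uint8_t trailer[20] )
  {
  unsigned long long data_size = 0;
  for( int i = 11; i >= 4; --i ) { data_size <<= 8; data_size += trailer[i]; }
  return data_size;
  }

int main()
  {
  // Made-up header: magic "LZIP", version 1, coded dictionary size 0x14.
  const uint8_t header[6] = { 'L', 'Z', 'I', 'P', 1, 0x14 };
  if( std::memcmp( header, "LZIP\x01", 5 ) != 0 )
    { std::fputs( "Bad magic number (not a lzip member)\n", stderr ); return 2; }
  std::printf( "dictionary size = %u bytes\n", decode_dict_size( header[5] ) );

  uint8_t trailer[20] = { 0 };                   // made-up trailer
  trailer[4] = 0x39; trailer[5] = 0x30;          // data size = 0x3039 = 12345
  std::printf( "uncompressed size = %llu bytes\n", trailer_data_size( trailer ) );
  return 0;
  }

Because the header length and the field offsets never change, a recovery tool only needs these fixed offsets; no parsing of optional fields is required.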
-File: lzip.info, Node: Stream format, Next: Quality assurance, Prev: File format, Up: Top +File: lzip.info, Node: Algorithm, Next: Stream format, Prev: File format, Up: Top + +5 Algorithm +*********** + +In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a +concrete algorithm; it is more like "any algorithm using the LZMA coding +scheme". For example, the option '-0' of lzip uses the scheme in almost +the simplest way possible; issuing the longest match it can find, or a +literal byte if it can't find a match. Inversely, a much more elaborated +way of finding coding sequences of minimum size than the one currently +used by lzip could be developed, and the resulting sequence could also +be coded using the LZMA coding scheme. + + Lzip currently implements two variants of the LZMA algorithm; fast +(used by option '-0') and normal (used by all other compression levels). + + The high compression of LZMA comes from combining two basic, +well-proven compression ideas: sliding dictionaries (LZ77/78) and +markov models (the thing used by every compression algorithm that uses +a range encoder or similar order-0 entropy coder as its last stage) +with segregation of contexts according to what the bits are used for. + + Lzip is a two stage compressor. The first stage is a Lempel-Ziv +coder, which reduces redundancy by translating chunks of data to their +corresponding distance-length pairs. The second stage is a range encoder +that uses a different probability model for each type of data; +distances, lengths, literal bytes, etc. + + Here is how it works, step by step: + + 1) The member header is written to the output stream. + + 2) The first byte is coded literally, because there are no previous +bytes to which the match finder can refer to. + + 3) The main encoder advances to the next byte in the input data and +calls the match finder. + + 4) The match finder fills an array with the minimum distances before +the current byte where a match of a given length can be found. + + 5) Go back to step 3 until a sequence (formed of pairs, repeated +distances and literal bytes) of minimum price has been formed. Where the +price represents the number of output bits produced. + + 6) The range encoder encodes the sequence produced by the main +encoder and sends the produced bytes to the output stream. + + 7) Go back to step 3 until the input data are finished or until the +member or volume size limits are reached. + + 8) The range encoder is flushed. + + 9) The member trailer is written to the output stream. + + 10) If there are more data to compress, go back to step 1. + + +The ideas embodied in lzip are due to (at least) the following people: +Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for +the definition of Markov chains), G.N.N. Martin (for the definition of +range encoding), Igor Pavlov (for putting all the above together in +LZMA), and Julian Seward (for bzip2's CLI). + + +File: lzip.info, Node: Stream format, Next: Examples, Prev: Algorithm, Up: Top -5 Format of the LZMA stream in lzip files +6 Format of the LZMA stream in lzip files ***************************************** The LZMA algorithm has three parameters, called "special LZMA @@ -473,7 +622,7 @@ the lzip download directory. The source code of lzd is included in appendix A. 
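The first stage described in this chapter can be illustrated with a toy greedy Lempel-Ziv parser. The sketch below is not lzip's match finder and does no price optimization at all; it only shows how input data are translated into literal bytes and distance-length pairs.

#include <cstdio>
#include <string>

// Toy greedy Lempel-Ziv parser. At each position it emits either a literal
// byte or the longest (distance, length) pair pointing into the data already
// produced (overlapping matches allowed, as in LZ77).
void lz_parse( const std::string & data )
  {
  const unsigned min_match_len = 2;
  unsigned pos = 0;
  while( pos < data.size() )
    {
    unsigned best_len = 0, best_dist = 0;
    for( unsigned j = 0; j < pos; ++j )          // search the data already seen
      {
      unsigned len = 0;
      while( pos + len < data.size() && data[j+len] == data[pos+len] ) ++len;
      if( len > best_len ) { best_len = len; best_dist = pos - j; }
      }
    if( best_len >= min_match_len )
      { std::printf( "match  (distance %u, length %u)\n", best_dist, best_len );
        pos += best_len; }
    else
      { std::printf( "literal '%c'\n", data[pos] ); ++pos; }
    }
  }

int main()
  {
  lz_parse( "abcabcabcd" );
  return 0;
  }

Run on the string "abcabcabcd" it emits three literals, one overlapping match (distance 3, length 6) and a final literal; lzip's normal variant would then choose, among such sequences, the one of minimum price before handing it to the range encoder.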
*note Reference source code:: -5.1 What is coded +6.1 What is coded ================= The LZMA stream includes literals, matches and repeated matches (matches @@ -525,7 +674,7 @@ slot + direct_bits distances from 4 to 127 slot + (direct_bits - 4) + 4 bits distances from 128 to 2^32 - 1 -5.2 The coding contexts +6.2 The coding contexts ======================= These contexts ('Bit_model' in the source), are integers or arrays of @@ -615,7 +764,7 @@ difference is found, the rest of the byte is decoded using the normal bit tree context. (See 'decode_matched' in the source). -5.3 The range decoder +6.3 The range decoder ===================== The LZMA stream is consumed one byte at a time by the range decoder. @@ -635,7 +784,7 @@ range decoder. This is done by shifting 5 bytes in the initialization of source). -5.4 Decoding the LZMA stream +6.4 Decoding the LZMA stream ============================ After decoding the member header and obtaining the dictionary size, the @@ -646,144 +795,7 @@ with the appropriate contexts to decode the different coding sequences Stream" marker is decoded. -File: lzip.info, Node: Quality assurance, Next: Examples, Prev: Stream format, Up: Top - -6 Design, development and testing of lzip -***************************************** - -There are two ways of constructing a software design. One way is to make -it so simple that there are obviously no deficiencies and the other is -to make it so complicated that there are no obvious deficiencies. --- C.A.R. Hoare - - Lzip has been designed, written and tested with great care to be the -standard general-purpose compressor for unix-like systems. This chapter -describes the lessons learned from previous compressors (gzip and -bzip2), and their application to the design of lzip. - - -6.1 Format design -================= - -When gzip was designed in 1992, computers and operating systems were -much less capable than they are today. Gzip tried to work around some of -those limitations, like 8.3 file names, with additional fields in its -file format. - - Today those limitations have mostly disappeared, and the format of -gzip has proved to be unnecessarily complicated. It includes fields -that were never used, others that have lost its usefulness, and finally -others that have become too limited. - - Bzip2 was designed 5 years later, and its format is in some aspects -simpler than the one of gzip. But bzip2 also shows complexities in its -file format which slow down decompression and, in retrospect, are -unnecessary. - - Probably the worst defect of the gzip format from the point of view -of data safety is the variable size of its header. If the byte at -offset 3 (flags) of a gzip member gets corrupted, it mat become very -difficult to recover the data, even if the compressed blocks are -intact, because it can't be known with certainty where the compressed -blocks begin. - - By contrast, the lzma stream in a lzip member always starts at -offset 6, making it trivial to recover the data even if the whole -header becomes corrupt. - - Lzip provides better data recovery capabilities than any other -gzip-like compressor because its format has been designed from the -beginning to be simple and safe. It would be very difficult to write an -automatic recovery tool like lziprecover for the gzip format. And, as -far as I know, it has never been writen. - - The lzip format is designed for long-term archiving. Therefore it -excludes any unneeded features that may interfere with the future -extraction of the uncompressed data. 
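Coming back to the coding contexts of section 6.2, the sketch below shows how a single 'Bit_model' probability adapts as bits are coded. The constants and the update rule are the ones used by the reference decoder in appendix A; the bit sequence itself is just an example.

#include <cstdio>

// Constants and update rule as in the reference decoder (appendix A).
enum { bit_model_move_bits = 5,
       bit_model_total_bits = 11,
       bit_model_total = 1 << bit_model_total_bits };

struct Bit_model
  {
  int probability;                               // probability of the bit being 0
  Bit_model() : probability( bit_model_total / 2 ) {}

  void update( const int bit )                   // adapt after coding 'bit'
    {
    if( bit == 0 )
      probability += ( bit_model_total - probability ) >> bit_model_move_bits;
    else
      probability -= probability >> bit_model_move_bits;
    }
  };

int main()
  {
  Bit_model bm;
  const int bits[12] = { 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0 };
  for( int i = 0; i < 12; ++i )
    {
    bm.update( bits[i] );
    std::printf( "after bit %d: P(0) = %4d / %d\n",
                 bits[i], bm.probability, bit_model_total );
    }
  return 0;
  }

Each zero moves the probability of 0 a thirty-second of the way towards certainty, and a single 1 pulls it back, which is how the model tracks local statistics without storing them explicitly.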
- - -6.1.1 Gzip format (mis)features not present in lzip ---------------------------------------------------- - -'Multiple algorithms' - Gzip provides a CM (Compression Method) field that has never been - used because it is a bad idea to begin with. New compression - methods may require additional fields, making it impossible to - implement new methods and, at the same time, keep the same format. - This field does not solve the problem of format proliferation; it - just makes the problem less obvious. - -'Optional fields in header' - Unless special precautions are taken, optional fields are - generally a bad idea because they produce a header of variable - size. The gzip header has 2 fields that, in addition to being - optional, are zero-terminated. This means that if any byte inside - the field gets zeroed, or if the terminating zero gets altered, - gzip won't be able to find neither the header CRC nor the - compressed blocks. - - Using an optional checksum for the header is not only a bad idea, - it is an error; it may prevent the extraction of perfectly good - data. For example, if the checksum is used and the bit enabling it - is reset by a bit-flip, the header will appear to be intact (in - spite of being corrupt) while the compressed blocks will appear to - be totally unrecoverable (in spite of being intact). Very - misleading indeed. - - -6.1.2 Lzip format improvements over gzip ----------------------------------------- - -'64-bit size field' - Probably the most frequently reported shortcoming of the gzip - format is that it only stores the least significant 32 bits of the - uncompressed size. The size of any file larger than 4 GiB gets - truncated. - - The lzip format provides a 64-bit field for the uncompressed size. - Additionaly, lzip produces multi-member output automatically when - the size is too large for a single member, allowing an unlimited - uncompressed size. - -'Distributed index' - The lzip format provides a distributed index that, among other - things, helps plzip to decompress several times faster than pigz - and helps lziprecover do its job. The gzip format does not provide - an index. - - A distributed index is safer and more scalable than a monolithic - index. The monolithic index introduces a single point of failure - in the compressed file and may limit the number of members or the - total uncompressed size. - - -6.2 Quality of implementation -============================= - -Three related but independent compressor implementations, lzip, clzip -and minilzip/lzlib, are developed concurrently. Every stable release of -any of them is subjected to a hundred hours of intensive testing to -verify that it produces identical output to the other two. This -guarantees that all three implement the same algorithm, and makes it -unlikely that any of them may contain serious undiscovered errors. In -fact, no errors have been discovered in lzip since 2009. - - Just like the lzip format provides 4 factor protection against -undetected data corruption, the development methodology described above -provides 3 factor protection against undetected programming errors in -lzip. - - Lzip automatically uses the smallest possible dictionary size for -each file. In addition to reducing the amount of memory required for -decompression, this feature also minimizes the probability of being -affected by RAM errors during compression. - - Returning a warning status of 2 is a design flaw of compress that -leaked into the design of gzip. Both bzip2 and lzip are free form this -flaw. 
- - -File: lzip.info, Node: Examples, Next: Problems, Prev: Quality assurance, Up: Top +File: lzip.info, Node: Examples, Next: Problems, Prev: Stream format, Up: Top 7 A small tutorial with examples ******************************** @@ -876,7 +888,7 @@ File: lzip.info, Node: Reference source code, Next: Concept index, Prev: Prob Appendix A Reference source code ******************************** -/* Lzd - Educational decompressor for lzip files +/* Lzd - Educational decompressor for the lzip format Copyright (C) 2013-2015 Antonio Diaz Diaz. This program is free software: you have unlimited permission @@ -1133,7 +1145,7 @@ class LZ_decoder } public: - LZ_decoder( const unsigned dict_size ) + explicit LZ_decoder( const unsigned dict_size ) : partial_data_pos( 0 ), dictionary_size( dict_size ), @@ -1160,7 +1172,7 @@ void LZ_decoder::flush_data() crc32.update_buf( crc_, buffer + stream_pos, size ); errno = 0; if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size ) - { std::fprintf( stderr, "Write error: %s.\n", std::strerror( errno ) ); + { std::fprintf( stderr, "Write error: %s\n", std::strerror( errno ) ); std::exit( 1 ); } if( pos >= dictionary_size ) { partial_data_pos += pos; pos = 0; } stream_pos = pos; @@ -1202,7 +1214,7 @@ bool LZ_decoder::decode_member() // Returns false if error put_byte( rdec.decode_matched( bm, peek( rep0 ) ) ); state.set_char(); } - else + else // match or repeated match { int len; if( rdec.decode_bit( bm_rep[state()] ) != 0 ) // 2nd bit @@ -1231,7 +1243,7 @@ bool LZ_decoder::decode_member() // Returns false if error state.set_rep(); len = min_match_len + rdec.decode_len( rep_len_model, pos_state ); } - else + else // match { rep3 = rep2; rep2 = rep1; rep1 = rep0; len = min_match_len + rdec.decode_len( match_len_model, pos_state ); @@ -1273,7 +1285,7 @@ int main( const int argc, const char * const argv[] ) { if( argc > 1 ) { - std::printf( "Lzd %s - Educational decompressor for lzip files.\n", + std::printf( "Lzd %s - Educational decompressor for the lzip format.\n", PROGVERSION ); std::printf( "Study the source to learn how a lzip decompressor works.\n" "See the lzip manual for an explanation of the code.\n" @@ -1300,19 +1312,19 @@ int main( const int argc, const char * const argv[] ) if( std::feof( stdin ) || std::memcmp( header, "LZIP\x01", 5 ) != 0 ) { if( first_member ) - { std::fprintf( stderr, "Bad magic number (file not in lzip format)\n" ); + { std::fputs( "Bad magic number (file not in lzip format).\n", stderr ); return 2; } break; } unsigned dict_size = 1 << ( header[5] & 0x1F ); dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 ); if( dict_size < min_dictionary_size || dict_size > max_dictionary_size ) - { std::fprintf( stderr, "Invalid dictionary size in member header\n" ); + { std::fputs( "Invalid dictionary size in member header.\n", stderr ); return 2; } LZ_decoder decoder( dict_size ); if( !decoder.decode_member() ) - { std::fprintf( stderr, "Data error\n" ); return 2; } + { std::fputs( "Data error\n", stderr ); return 2; } File_trailer trailer; for( int i = 0; i < 20; ++i ) trailer[i] = std::getc( stdin ); @@ -1321,11 +1333,11 @@ int main( const int argc, const char * const argv[] ) unsigned long long data_size = 0; for( int i = 11; i >= 4; --i ) { data_size <<= 8; data_size += trailer[i]; } if( crc != decoder.crc() || data_size != decoder.data_position() ) - { std::fprintf( stderr, "CRC error\n" ); return 2; } + { std::fputs( "CRC error\n", stderr ); return 2; } } if( std::fclose( stdout ) != 0 ) - { std::fprintf( stderr, "Can't 
close stdout: %s.\n", std::strerror( errno ) ); + { std::fprintf( stderr, "Can't close stdout: %s\n", std::strerror( errno ) ); return 1; } return 0; } @@ -1357,16 +1369,16 @@ Concept index Tag Table: Node: Top208 -Node: Introduction1090 -Node: Algorithm6008 -Node: Invoking lzip8833 -Node: File format14421 -Node: Stream format16806 -Node: Quality assurance26247 -Node: Examples32269 -Node: Problems34230 -Node: Reference source code34760 -Node: Concept index48358 +Node: Introduction1087 +Node: Invoking lzip6060 +Node: Quality assurance11658 +Node: File format18171 +Node: Algorithm20556 +Node: Stream format23382 +Node: Examples32812 +Node: Problems34769 +Node: Reference source code35299 +Node: Concept index48952 End Tag Table |