diff options
Diffstat (limited to 'doc/lzip.texi')
-rw-r--r-- | doc/lzip.texi | 475 |
1 files changed, 248 insertions, 227 deletions
diff --git a/doc/lzip.texi b/doc/lzip.texi index ac44ee9..69f44ae 100644 --- a/doc/lzip.texi +++ b/doc/lzip.texi @@ -6,8 +6,8 @@ @finalout @c %**end of header -@set UPDATED 25 May 2015 -@set VERSION 1.17-rc2 +@set UPDATED 12 July 2015 +@set VERSION 1.17 @dircategory Data Compression @direntry @@ -36,11 +36,11 @@ This manual is for Lzip (version @value{VERSION}, @value{UPDATED}). @menu * Introduction:: Purpose and features of lzip -* Algorithm:: How lzip compresses the data * Invoking lzip:: Command line interface +* Quality assurance:: Design, development and testing of lzip * File format:: Detailed format of the compressed file +* Algorithm:: How lzip compresses the data * Stream format:: Format of the LZMA stream in lzip files -* Quality assurance:: Design, development and testing of lzip * Examples:: A small tutorial with examples * Problems:: Reporting bugs * Reference source code:: Source code illustrating stream format @@ -70,10 +70,14 @@ availability: @itemize @bullet @item The lzip format provides very safe integrity checking and some data -recovery means. The lziprecover program can repair bit-flip errors (one -of the most common forms of data corruption) in lzip files, and provides -data recovery capabilities, including error-checked merging of damaged -copies of a file. +recovery means. The +@uref{http://www.nongnu.org/lzip/manual/lziprecover_manual.html#Data-safety,,lziprecover} +program can repair bit-flip errors (one of the most common forms of data +corruption) in lzip files, and provides data recovery capabilities, +including error-checked merging of damaged copies of a file. +@ifnothtml +@ref{Data safety,,,lziprecover}. +@end ifnothtml @item The lzip format is as simple as possible (but not simpler). The lzip @@ -109,6 +113,11 @@ makes it safer than compressors returning ambiguous warning values (like gzip) when it is used as a back end for other programs like tar or zutils. +Lzip will automatically use the smallest possible dictionary size for +each file without exceeding the given limit. Keep in mind that the +decompression memory requirement is affected at compression time by the +choice of dictionary size limit. + The amount of memory required for compression is about 1 or 2 times the dictionary size limit (1 if input file size is less than dictionary size limit, else 2) plus 9 times the dictionary size really used. The option @@ -116,11 +125,6 @@ limit, else 2) plus 9 times the dictionary size really used. The option of memory required for decompression is about 46 kB larger than the dictionary size really used. -Lzip will automatically use the smallest possible dictionary size for -each file without exceeding the given limit. Keep in mind that the -decompression memory requirement is affected at compression time by the -choice of dictionary size limit. - When compressing, lzip replaces every file given in the command line with a compressed version of itself, with the name "original_name.lz". When decompressing, lzip attempts to guess the name for the decompressed @@ -152,8 +156,8 @@ corresponding uncompressed files. Integrity testing of concatenated compressed files is also supported. Lzip can produce multi-member files and safely recover, with -lziprecover, the undamaged members in case of file damage. Lzip can also -split the compressed output in volumes of a given size, even when +lziprecover, the undamaged members in case of file damage. Lzip can +also split the compressed output in volumes of a given size, even when reading from standard input. This allows the direct creation of multivolume compressed tar archives. @@ -162,72 +166,6 @@ automatically creating multi-member output. The members so created are large, about 2 PiB each. -@node Algorithm -@chapter Algorithm -@cindex algorithm - -In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a -concrete algorithm; it is more like "any algorithm using the LZMA coding -scheme". For example, the option '-0' of lzip uses the scheme in almost -the simplest way possible; issuing the longest match it can find, or a -literal byte if it can't find a match. Inversely, a much more elaborated -way of finding coding sequences of minimum size than the one currently -used by lzip could be developed, and the resulting sequence could also -be coded using the LZMA coding scheme. - -Lzip currently implements two variants of the LZMA algorithm; fast -(used by option -0) and normal (used by all other compression levels). - -The high compression of LZMA comes from combining two basic, well-proven -compression ideas: sliding dictionaries (LZ77/78) and markov models (the -thing used by every compression algorithm that uses a range encoder or -similar order-0 entropy coder as its last stage) with segregation of -contexts according to what the bits are used for. - -Lzip is a two stage compressor. The first stage is a Lempel-Ziv coder, -which reduces redundancy by translating chunks of data to their -corresponding distance-length pairs. The second stage is a range encoder -that uses a different probability model for each type of data; -distances, lengths, literal bytes, etc. - -Here is how it works, step by step: - -1) The member header is written to the output stream. - -2) The first byte is coded literally, because there are no previous -bytes to which the match finder can refer to. - -3) The main encoder advances to the next byte in the input data and -calls the match finder. - -4) The match finder fills an array with the minimum distances before the -current byte where a match of a given length can be found. - -5) Go back to step 3 until a sequence (formed of pairs, repeated -distances and literal bytes) of minimum price has been formed. Where the -price represents the number of output bits produced. - -6) The range encoder encodes the sequence produced by the main encoder -and sends the produced bytes to the output stream. - -7) Go back to step 3 until the input data are finished or until the -member or volume size limits are reached. - -8) The range encoder is flushed. - -9) The member trailer is written to the output stream. - -10) If there are more data to compress, go back to step 1. - -@sp 1 -@noindent -The ideas embodied in lzip are due to (at least) the following people: -Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for -the definition of Markov chains), G.N.N. Martin (for the definition of -range encoding), Igor Pavlov (for putting all the above together in -LZMA), and Julian Seward (for bzip2's CLI). - - @node Invoking lzip @chapter Invoking lzip @cindex invoking @@ -274,7 +212,7 @@ Force overwrite of output files. @item -F @itemx --recompress -Force recompression of files whose name already has the @samp{.lz} or +Force re-compression of files whose name already has the @samp{.lz} or @samp{.tlz} suffix. @item -k @@ -392,6 +330,157 @@ invalid input file, 3 for an internal consistency error (eg, bug) which caused lzip to panic. +@node Quality assurance +@chapter Design, development and testing of lzip +@cindex quality assurance + +There are two ways of constructing a software design. One way is to make +it so simple that there are obviously no deficiencies and the other is +to make it so complicated that there are no obvious deficiencies.@* +--- C.A.R. Hoare + +Lzip has been designed, written and tested with great care to be the +standard general-purpose compressor for unix-like systems. This chapter +describes the lessons learned from previous compressors (gzip and +bzip2), and their application to the design of lzip. + +@sp 1 +@section Format design + +When gzip was designed in 1992, computers and operating systems were +much less capable than they are today. Gzip tried to work around some of +those limitations, like 8.3 file names, with additional fields in its +file format. + +Today those limitations have mostly disappeared, and the format of gzip +has proved to be unnecessarily complicated. It includes fields that were +never used, others that have lost its usefulness, and finally others +that have become too limited. + +Bzip2 was designed 5 years later, and its format is simpler than the one +of gzip. + +Probably the worst defect of the gzip format from the point of view of +data safety is the variable size of its header. If the byte at offset 3 +(flags) of a gzip member gets corrupted, it mat become very difficult to +recover the data, even if the compressed blocks are intact, because it +can't be known with certainty where the compressed blocks begin. + +By contrast, the header of a lzip member has a fixed length of 6. The +lzma stream in a lzip member always starts at offset 6, making it +trivial to recover the data even if the whole header becomes corrupt. + +Bzip2 also provides a header of fixed length and marks the begin and end +of each compressed block with six magic bytes, making it possible to +find the compressed blocks even in case of file damage. But bzip2 does +not store the size of each compressed block, as lzip does. + +Lzip provides better data recovery capabilities than any other gzip-like +compressor because its format has been designed from the beginning to be +simple and safe. It would be very difficult to write an automatic +recovery tool like lziprecover for the gzip format. And, as far as I +know, it has never been writen. + +The lzip format is designed for long-term archiving. Therefore it +excludes any unneeded features that may interfere with the future +extraction of the uncompressed data. + +@sp 1 +@subsection Gzip format (mis)features not present in lzip + +@table @samp +@item Multiple algorithms + +Gzip provides a CM (Compression Method) field that has never been used +because it is a bad idea to begin with. New compression methods may +require additional fields, making it impossible to implement new methods +and, at the same time, keep the same format. This field does not solve +the problem of format proliferation; it just makes the problem less +obvious. + +@item Optional fields in header + +Unless special precautions are taken, optional fields are generally a +bad idea because they produce a header of variable size. The gzip header +has 2 fields that, in addition to being optional, are zero-terminated. +This means that if any byte inside the field gets zeroed, or if the +terminating zero gets altered, gzip won't be able to find neither the +header CRC nor the compressed blocks. + +@item Optional CRC for the header + +Using an optional checksum for the header is not only a bad idea, it is +an error; it may prevent the extraction of perfectly good data. For +example, if the checksum is used and the bit enabling it is reset by a +bit-flip, the header will appear to be intact (in spite of being +corrupt) while the compressed blocks will appear to be totally +unrecoverable (in spite of being intact). Very misleading indeed. + +@end table + +@subsection Lzip format improvements over gzip and bzip2 + +@table @samp +@item 64-bit size field + +Probably the most frequently reported shortcoming of the gzip format is +that it only stores the least significant 32 bits of the uncompressed +size. The size of any file larger than 4 GiB gets truncated. + +Bzip2 does not store the uncompressed size of the file. + +The lzip format provides a 64-bit field for the uncompressed size. +Additionaly, lzip produces multi-member output automatically when the +size is too large for a single member, allowing for an unlimited +uncompressed size. + +@item Distributed index + +The lzip format provides a distributed index that, among other things, +helps plzip to decompress several times faster than pigz and helps +lziprecover do its job. Neither the gzip format nor the bzip2 format do +provide an index. + +A distributed index is safer and more scalable than a monolithic index. +The monolithic index introduces a single point of failure in the +compressed file and may limit the number of members or the total +uncompressed size. + +@end table + +@section Quality of implementation + +@table @samp +@item Multiple implementations + +Just like the lzip format provides 4 factor protection against +undetected data corruption, the development methodology of the lzip +family of compressors provides 3 factor protection against undetected +programming errors. + +Three related but independent compressor implementations, lzip, clzip +and minilzip/lzlib, are developed concurrently. Every stable release of +any of them is subjected to a hundred hours of intensive testing to +verify that it produces identical output to the other two. This +guarantees that all three implement the same algorithm, and makes it +unlikely that any of them may contain serious undiscovered errors. In +fact, no errors have been discovered in lzip since 2009. + +@item Dictionary size + +Lzip automatically uses the smallest possible dictionary size for each +file. In addition to reducing the amount of memory required for +decompression, this feature also minimizes the probability of being +affected by RAM errors during compression. + +@item Exit status + +Returning a warning status of 2 is a design flaw of compress that leaked +into the design of gzip. Both bzip2 and lzip are free from this flaw. + +@end table + + @node File format @chapter File format @cindex file format @@ -468,6 +557,72 @@ facilitates safe recovery of undamaged members from multi-member files. @end table +@node Algorithm +@chapter Algorithm +@cindex algorithm + +In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a +concrete algorithm; it is more like "any algorithm using the LZMA coding +scheme". For example, the option @samp{-0} of lzip uses the scheme in almost +the simplest way possible; issuing the longest match it can find, or a +literal byte if it can't find a match. Inversely, a much more elaborated +way of finding coding sequences of minimum size than the one currently +used by lzip could be developed, and the resulting sequence could also +be coded using the LZMA coding scheme. + +Lzip currently implements two variants of the LZMA algorithm; fast +(used by option @samp{-0}) and normal (used by all other compression levels). + +The high compression of LZMA comes from combining two basic, well-proven +compression ideas: sliding dictionaries (LZ77/78) and markov models (the +thing used by every compression algorithm that uses a range encoder or +similar order-0 entropy coder as its last stage) with segregation of +contexts according to what the bits are used for. + +Lzip is a two stage compressor. The first stage is a Lempel-Ziv coder, +which reduces redundancy by translating chunks of data to their +corresponding distance-length pairs. The second stage is a range encoder +that uses a different probability model for each type of data; +distances, lengths, literal bytes, etc. + +Here is how it works, step by step: + +1) The member header is written to the output stream. + +2) The first byte is coded literally, because there are no previous +bytes to which the match finder can refer to. + +3) The main encoder advances to the next byte in the input data and +calls the match finder. + +4) The match finder fills an array with the minimum distances before the +current byte where a match of a given length can be found. + +5) Go back to step 3 until a sequence (formed of pairs, repeated +distances and literal bytes) of minimum price has been formed. Where the +price represents the number of output bits produced. + +6) The range encoder encodes the sequence produced by the main encoder +and sends the produced bytes to the output stream. + +7) Go back to step 3 until the input data are finished or until the +member or volume size limits are reached. + +8) The range encoder is flushed. + +9) The member trailer is written to the output stream. + +10) If there are more data to compress, go back to step 1. + +@sp 1 +@noindent +The ideas embodied in lzip are due to (at least) the following people: +Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for +the definition of Markov chains), G.N.N. Martin (for the definition of +range encoding), Igor Pavlov (for putting all the above together in +LZMA), and Julian Seward (for bzip2's CLI). + + @node Stream format @chapter Format of the LZMA stream in lzip files @cindex format of the LZMA stream @@ -690,140 +845,6 @@ sequences (matches, repeated matches, and literal bytes), until the "End Of Stream" marker is decoded. -@node Quality assurance -@chapter Design, development and testing of lzip -@cindex quality assurance - -There are two ways of constructing a software design. One way is to make -it so simple that there are obviously no deficiencies and the other is -to make it so complicated that there are no obvious deficiencies.@* ---- C.A.R. Hoare - -Lzip has been designed, written and tested with great care to be the -standard general-purpose compressor for unix-like systems. This chapter -describes the lessons learned from previous compressors (gzip and -bzip2), and their application to the design of lzip. - -@sp 1 -@section Format design - -When gzip was designed in 1992, computers and operating systems were -much less capable than they are today. Gzip tried to work around some of -those limitations, like 8.3 file names, with additional fields in its -file format. - -Today those limitations have mostly disappeared, and the format of gzip -has proved to be unnecessarily complicated. It includes fields that were -never used, others that have lost its usefulness, and finally others -that have become too limited. - -Bzip2 was designed 5 years later, and its format is in some aspects -simpler than the one of gzip. But bzip2 also shows complexities in its -file format which slow down decompression and, in retrospect, are -unnecessary. - -Probably the worst defect of the gzip format from the point of view of -data safety is the variable size of its header. If the byte at offset 3 -(flags) of a gzip member gets corrupted, it mat become very difficult to -recover the data, even if the compressed blocks are intact, because it -can't be known with certainty where the compressed blocks begin. - -By contrast, the lzma stream in a lzip member always starts at offset 6, -making it trivial to recover the data even if the whole header becomes -corrupt. - -Lzip provides better data recovery capabilities than any other gzip-like -compressor because its format has been designed from the beginning to be -simple and safe. It would be very difficult to write an automatic -recovery tool like lziprecover for the gzip format. And, as far as I -know, it has never been writen. - -The lzip format is designed for long-term archiving. Therefore it -excludes any unneeded features that may interfere with the future -extraction of the uncompressed data. - -@sp 1 -@subsection Gzip format (mis)features not present in lzip - -@table @samp -@item Multiple algorithms - -Gzip provides a CM (Compression Method) field that has never been used -because it is a bad idea to begin with. New compression methods may -require additional fields, making it impossible to implement new methods -and, at the same time, keep the same format. This field does not solve -the problem of format proliferation; it just makes the problem less -obvious. - -@item Optional fields in header - -Unless special precautions are taken, optional fields are generally a -bad idea because they produce a header of variable size. The gzip header -has 2 fields that, in addition to being optional, are zero-terminated. -This means that if any byte inside the field gets zeroed, or if the -terminating zero gets altered, gzip won't be able to find neither the -header CRC nor the compressed blocks. - -Using an optional checksum for the header is not only a bad idea, it is -an error; it may prevent the extraction of perfectly good data. For -example, if the checksum is used and the bit enabling it is reset by a -bit-flip, the header will appear to be intact (in spite of being -corrupt) while the compressed blocks will appear to be totally -unrecoverable (in spite of being intact). Very misleading indeed. - -@end table - -@subsection Lzip format improvements over gzip - -@table @samp -@item 64-bit size field - -Probably the most frequently reported shortcoming of the gzip format is -that it only stores the least significant 32 bits of the uncompressed -size. The size of any file larger than 4 GiB gets truncated. - -The lzip format provides a 64-bit field for the uncompressed size. -Additionaly, lzip produces multi-member output automatically when the -size is too large for a single member, allowing an unlimited -uncompressed size. - -@item Distributed index - -The lzip format provides a distributed index that, among other things, -helps plzip to decompress several times faster than pigz and helps -lziprecover do its job. The gzip format does not provide an index. - -A distributed index is safer and more scalable than a monolithic index. -The monolithic index introduces a single point of failure in the -compressed file and may limit the number of members or the total -uncompressed size. - -@end table - -@section Quality of implementation - -Three related but independent compressor implementations, lzip, clzip -and minilzip/lzlib, are developed concurrently. Every stable release of -any of them is subjected to a hundred hours of intensive testing to -verify that it produces identical output to the other two. This -guarantees that all three implement the same algorithm, and makes it -unlikely that any of them may contain serious undiscovered errors. In -fact, no errors have been discovered in lzip since 2009. - -Just like the lzip format provides 4 factor protection against -undetected data corruption, the development methodology described above -provides 3 factor protection against undetected programming errors in -lzip. - -Lzip automatically uses the smallest possible dictionary size for each -file. In addition to reducing the amount of memory required for -decompression, this feature also minimizes the probability of being -affected by RAM errors during compression. - -Returning a warning status of 2 is a design flaw of compress that leaked -into the design of gzip. Both bzip2 and lzip are free form this flaw. - - @node Examples @chapter A small tutorial with examples @cindex examples @@ -947,7 +968,7 @@ find by running @w{@code{lzip --version}}. @cindex reference source code @verbatim -/* Lzd - Educational decompressor for lzip files +/* Lzd - Educational decompressor for the lzip format Copyright (C) 2013-2015 Antonio Diaz Diaz. This program is free software: you have unlimited permission @@ -1204,7 +1225,7 @@ class LZ_decoder } public: - LZ_decoder( const unsigned dict_size ) + explicit LZ_decoder( const unsigned dict_size ) : partial_data_pos( 0 ), dictionary_size( dict_size ), @@ -1231,7 +1252,7 @@ void LZ_decoder::flush_data() crc32.update_buf( crc_, buffer + stream_pos, size ); errno = 0; if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size ) - { std::fprintf( stderr, "Write error: %s.\n", std::strerror( errno ) ); + { std::fprintf( stderr, "Write error: %s\n", std::strerror( errno ) ); std::exit( 1 ); } if( pos >= dictionary_size ) { partial_data_pos += pos; pos = 0; } stream_pos = pos; @@ -1273,7 +1294,7 @@ bool LZ_decoder::decode_member() // Returns false if error put_byte( rdec.decode_matched( bm, peek( rep0 ) ) ); state.set_char(); } - else + else // match or repeated match { int len; if( rdec.decode_bit( bm_rep[state()] ) != 0 ) // 2nd bit @@ -1302,7 +1323,7 @@ bool LZ_decoder::decode_member() // Returns false if error state.set_rep(); len = min_match_len + rdec.decode_len( rep_len_model, pos_state ); } - else + else // match { rep3 = rep2; rep2 = rep1; rep1 = rep0; len = min_match_len + rdec.decode_len( match_len_model, pos_state ); @@ -1344,7 +1365,7 @@ int main( const int argc, const char * const argv[] ) { if( argc > 1 ) { - std::printf( "Lzd %s - Educational decompressor for lzip files.\n", + std::printf( "Lzd %s - Educational decompressor for the lzip format.\n", PROGVERSION ); std::printf( "Study the source to learn how a lzip decompressor works.\n" "See the lzip manual for an explanation of the code.\n" @@ -1371,19 +1392,19 @@ int main( const int argc, const char * const argv[] ) if( std::feof( stdin ) || std::memcmp( header, "LZIP\x01", 5 ) != 0 ) { if( first_member ) - { std::fprintf( stderr, "Bad magic number (file not in lzip format)\n" ); + { std::fputs( "Bad magic number (file not in lzip format).\n", stderr ); return 2; } break; } unsigned dict_size = 1 << ( header[5] & 0x1F ); dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 ); if( dict_size < min_dictionary_size || dict_size > max_dictionary_size ) - { std::fprintf( stderr, "Invalid dictionary size in member header\n" ); + { std::fputs( "Invalid dictionary size in member header.\n", stderr ); return 2; } LZ_decoder decoder( dict_size ); if( !decoder.decode_member() ) - { std::fprintf( stderr, "Data error\n" ); return 2; } + { std::fputs( "Data error\n", stderr ); return 2; } File_trailer trailer; for( int i = 0; i < 20; ++i ) trailer[i] = std::getc( stdin ); @@ -1392,11 +1413,11 @@ int main( const int argc, const char * const argv[] ) unsigned long long data_size = 0; for( int i = 11; i >= 4; --i ) { data_size <<= 8; data_size += trailer[i]; } if( crc != decoder.crc() || data_size != decoder.data_position() ) - { std::fprintf( stderr, "CRC error\n" ); return 2; } + { std::fputs( "CRC error\n", stderr ); return 2; } } if( std::fclose( stdout ) != 0 ) - { std::fprintf( stderr, "Can't close stdout: %s.\n", std::strerror( errno ) ); + { std::fprintf( stderr, "Can't close stdout: %s\n", std::strerror( errno ) ); return 1; } return 0; } |