diff options
Diffstat (limited to 'doc/lzip.texi')
-rw-r--r-- | doc/lzip.texi | 176 |
1 files changed, 160 insertions, 16 deletions
diff --git a/doc/lzip.texi b/doc/lzip.texi index 5046140..ac44ee9 100644 --- a/doc/lzip.texi +++ b/doc/lzip.texi @@ -6,8 +6,8 @@ @finalout @c %**end of header -@set UPDATED 17 April 2015 -@set VERSION 1.17-rc1 +@set UPDATED 25 May 2015 +@set VERSION 1.17-rc2 @dircategory Data Compression @direntry @@ -40,6 +40,7 @@ This manual is for Lzip (version @value{VERSION}, @value{UPDATED}). * Invoking lzip:: Command line interface * File format:: Detailed format of the compressed file * Stream format:: Format of the LZMA stream in lzip files +* Quality assurance:: Design, development and testing of lzip * Examples:: A small tutorial with examples * Problems:: Reporting bugs * Reference source code:: Source code illustrating stream format @@ -60,8 +61,7 @@ to copy, distribute and modify it. Lzip is a lossless data compressor with a user interface similar to the one of gzip or bzip2. Lzip is about as fast as gzip, compresses most files more than bzip2, and is better than both from a data recovery -perspective. Lzip is a clean implementation of the LZMA -(Lempel-Ziv-Markov chain-Algorithm) "algorithm". +perspective. The lzip file format is designed for data sharing and long-term archiving, taking into account both data integrity and decoder @@ -159,23 +159,24 @@ multivolume compressed tar archives. Lzip is able to compress and decompress streams of unlimited size by automatically creating multi-member output. The members so created are -large, about 64 PiB each. +large, about 2 PiB each. @node Algorithm @chapter Algorithm @cindex algorithm -There is no such thing as a "LZMA algorithm"; it is more like a "LZMA -coding scheme". For example, the option '-0' of lzip uses the scheme in -almost the simplest way possible; issuing the longest match it can find, -or a literal byte if it can't find a match. Inversely, a much more -elaborated way of finding coding sequences of minimum size than the one -currently used by lzip could be developed, and the resulting sequence -could also be coded using the LZMA coding scheme. +In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a +concrete algorithm; it is more like "any algorithm using the LZMA coding +scheme". For example, the option '-0' of lzip uses the scheme in almost +the simplest way possible; issuing the longest match it can find, or a +literal byte if it can't find a match. Inversely, a much more elaborated +way of finding coding sequences of minimum size than the one currently +used by lzip could be developed, and the resulting sequence could also +be coded using the LZMA coding scheme. -Lzip currently implements two variants of the LZMA algorithm; fast (used -by option -0) and normal (used by all other compression levels). +Lzip currently implements two variants of the LZMA algorithm; fast +(used by option -0) and normal (used by all other compression levels). The high compression of LZMA comes from combining two basic, well-proven compression ideas: sliding dictionaries (LZ77/78) and markov models (the @@ -242,7 +243,7 @@ lzip [@var{options}] [@var{files}] Lzip supports the following options: -@table @samp +@table @code @item -h @itemx --help Print an informative help message describing the options and exit. @@ -255,7 +256,7 @@ Print the version number of lzip on the standard output and exit. @itemx --member-size=@var{bytes} Set the member size limit to @var{bytes}. A small member size may degrade compression ratio, so use it only when needed. Valid values -range from 100 kB to 64 PiB. Defaults to 64 PiB. +range from 100 kB to 2 PiB. Defaults to 2 PiB. @item -c @itemx --stdout @@ -689,6 +690,140 @@ sequences (matches, repeated matches, and literal bytes), until the "End Of Stream" marker is decoded. +@node Quality assurance +@chapter Design, development and testing of lzip +@cindex quality assurance + +There are two ways of constructing a software design. One way is to make +it so simple that there are obviously no deficiencies and the other is +to make it so complicated that there are no obvious deficiencies.@* +--- C.A.R. Hoare + +Lzip has been designed, written and tested with great care to be the +standard general-purpose compressor for unix-like systems. This chapter +describes the lessons learned from previous compressors (gzip and +bzip2), and their application to the design of lzip. + +@sp 1 +@section Format design + +When gzip was designed in 1992, computers and operating systems were +much less capable than they are today. Gzip tried to work around some of +those limitations, like 8.3 file names, with additional fields in its +file format. + +Today those limitations have mostly disappeared, and the format of gzip +has proved to be unnecessarily complicated. It includes fields that were +never used, others that have lost its usefulness, and finally others +that have become too limited. + +Bzip2 was designed 5 years later, and its format is in some aspects +simpler than the one of gzip. But bzip2 also shows complexities in its +file format which slow down decompression and, in retrospect, are +unnecessary. + +Probably the worst defect of the gzip format from the point of view of +data safety is the variable size of its header. If the byte at offset 3 +(flags) of a gzip member gets corrupted, it mat become very difficult to +recover the data, even if the compressed blocks are intact, because it +can't be known with certainty where the compressed blocks begin. + +By contrast, the lzma stream in a lzip member always starts at offset 6, +making it trivial to recover the data even if the whole header becomes +corrupt. + +Lzip provides better data recovery capabilities than any other gzip-like +compressor because its format has been designed from the beginning to be +simple and safe. It would be very difficult to write an automatic +recovery tool like lziprecover for the gzip format. And, as far as I +know, it has never been writen. + +The lzip format is designed for long-term archiving. Therefore it +excludes any unneeded features that may interfere with the future +extraction of the uncompressed data. + +@sp 1 +@subsection Gzip format (mis)features not present in lzip + +@table @samp +@item Multiple algorithms + +Gzip provides a CM (Compression Method) field that has never been used +because it is a bad idea to begin with. New compression methods may +require additional fields, making it impossible to implement new methods +and, at the same time, keep the same format. This field does not solve +the problem of format proliferation; it just makes the problem less +obvious. + +@item Optional fields in header + +Unless special precautions are taken, optional fields are generally a +bad idea because they produce a header of variable size. The gzip header +has 2 fields that, in addition to being optional, are zero-terminated. +This means that if any byte inside the field gets zeroed, or if the +terminating zero gets altered, gzip won't be able to find neither the +header CRC nor the compressed blocks. + +Using an optional checksum for the header is not only a bad idea, it is +an error; it may prevent the extraction of perfectly good data. For +example, if the checksum is used and the bit enabling it is reset by a +bit-flip, the header will appear to be intact (in spite of being +corrupt) while the compressed blocks will appear to be totally +unrecoverable (in spite of being intact). Very misleading indeed. + +@end table + +@subsection Lzip format improvements over gzip + +@table @samp +@item 64-bit size field + +Probably the most frequently reported shortcoming of the gzip format is +that it only stores the least significant 32 bits of the uncompressed +size. The size of any file larger than 4 GiB gets truncated. + +The lzip format provides a 64-bit field for the uncompressed size. +Additionaly, lzip produces multi-member output automatically when the +size is too large for a single member, allowing an unlimited +uncompressed size. + +@item Distributed index + +The lzip format provides a distributed index that, among other things, +helps plzip to decompress several times faster than pigz and helps +lziprecover do its job. The gzip format does not provide an index. + +A distributed index is safer and more scalable than a monolithic index. +The monolithic index introduces a single point of failure in the +compressed file and may limit the number of members or the total +uncompressed size. + +@end table + +@section Quality of implementation + +Three related but independent compressor implementations, lzip, clzip +and minilzip/lzlib, are developed concurrently. Every stable release of +any of them is subjected to a hundred hours of intensive testing to +verify that it produces identical output to the other two. This +guarantees that all three implement the same algorithm, and makes it +unlikely that any of them may contain serious undiscovered errors. In +fact, no errors have been discovered in lzip since 2009. + +Just like the lzip format provides 4 factor protection against +undetected data corruption, the development methodology described above +provides 3 factor protection against undetected programming errors in +lzip. + +Lzip automatically uses the smallest possible dictionary size for each +file. In addition to reducing the amount of memory required for +decompression, this feature also minimizes the probability of being +affected by RAM errors during compression. + +Returning a warning status of 2 is a design flaw of compress that leaked +into the design of gzip. Both bzip2 and lzip are free form this flaw. + + @node Examples @chapter A small tutorial with examples @cindex examples @@ -835,6 +970,10 @@ find by running @w{@code{lzip --version}}. #include <cstring> #include <stdint.h> #include <unistd.h> +#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER) +#include <fcntl.h> +#include <io.h> +#endif class State @@ -1220,6 +1359,11 @@ int main( const int argc, const char * const argv[] ) return 0; } +#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER) + setmode( fileno( stdin ), O_BINARY ); + setmode( fileno( stdout ), O_BINARY ); +#endif + for( bool first_member = true; ; first_member = false ) { File_header header; |