diff options
Diffstat (limited to 'doc')
-rw-r--r-- | doc/lzip.1 | 2 | ||||
-rw-r--r-- | doc/lzip.info | 199 | ||||
-rw-r--r-- | doc/lzip.texi | 176 |
3 files changed, 335 insertions, 42 deletions
@@ -1,5 +1,5 @@ .\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.46.1. -.TH LZIP "1" "April 2015" "lzip 1.17-rc1" "User Commands" +.TH LZIP "1" "May 2015" "lzip 1.17-rc2" "User Commands" .SH NAME lzip \- reduces the size of files .SH SYNOPSIS diff --git a/doc/lzip.info b/doc/lzip.info index 53f7c55..6854503 100644 --- a/doc/lzip.info +++ b/doc/lzip.info @@ -11,7 +11,7 @@ File: lzip.info, Node: Top, Next: Introduction, Up: (dir) Lzip Manual *********** -This manual is for Lzip (version 1.17-rc1, 17 April 2015). +This manual is for Lzip (version 1.17-rc2, 25 May 2015). * Menu: @@ -20,6 +20,7 @@ This manual is for Lzip (version 1.17-rc1, 17 April 2015). * Invoking lzip:: Command line interface * File format:: Detailed format of the compressed file * Stream format:: Format of the LZMA stream in lzip files +* Quality assurance:: Design, development and testing of lzip * Examples:: A small tutorial with examples * Problems:: Reporting bugs * Reference source code:: Source code illustrating stream format @@ -40,8 +41,7 @@ File: lzip.info, Node: Introduction, Next: Algorithm, Prev: Top, Up: Top Lzip is a lossless data compressor with a user interface similar to the one of gzip or bzip2. Lzip is about as fast as gzip, compresses most files more than bzip2, and is better than both from a data recovery -perspective. Lzip is a clean implementation of the LZMA -(Lempel-Ziv-Markov chain-Algorithm) "algorithm". +perspective. The lzip file format is designed for data sharing and long-term archiving, taking into account both data integrity and decoder @@ -133,7 +133,7 @@ multivolume compressed tar archives. Lzip is able to compress and decompress streams of unlimited size by automatically creating multi-member output. The members so created are -large, about 64 PiB each. +large, about 2 PiB each. File: lzip.info, Node: Algorithm, Next: Invoking lzip, Prev: Introduction, Up: Top @@ -141,13 +141,14 @@ File: lzip.info, Node: Algorithm, Next: Invoking lzip, Prev: Introduction, U 2 Algorithm *********** -There is no such thing as a "LZMA algorithm"; it is more like a "LZMA -coding scheme". For example, the option '-0' of lzip uses the scheme in -almost the simplest way possible; issuing the longest match it can find, -or a literal byte if it can't find a match. Inversely, a much more -elaborated way of finding coding sequences of minimum size than the one -currently used by lzip could be developed, and the resulting sequence -could also be coded using the LZMA coding scheme. +In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a +concrete algorithm; it is more like "any algorithm using the LZMA coding +scheme". For example, the option '-0' of lzip uses the scheme in almost +the simplest way possible; issuing the longest match it can find, or a +literal byte if it can't find a match. Inversely, a much more elaborated +way of finding coding sequences of minimum size than the one currently +used by lzip could be developed, and the resulting sequence could also +be coded using the LZMA coding scheme. Lzip currently implements two variants of the LZMA algorithm; fast (used by option -0) and normal (used by all other compression levels). @@ -224,7 +225,7 @@ The format for running lzip is: '--member-size=BYTES' Set the member size limit to BYTES. A small member size may degrade compression ratio, so use it only when needed. Valid values - range from 100 kB to 64 PiB. Defaults to 64 PiB. + range from 100 kB to 2 PiB. Defaults to 2 PiB. '-c' '--stdout' @@ -432,7 +433,7 @@ additional information before, between, or after them. -File: lzip.info, Node: Stream format, Next: Examples, Prev: File format, Up: Top +File: lzip.info, Node: Stream format, Next: Quality assurance, Prev: File format, Up: Top 5 Format of the LZMA stream in lzip files ***************************************** @@ -645,9 +646,146 @@ with the appropriate contexts to decode the different coding sequences Stream" marker is decoded. -File: lzip.info, Node: Examples, Next: Problems, Prev: Stream format, Up: Top +File: lzip.info, Node: Quality assurance, Next: Examples, Prev: Stream format, Up: Top -6 A small tutorial with examples +6 Design, development and testing of lzip +***************************************** + +There are two ways of constructing a software design. One way is to make +it so simple that there are obviously no deficiencies and the other is +to make it so complicated that there are no obvious deficiencies. +-- C.A.R. Hoare + + Lzip has been designed, written and tested with great care to be the +standard general-purpose compressor for unix-like systems. This chapter +describes the lessons learned from previous compressors (gzip and +bzip2), and their application to the design of lzip. + + +6.1 Format design +================= + +When gzip was designed in 1992, computers and operating systems were +much less capable than they are today. Gzip tried to work around some of +those limitations, like 8.3 file names, with additional fields in its +file format. + + Today those limitations have mostly disappeared, and the format of +gzip has proved to be unnecessarily complicated. It includes fields +that were never used, others that have lost its usefulness, and finally +others that have become too limited. + + Bzip2 was designed 5 years later, and its format is in some aspects +simpler than the one of gzip. But bzip2 also shows complexities in its +file format which slow down decompression and, in retrospect, are +unnecessary. + + Probably the worst defect of the gzip format from the point of view +of data safety is the variable size of its header. If the byte at +offset 3 (flags) of a gzip member gets corrupted, it mat become very +difficult to recover the data, even if the compressed blocks are +intact, because it can't be known with certainty where the compressed +blocks begin. + + By contrast, the lzma stream in a lzip member always starts at +offset 6, making it trivial to recover the data even if the whole +header becomes corrupt. + + Lzip provides better data recovery capabilities than any other +gzip-like compressor because its format has been designed from the +beginning to be simple and safe. It would be very difficult to write an +automatic recovery tool like lziprecover for the gzip format. And, as +far as I know, it has never been writen. + + The lzip format is designed for long-term archiving. Therefore it +excludes any unneeded features that may interfere with the future +extraction of the uncompressed data. + + +6.1.1 Gzip format (mis)features not present in lzip +--------------------------------------------------- + +'Multiple algorithms' + Gzip provides a CM (Compression Method) field that has never been + used because it is a bad idea to begin with. New compression + methods may require additional fields, making it impossible to + implement new methods and, at the same time, keep the same format. + This field does not solve the problem of format proliferation; it + just makes the problem less obvious. + +'Optional fields in header' + Unless special precautions are taken, optional fields are + generally a bad idea because they produce a header of variable + size. The gzip header has 2 fields that, in addition to being + optional, are zero-terminated. This means that if any byte inside + the field gets zeroed, or if the terminating zero gets altered, + gzip won't be able to find neither the header CRC nor the + compressed blocks. + + Using an optional checksum for the header is not only a bad idea, + it is an error; it may prevent the extraction of perfectly good + data. For example, if the checksum is used and the bit enabling it + is reset by a bit-flip, the header will appear to be intact (in + spite of being corrupt) while the compressed blocks will appear to + be totally unrecoverable (in spite of being intact). Very + misleading indeed. + + +6.1.2 Lzip format improvements over gzip +---------------------------------------- + +'64-bit size field' + Probably the most frequently reported shortcoming of the gzip + format is that it only stores the least significant 32 bits of the + uncompressed size. The size of any file larger than 4 GiB gets + truncated. + + The lzip format provides a 64-bit field for the uncompressed size. + Additionaly, lzip produces multi-member output automatically when + the size is too large for a single member, allowing an unlimited + uncompressed size. + +'Distributed index' + The lzip format provides a distributed index that, among other + things, helps plzip to decompress several times faster than pigz + and helps lziprecover do its job. The gzip format does not provide + an index. + + A distributed index is safer and more scalable than a monolithic + index. The monolithic index introduces a single point of failure + in the compressed file and may limit the number of members or the + total uncompressed size. + + +6.2 Quality of implementation +============================= + +Three related but independent compressor implementations, lzip, clzip +and minilzip/lzlib, are developed concurrently. Every stable release of +any of them is subjected to a hundred hours of intensive testing to +verify that it produces identical output to the other two. This +guarantees that all three implement the same algorithm, and makes it +unlikely that any of them may contain serious undiscovered errors. In +fact, no errors have been discovered in lzip since 2009. + + Just like the lzip format provides 4 factor protection against +undetected data corruption, the development methodology described above +provides 3 factor protection against undetected programming errors in +lzip. + + Lzip automatically uses the smallest possible dictionary size for +each file. In addition to reducing the amount of memory required for +decompression, this feature also minimizes the probability of being +affected by RAM errors during compression. + + Returning a warning status of 2 is a design flaw of compress that +leaked into the design of gzip. Both bzip2 and lzip are free form this +flaw. + + +File: lzip.info, Node: Examples, Next: Problems, Prev: Quality assurance, Up: Top + +7 A small tutorial with examples ******************************** WARNING! Even if lzip is bug-free, other causes may result in a corrupt @@ -720,7 +858,7 @@ file with a member size of 32 MiB. File: lzip.info, Node: Problems, Next: Reference source code, Prev: Examples, Up: Top -7 Reporting bugs +8 Reporting bugs **************** There are probably bugs in lzip. There are certainly errors and @@ -761,6 +899,10 @@ Appendix A Reference source code #include <cstring> #include <stdint.h> #include <unistd.h> +#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER) +#include <fcntl.h> +#include <io.h> +#endif class State @@ -1146,6 +1288,11 @@ int main( const int argc, const char * const argv[] ) return 0; } +#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER) + setmode( fileno( stdin ), O_BINARY ); + setmode( fileno( stdout ), O_BINARY ); +#endif + for( bool first_member = true; ; first_member = false ) { File_header header; @@ -1201,6 +1348,7 @@ Concept index * introduction: Introduction. (line 6) * invoking: Invoking lzip. (line 6) * options: Invoking lzip. (line 6) +* quality assurance: Quality assurance. (line 6) * reference source code: Reference source code. (line 6) * usage: Invoking lzip. (line 6) * version: Invoking lzip. (line 6) @@ -1209,15 +1357,16 @@ Concept index Tag Table: Node: Top208 -Node: Introduction1025 -Node: Algorithm6036 -Node: Invoking lzip8793 -Node: File format14383 -Node: Stream format16768 -Node: Examples26200 -Node: Problems28157 -Node: Reference source code28687 -Node: Concept index42024 +Node: Introduction1090 +Node: Algorithm6008 +Node: Invoking lzip8833 +Node: File format14421 +Node: Stream format16806 +Node: Quality assurance26247 +Node: Examples32269 +Node: Problems34230 +Node: Reference source code34760 +Node: Concept index48358 End Tag Table diff --git a/doc/lzip.texi b/doc/lzip.texi index 5046140..ac44ee9 100644 --- a/doc/lzip.texi +++ b/doc/lzip.texi @@ -6,8 +6,8 @@ @finalout @c %**end of header -@set UPDATED 17 April 2015 -@set VERSION 1.17-rc1 +@set UPDATED 25 May 2015 +@set VERSION 1.17-rc2 @dircategory Data Compression @direntry @@ -40,6 +40,7 @@ This manual is for Lzip (version @value{VERSION}, @value{UPDATED}). * Invoking lzip:: Command line interface * File format:: Detailed format of the compressed file * Stream format:: Format of the LZMA stream in lzip files +* Quality assurance:: Design, development and testing of lzip * Examples:: A small tutorial with examples * Problems:: Reporting bugs * Reference source code:: Source code illustrating stream format @@ -60,8 +61,7 @@ to copy, distribute and modify it. Lzip is a lossless data compressor with a user interface similar to the one of gzip or bzip2. Lzip is about as fast as gzip, compresses most files more than bzip2, and is better than both from a data recovery -perspective. Lzip is a clean implementation of the LZMA -(Lempel-Ziv-Markov chain-Algorithm) "algorithm". +perspective. The lzip file format is designed for data sharing and long-term archiving, taking into account both data integrity and decoder @@ -159,23 +159,24 @@ multivolume compressed tar archives. Lzip is able to compress and decompress streams of unlimited size by automatically creating multi-member output. The members so created are -large, about 64 PiB each. +large, about 2 PiB each. @node Algorithm @chapter Algorithm @cindex algorithm -There is no such thing as a "LZMA algorithm"; it is more like a "LZMA -coding scheme". For example, the option '-0' of lzip uses the scheme in -almost the simplest way possible; issuing the longest match it can find, -or a literal byte if it can't find a match. Inversely, a much more -elaborated way of finding coding sequences of minimum size than the one -currently used by lzip could be developed, and the resulting sequence -could also be coded using the LZMA coding scheme. +In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a +concrete algorithm; it is more like "any algorithm using the LZMA coding +scheme". For example, the option '-0' of lzip uses the scheme in almost +the simplest way possible; issuing the longest match it can find, or a +literal byte if it can't find a match. Inversely, a much more elaborated +way of finding coding sequences of minimum size than the one currently +used by lzip could be developed, and the resulting sequence could also +be coded using the LZMA coding scheme. -Lzip currently implements two variants of the LZMA algorithm; fast (used -by option -0) and normal (used by all other compression levels). +Lzip currently implements two variants of the LZMA algorithm; fast +(used by option -0) and normal (used by all other compression levels). The high compression of LZMA comes from combining two basic, well-proven compression ideas: sliding dictionaries (LZ77/78) and markov models (the @@ -242,7 +243,7 @@ lzip [@var{options}] [@var{files}] Lzip supports the following options: -@table @samp +@table @code @item -h @itemx --help Print an informative help message describing the options and exit. @@ -255,7 +256,7 @@ Print the version number of lzip on the standard output and exit. @itemx --member-size=@var{bytes} Set the member size limit to @var{bytes}. A small member size may degrade compression ratio, so use it only when needed. Valid values -range from 100 kB to 64 PiB. Defaults to 64 PiB. +range from 100 kB to 2 PiB. Defaults to 2 PiB. @item -c @itemx --stdout @@ -689,6 +690,140 @@ sequences (matches, repeated matches, and literal bytes), until the "End Of Stream" marker is decoded. +@node Quality assurance +@chapter Design, development and testing of lzip +@cindex quality assurance + +There are two ways of constructing a software design. One way is to make +it so simple that there are obviously no deficiencies and the other is +to make it so complicated that there are no obvious deficiencies.@* +--- C.A.R. Hoare + +Lzip has been designed, written and tested with great care to be the +standard general-purpose compressor for unix-like systems. This chapter +describes the lessons learned from previous compressors (gzip and +bzip2), and their application to the design of lzip. + +@sp 1 +@section Format design + +When gzip was designed in 1992, computers and operating systems were +much less capable than they are today. Gzip tried to work around some of +those limitations, like 8.3 file names, with additional fields in its +file format. + +Today those limitations have mostly disappeared, and the format of gzip +has proved to be unnecessarily complicated. It includes fields that were +never used, others that have lost its usefulness, and finally others +that have become too limited. + +Bzip2 was designed 5 years later, and its format is in some aspects +simpler than the one of gzip. But bzip2 also shows complexities in its +file format which slow down decompression and, in retrospect, are +unnecessary. + +Probably the worst defect of the gzip format from the point of view of +data safety is the variable size of its header. If the byte at offset 3 +(flags) of a gzip member gets corrupted, it mat become very difficult to +recover the data, even if the compressed blocks are intact, because it +can't be known with certainty where the compressed blocks begin. + +By contrast, the lzma stream in a lzip member always starts at offset 6, +making it trivial to recover the data even if the whole header becomes +corrupt. + +Lzip provides better data recovery capabilities than any other gzip-like +compressor because its format has been designed from the beginning to be +simple and safe. It would be very difficult to write an automatic +recovery tool like lziprecover for the gzip format. And, as far as I +know, it has never been writen. + +The lzip format is designed for long-term archiving. Therefore it +excludes any unneeded features that may interfere with the future +extraction of the uncompressed data. + +@sp 1 +@subsection Gzip format (mis)features not present in lzip + +@table @samp +@item Multiple algorithms + +Gzip provides a CM (Compression Method) field that has never been used +because it is a bad idea to begin with. New compression methods may +require additional fields, making it impossible to implement new methods +and, at the same time, keep the same format. This field does not solve +the problem of format proliferation; it just makes the problem less +obvious. + +@item Optional fields in header + +Unless special precautions are taken, optional fields are generally a +bad idea because they produce a header of variable size. The gzip header +has 2 fields that, in addition to being optional, are zero-terminated. +This means that if any byte inside the field gets zeroed, or if the +terminating zero gets altered, gzip won't be able to find neither the +header CRC nor the compressed blocks. + +Using an optional checksum for the header is not only a bad idea, it is +an error; it may prevent the extraction of perfectly good data. For +example, if the checksum is used and the bit enabling it is reset by a +bit-flip, the header will appear to be intact (in spite of being +corrupt) while the compressed blocks will appear to be totally +unrecoverable (in spite of being intact). Very misleading indeed. + +@end table + +@subsection Lzip format improvements over gzip + +@table @samp +@item 64-bit size field + +Probably the most frequently reported shortcoming of the gzip format is +that it only stores the least significant 32 bits of the uncompressed +size. The size of any file larger than 4 GiB gets truncated. + +The lzip format provides a 64-bit field for the uncompressed size. +Additionaly, lzip produces multi-member output automatically when the +size is too large for a single member, allowing an unlimited +uncompressed size. + +@item Distributed index + +The lzip format provides a distributed index that, among other things, +helps plzip to decompress several times faster than pigz and helps +lziprecover do its job. The gzip format does not provide an index. + +A distributed index is safer and more scalable than a monolithic index. +The monolithic index introduces a single point of failure in the +compressed file and may limit the number of members or the total +uncompressed size. + +@end table + +@section Quality of implementation + +Three related but independent compressor implementations, lzip, clzip +and minilzip/lzlib, are developed concurrently. Every stable release of +any of them is subjected to a hundred hours of intensive testing to +verify that it produces identical output to the other two. This +guarantees that all three implement the same algorithm, and makes it +unlikely that any of them may contain serious undiscovered errors. In +fact, no errors have been discovered in lzip since 2009. + +Just like the lzip format provides 4 factor protection against +undetected data corruption, the development methodology described above +provides 3 factor protection against undetected programming errors in +lzip. + +Lzip automatically uses the smallest possible dictionary size for each +file. In addition to reducing the amount of memory required for +decompression, this feature also minimizes the probability of being +affected by RAM errors during compression. + +Returning a warning status of 2 is a design flaw of compress that leaked +into the design of gzip. Both bzip2 and lzip are free form this flaw. + + @node Examples @chapter A small tutorial with examples @cindex examples @@ -835,6 +970,10 @@ find by running @w{@code{lzip --version}}. #include <cstring> #include <stdint.h> #include <unistd.h> +#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER) +#include <fcntl.h> +#include <io.h> +#endif class State @@ -1220,6 +1359,11 @@ int main( const int argc, const char * const argv[] ) return 0; } +#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER) + setmode( fileno( stdin ), O_BINARY ); + setmode( fileno( stdout ), O_BINARY ); +#endif + for( bool first_member = true; ; first_member = false ) { File_header header; |