summaryrefslogtreecommitdiffstats
path: root/doc/lzip.texi
diff options
context:
space:
mode:
Diffstat (limited to 'doc/lzip.texi')
-rw-r--r--doc/lzip.texi176
1 files changed, 160 insertions, 16 deletions
diff --git a/doc/lzip.texi b/doc/lzip.texi
index 5046140..ac44ee9 100644
--- a/doc/lzip.texi
+++ b/doc/lzip.texi
@@ -6,8 +6,8 @@
@finalout
@c %**end of header
-@set UPDATED 17 April 2015
-@set VERSION 1.17-rc1
+@set UPDATED 25 May 2015
+@set VERSION 1.17-rc2
@dircategory Data Compression
@direntry
@@ -40,6 +40,7 @@ This manual is for Lzip (version @value{VERSION}, @value{UPDATED}).
* Invoking lzip:: Command line interface
* File format:: Detailed format of the compressed file
* Stream format:: Format of the LZMA stream in lzip files
+* Quality assurance:: Design, development and testing of lzip
* Examples:: A small tutorial with examples
* Problems:: Reporting bugs
* Reference source code:: Source code illustrating stream format
@@ -60,8 +61,7 @@ to copy, distribute and modify it.
Lzip is a lossless data compressor with a user interface similar to the
one of gzip or bzip2. Lzip is about as fast as gzip, compresses most
files more than bzip2, and is better than both from a data recovery
-perspective. Lzip is a clean implementation of the LZMA
-(Lempel-Ziv-Markov chain-Algorithm) "algorithm".
+perspective.
The lzip file format is designed for data sharing and long-term
archiving, taking into account both data integrity and decoder
@@ -159,23 +159,24 @@ multivolume compressed tar archives.
Lzip is able to compress and decompress streams of unlimited size by
automatically creating multi-member output. The members so created are
-large, about 64 PiB each.
+large, about 2 PiB each.
@node Algorithm
@chapter Algorithm
@cindex algorithm
-There is no such thing as a "LZMA algorithm"; it is more like a "LZMA
-coding scheme". For example, the option '-0' of lzip uses the scheme in
-almost the simplest way possible; issuing the longest match it can find,
-or a literal byte if it can't find a match. Inversely, a much more
-elaborated way of finding coding sequences of minimum size than the one
-currently used by lzip could be developed, and the resulting sequence
-could also be coded using the LZMA coding scheme.
+In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a
+concrete algorithm; it is more like "any algorithm using the LZMA coding
+scheme". For example, the option '-0' of lzip uses the scheme in almost
+the simplest way possible; issuing the longest match it can find, or a
+literal byte if it can't find a match. Inversely, a much more elaborated
+way of finding coding sequences of minimum size than the one currently
+used by lzip could be developed, and the resulting sequence could also
+be coded using the LZMA coding scheme.
-Lzip currently implements two variants of the LZMA algorithm; fast (used
-by option -0) and normal (used by all other compression levels).
+Lzip currently implements two variants of the LZMA algorithm; fast
+(used by option -0) and normal (used by all other compression levels).
The high compression of LZMA comes from combining two basic, well-proven
compression ideas: sliding dictionaries (LZ77/78) and markov models (the
@@ -242,7 +243,7 @@ lzip [@var{options}] [@var{files}]
Lzip supports the following options:
-@table @samp
+@table @code
@item -h
@itemx --help
Print an informative help message describing the options and exit.
@@ -255,7 +256,7 @@ Print the version number of lzip on the standard output and exit.
@itemx --member-size=@var{bytes}
Set the member size limit to @var{bytes}. A small member size may
degrade compression ratio, so use it only when needed. Valid values
-range from 100 kB to 64 PiB. Defaults to 64 PiB.
+range from 100 kB to 2 PiB. Defaults to 2 PiB.
@item -c
@itemx --stdout
@@ -689,6 +690,140 @@ sequences (matches, repeated matches, and literal bytes), until the "End
Of Stream" marker is decoded.
+@node Quality assurance
+@chapter Design, development and testing of lzip
+@cindex quality assurance
+
+There are two ways of constructing a software design. One way is to make
+it so simple that there are obviously no deficiencies and the other is
+to make it so complicated that there are no obvious deficiencies.@*
+--- C.A.R. Hoare
+
+Lzip has been designed, written and tested with great care to be the
+standard general-purpose compressor for unix-like systems. This chapter
+describes the lessons learned from previous compressors (gzip and
+bzip2), and their application to the design of lzip.
+
+@sp 1
+@section Format design
+
+When gzip was designed in 1992, computers and operating systems were
+much less capable than they are today. Gzip tried to work around some of
+those limitations, like 8.3 file names, with additional fields in its
+file format.
+
+Today those limitations have mostly disappeared, and the format of gzip
+has proved to be unnecessarily complicated. It includes fields that were
+never used, others that have lost its usefulness, and finally others
+that have become too limited.
+
+Bzip2 was designed 5 years later, and its format is in some aspects
+simpler than the one of gzip. But bzip2 also shows complexities in its
+file format which slow down decompression and, in retrospect, are
+unnecessary.
+
+Probably the worst defect of the gzip format from the point of view of
+data safety is the variable size of its header. If the byte at offset 3
+(flags) of a gzip member gets corrupted, it mat become very difficult to
+recover the data, even if the compressed blocks are intact, because it
+can't be known with certainty where the compressed blocks begin.
+
+By contrast, the lzma stream in a lzip member always starts at offset 6,
+making it trivial to recover the data even if the whole header becomes
+corrupt.
+
+Lzip provides better data recovery capabilities than any other gzip-like
+compressor because its format has been designed from the beginning to be
+simple and safe. It would be very difficult to write an automatic
+recovery tool like lziprecover for the gzip format. And, as far as I
+know, it has never been writen.
+
+The lzip format is designed for long-term archiving. Therefore it
+excludes any unneeded features that may interfere with the future
+extraction of the uncompressed data.
+
+@sp 1
+@subsection Gzip format (mis)features not present in lzip
+
+@table @samp
+@item Multiple algorithms
+
+Gzip provides a CM (Compression Method) field that has never been used
+because it is a bad idea to begin with. New compression methods may
+require additional fields, making it impossible to implement new methods
+and, at the same time, keep the same format. This field does not solve
+the problem of format proliferation; it just makes the problem less
+obvious.
+
+@item Optional fields in header
+
+Unless special precautions are taken, optional fields are generally a
+bad idea because they produce a header of variable size. The gzip header
+has 2 fields that, in addition to being optional, are zero-terminated.
+This means that if any byte inside the field gets zeroed, or if the
+terminating zero gets altered, gzip won't be able to find neither the
+header CRC nor the compressed blocks.
+
+Using an optional checksum for the header is not only a bad idea, it is
+an error; it may prevent the extraction of perfectly good data. For
+example, if the checksum is used and the bit enabling it is reset by a
+bit-flip, the header will appear to be intact (in spite of being
+corrupt) while the compressed blocks will appear to be totally
+unrecoverable (in spite of being intact). Very misleading indeed.
+
+@end table
+
+@subsection Lzip format improvements over gzip
+
+@table @samp
+@item 64-bit size field
+
+Probably the most frequently reported shortcoming of the gzip format is
+that it only stores the least significant 32 bits of the uncompressed
+size. The size of any file larger than 4 GiB gets truncated.
+
+The lzip format provides a 64-bit field for the uncompressed size.
+Additionaly, lzip produces multi-member output automatically when the
+size is too large for a single member, allowing an unlimited
+uncompressed size.
+
+@item Distributed index
+
+The lzip format provides a distributed index that, among other things,
+helps plzip to decompress several times faster than pigz and helps
+lziprecover do its job. The gzip format does not provide an index.
+
+A distributed index is safer and more scalable than a monolithic index.
+The monolithic index introduces a single point of failure in the
+compressed file and may limit the number of members or the total
+uncompressed size.
+
+@end table
+
+@section Quality of implementation
+
+Three related but independent compressor implementations, lzip, clzip
+and minilzip/lzlib, are developed concurrently. Every stable release of
+any of them is subjected to a hundred hours of intensive testing to
+verify that it produces identical output to the other two. This
+guarantees that all three implement the same algorithm, and makes it
+unlikely that any of them may contain serious undiscovered errors. In
+fact, no errors have been discovered in lzip since 2009.
+
+Just like the lzip format provides 4 factor protection against
+undetected data corruption, the development methodology described above
+provides 3 factor protection against undetected programming errors in
+lzip.
+
+Lzip automatically uses the smallest possible dictionary size for each
+file. In addition to reducing the amount of memory required for
+decompression, this feature also minimizes the probability of being
+affected by RAM errors during compression.
+
+Returning a warning status of 2 is a design flaw of compress that leaked
+into the design of gzip. Both bzip2 and lzip are free form this flaw.
+
+
@node Examples
@chapter A small tutorial with examples
@cindex examples
@@ -835,6 +970,10 @@ find by running @w{@code{lzip --version}}.
#include <cstring>
#include <stdint.h>
#include <unistd.h>
+#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER)
+#include <fcntl.h>
+#include <io.h>
+#endif
class State
@@ -1220,6 +1359,11 @@ int main( const int argc, const char * const argv[] )
return 0;
}
+#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER)
+ setmode( fileno( stdin ), O_BINARY );
+ setmode( fileno( stdout ), O_BINARY );
+#endif
+
for( bool first_member = true; ; first_member = false )
{
File_header header;