summaryrefslogtreecommitdiffstats
path: root/doc/lzip.info
diff options
context:
space:
mode:
Diffstat (limited to 'doc/lzip.info')
-rw-r--r--doc/lzip.info199
1 files changed, 174 insertions, 25 deletions
diff --git a/doc/lzip.info b/doc/lzip.info
index 53f7c55..6854503 100644
--- a/doc/lzip.info
+++ b/doc/lzip.info
@@ -11,7 +11,7 @@ File: lzip.info, Node: Top, Next: Introduction, Up: (dir)
Lzip Manual
***********
-This manual is for Lzip (version 1.17-rc1, 17 April 2015).
+This manual is for Lzip (version 1.17-rc2, 25 May 2015).
* Menu:
@@ -20,6 +20,7 @@ This manual is for Lzip (version 1.17-rc1, 17 April 2015).
* Invoking lzip:: Command line interface
* File format:: Detailed format of the compressed file
* Stream format:: Format of the LZMA stream in lzip files
+* Quality assurance:: Design, development and testing of lzip
* Examples:: A small tutorial with examples
* Problems:: Reporting bugs
* Reference source code:: Source code illustrating stream format
@@ -40,8 +41,7 @@ File: lzip.info, Node: Introduction, Next: Algorithm, Prev: Top, Up: Top
Lzip is a lossless data compressor with a user interface similar to the
one of gzip or bzip2. Lzip is about as fast as gzip, compresses most
files more than bzip2, and is better than both from a data recovery
-perspective. Lzip is a clean implementation of the LZMA
-(Lempel-Ziv-Markov chain-Algorithm) "algorithm".
+perspective.
The lzip file format is designed for data sharing and long-term
archiving, taking into account both data integrity and decoder
@@ -133,7 +133,7 @@ multivolume compressed tar archives.
Lzip is able to compress and decompress streams of unlimited size by
automatically creating multi-member output. The members so created are
-large, about 64 PiB each.
+large, about 2 PiB each.

File: lzip.info, Node: Algorithm, Next: Invoking lzip, Prev: Introduction, Up: Top
@@ -141,13 +141,14 @@ File: lzip.info, Node: Algorithm, Next: Invoking lzip, Prev: Introduction, U
2 Algorithm
***********
-There is no such thing as a "LZMA algorithm"; it is more like a "LZMA
-coding scheme". For example, the option '-0' of lzip uses the scheme in
-almost the simplest way possible; issuing the longest match it can find,
-or a literal byte if it can't find a match. Inversely, a much more
-elaborated way of finding coding sequences of minimum size than the one
-currently used by lzip could be developed, and the resulting sequence
-could also be coded using the LZMA coding scheme.
+In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a
+concrete algorithm; it is more like "any algorithm using the LZMA coding
+scheme". For example, the option '-0' of lzip uses the scheme in almost
+the simplest way possible; issuing the longest match it can find, or a
+literal byte if it can't find a match. Inversely, a much more elaborated
+way of finding coding sequences of minimum size than the one currently
+used by lzip could be developed, and the resulting sequence could also
+be coded using the LZMA coding scheme.
Lzip currently implements two variants of the LZMA algorithm; fast
(used by option -0) and normal (used by all other compression levels).
@@ -224,7 +225,7 @@ The format for running lzip is:
'--member-size=BYTES'
Set the member size limit to BYTES. A small member size may
degrade compression ratio, so use it only when needed. Valid values
- range from 100 kB to 64 PiB. Defaults to 64 PiB.
+ range from 100 kB to 2 PiB. Defaults to 2 PiB.
'-c'
'--stdout'
@@ -432,7 +433,7 @@ additional information before, between, or after them.

-File: lzip.info, Node: Stream format, Next: Examples, Prev: File format, Up: Top
+File: lzip.info, Node: Stream format, Next: Quality assurance, Prev: File format, Up: Top
5 Format of the LZMA stream in lzip files
*****************************************
@@ -645,9 +646,146 @@ with the appropriate contexts to decode the different coding sequences
Stream" marker is decoded.

-File: lzip.info, Node: Examples, Next: Problems, Prev: Stream format, Up: Top
+File: lzip.info, Node: Quality assurance, Next: Examples, Prev: Stream format, Up: Top
-6 A small tutorial with examples
+6 Design, development and testing of lzip
+*****************************************
+
+There are two ways of constructing a software design. One way is to make
+it so simple that there are obviously no deficiencies and the other is
+to make it so complicated that there are no obvious deficiencies.
+-- C.A.R. Hoare
+
+ Lzip has been designed, written and tested with great care to be the
+standard general-purpose compressor for unix-like systems. This chapter
+describes the lessons learned from previous compressors (gzip and
+bzip2), and their application to the design of lzip.
+
+
+6.1 Format design
+=================
+
+When gzip was designed in 1992, computers and operating systems were
+much less capable than they are today. Gzip tried to work around some of
+those limitations, like 8.3 file names, with additional fields in its
+file format.
+
+ Today those limitations have mostly disappeared, and the format of
+gzip has proved to be unnecessarily complicated. It includes fields
+that were never used, others that have lost its usefulness, and finally
+others that have become too limited.
+
+ Bzip2 was designed 5 years later, and its format is in some aspects
+simpler than the one of gzip. But bzip2 also shows complexities in its
+file format which slow down decompression and, in retrospect, are
+unnecessary.
+
+ Probably the worst defect of the gzip format from the point of view
+of data safety is the variable size of its header. If the byte at
+offset 3 (flags) of a gzip member gets corrupted, it mat become very
+difficult to recover the data, even if the compressed blocks are
+intact, because it can't be known with certainty where the compressed
+blocks begin.
+
+ By contrast, the lzma stream in a lzip member always starts at
+offset 6, making it trivial to recover the data even if the whole
+header becomes corrupt.
+
+ Lzip provides better data recovery capabilities than any other
+gzip-like compressor because its format has been designed from the
+beginning to be simple and safe. It would be very difficult to write an
+automatic recovery tool like lziprecover for the gzip format. And, as
+far as I know, it has never been writen.
+
+ The lzip format is designed for long-term archiving. Therefore it
+excludes any unneeded features that may interfere with the future
+extraction of the uncompressed data.
+
+
+6.1.1 Gzip format (mis)features not present in lzip
+---------------------------------------------------
+
+'Multiple algorithms'
+ Gzip provides a CM (Compression Method) field that has never been
+ used because it is a bad idea to begin with. New compression
+ methods may require additional fields, making it impossible to
+ implement new methods and, at the same time, keep the same format.
+ This field does not solve the problem of format proliferation; it
+ just makes the problem less obvious.
+
+'Optional fields in header'
+ Unless special precautions are taken, optional fields are
+ generally a bad idea because they produce a header of variable
+ size. The gzip header has 2 fields that, in addition to being
+ optional, are zero-terminated. This means that if any byte inside
+ the field gets zeroed, or if the terminating zero gets altered,
+ gzip won't be able to find neither the header CRC nor the
+ compressed blocks.
+
+ Using an optional checksum for the header is not only a bad idea,
+ it is an error; it may prevent the extraction of perfectly good
+ data. For example, if the checksum is used and the bit enabling it
+ is reset by a bit-flip, the header will appear to be intact (in
+ spite of being corrupt) while the compressed blocks will appear to
+ be totally unrecoverable (in spite of being intact). Very
+ misleading indeed.
+
+
+6.1.2 Lzip format improvements over gzip
+----------------------------------------
+
+'64-bit size field'
+ Probably the most frequently reported shortcoming of the gzip
+ format is that it only stores the least significant 32 bits of the
+ uncompressed size. The size of any file larger than 4 GiB gets
+ truncated.
+
+ The lzip format provides a 64-bit field for the uncompressed size.
+ Additionaly, lzip produces multi-member output automatically when
+ the size is too large for a single member, allowing an unlimited
+ uncompressed size.
+
+'Distributed index'
+ The lzip format provides a distributed index that, among other
+ things, helps plzip to decompress several times faster than pigz
+ and helps lziprecover do its job. The gzip format does not provide
+ an index.
+
+ A distributed index is safer and more scalable than a monolithic
+ index. The monolithic index introduces a single point of failure
+ in the compressed file and may limit the number of members or the
+ total uncompressed size.
+
+
+6.2 Quality of implementation
+=============================
+
+Three related but independent compressor implementations, lzip, clzip
+and minilzip/lzlib, are developed concurrently. Every stable release of
+any of them is subjected to a hundred hours of intensive testing to
+verify that it produces identical output to the other two. This
+guarantees that all three implement the same algorithm, and makes it
+unlikely that any of them may contain serious undiscovered errors. In
+fact, no errors have been discovered in lzip since 2009.
+
+ Just like the lzip format provides 4 factor protection against
+undetected data corruption, the development methodology described above
+provides 3 factor protection against undetected programming errors in
+lzip.
+
+ Lzip automatically uses the smallest possible dictionary size for
+each file. In addition to reducing the amount of memory required for
+decompression, this feature also minimizes the probability of being
+affected by RAM errors during compression.
+
+ Returning a warning status of 2 is a design flaw of compress that
+leaked into the design of gzip. Both bzip2 and lzip are free form this
+flaw.
+
+
+File: lzip.info, Node: Examples, Next: Problems, Prev: Quality assurance, Up: Top
+
+7 A small tutorial with examples
********************************
WARNING! Even if lzip is bug-free, other causes may result in a corrupt
@@ -720,7 +858,7 @@ file with a member size of 32 MiB.

File: lzip.info, Node: Problems, Next: Reference source code, Prev: Examples, Up: Top
-7 Reporting bugs
+8 Reporting bugs
****************
There are probably bugs in lzip. There are certainly errors and
@@ -761,6 +899,10 @@ Appendix A Reference source code
#include <cstring>
#include <stdint.h>
#include <unistd.h>
+#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER)
+#include <fcntl.h>
+#include <io.h>
+#endif
class State
@@ -1146,6 +1288,11 @@ int main( const int argc, const char * const argv[] )
return 0;
}
+#if defined(__MSVCRT__) || defined(__OS2__) || defined(_MSC_VER)
+ setmode( fileno( stdin ), O_BINARY );
+ setmode( fileno( stdout ), O_BINARY );
+#endif
+
for( bool first_member = true; ; first_member = false )
{
File_header header;
@@ -1201,6 +1348,7 @@ Concept index
* introduction: Introduction. (line 6)
* invoking: Invoking lzip. (line 6)
* options: Invoking lzip. (line 6)
+* quality assurance: Quality assurance. (line 6)
* reference source code: Reference source code. (line 6)
* usage: Invoking lzip. (line 6)
* version: Invoking lzip. (line 6)
@@ -1209,15 +1357,16 @@ Concept index

Tag Table:
Node: Top208
-Node: Introduction1025
-Node: Algorithm6036
-Node: Invoking lzip8793
-Node: File format14383
-Node: Stream format16768
-Node: Examples26200
-Node: Problems28157
-Node: Reference source code28687
-Node: Concept index42024
+Node: Introduction1090
+Node: Algorithm6008
+Node: Invoking lzip8833
+Node: File format14421
+Node: Stream format16806
+Node: Quality assurance26247
+Node: Examples32269
+Node: Problems34230
+Node: Reference source code34760
+Node: Concept index48358

End Tag Table