summaryrefslogtreecommitdiffstats
path: root/doc/lzip.info
diff options
context:
space:
mode:
Diffstat (limited to 'doc/lzip.info')
-rw-r--r--doc/lzip.info1708
1 files changed, 1708 insertions, 0 deletions
diff --git a/doc/lzip.info b/doc/lzip.info
new file mode 100644
index 0000000..7c6d812
--- /dev/null
+++ b/doc/lzip.info
@@ -0,0 +1,1708 @@
+This is lzip.info, produced by makeinfo version 4.13+ from lzip.texi.
+
+INFO-DIR-SECTION Compression
+START-INFO-DIR-ENTRY
+* Lzip: (lzip). LZMA lossless data compressor
+END-INFO-DIR-ENTRY
+
+
+File: lzip.info, Node: Top, Next: Introduction, Up: (dir)
+
+Lzip Manual
+***********
+
+This manual is for Lzip (version 1.23, 24 January 2022).
+
+* Menu:
+
+* Introduction:: Purpose and features of lzip
+* Output:: Meaning of lzip's output
+* Invoking lzip:: Command line interface
+* Quality assurance:: Design, development, and testing of lzip
+* Algorithm:: How lzip compresses the data
+* File format:: Detailed format of the compressed file
+* Stream format:: Format of the LZMA stream in lzip files
+* Trailing data:: Extra data appended to the file
+* Examples:: A small tutorial with examples
+* Problems:: Reporting bugs
+* Reference source code:: Source code illustrating stream format
+* Concept index:: Index of concepts
+
+
+ Copyright (C) 2008-2022 Antonio Diaz Diaz.
+
+ This manual is free documentation: you have unlimited permission to copy,
+distribute, and modify it.
+
+
+File: lzip.info, Node: Introduction, Next: Output, Prev: Top, Up: Top
+
+1 Introduction
+**************
+
+Lzip is a lossless data compressor with a user interface similar to the one
+of gzip or bzip2. Lzip uses a simplified form of the 'Lempel-Ziv-Markov
+chain-Algorithm' (LZMA) stream format and provides a 3 factor integrity
+checking to maximize interoperability and optimize safety. Lzip can compress
+about as fast as gzip (lzip -0) or compress most files more than bzip2
+(lzip -9). Decompression speed is intermediate between gzip and bzip2. Lzip
+is better than gzip and bzip2 from a data recovery perspective. Lzip has
+been designed, written, and tested with great care to replace gzip and
+bzip2 as the standard general-purpose compressed format for unix-like
+systems.
+
+ For compressing/decompressing large files on multiprocessor machines
+plzip can be much faster than lzip at the cost of a slightly reduced
+compression ratio. *Note plzip manual: (plzip)Top.
+
+ For creation and manipulation of compressed tar archives tarlz can be
+more efficient than using tar and plzip because tarlz is able to keep the
+alignment between tar members and lzip members. *Note tarlz manual:
+(tarlz)Top.
+
+ The lzip file format is designed for data sharing and long-term
+archiving, taking into account both data integrity and decoder availability:
+
+ * The lzip format provides very safe integrity checking and some data
+ recovery means. The program lziprecover can repair bit flip errors
+ (one of the most common forms of data corruption) in lzip files, and
+ provides data recovery capabilities, including error-checked merging
+ of damaged copies of a file. *Note Data safety: (lziprecover)Data
+ safety.
+
+ * The lzip format is as simple as possible (but not simpler). The lzip
+ manual provides the source code of a simple decompressor along with a
+ detailed explanation of how it works, so that with the only help of the
+ lzip manual it would be possible for a digital archaeologist to extract
+ the data from a lzip file long after quantum computers eventually
+ render LZMA obsolete.
+
+ * Additionally the lzip reference implementation is copylefted, which
+ guarantees that it will remain free forever.
+
+ A nice feature of the lzip format is that a corrupt byte is easier to
+repair the nearer it is from the beginning of the file. Therefore, with the
+help of lziprecover, losing an entire archive just because of a corrupt
+byte near the beginning is a thing of the past.
+
+ The member trailer stores the 32-bit CRC of the original data, the size
+of the original data, and the size of the member. These values, together
+with the "End Of Stream" marker, provide a 3 factor integrity checking
+which guarantees that the decompressed version of the data is identical to
+the original. This guards against corruption of the compressed data, and
+against undetected bugs in lzip (hopefully very unlikely). The chances of
+data corruption going undetected are microscopic. Be aware, though, that
+the check occurs upon decompression, so it can only tell you that something
+is wrong. It can't help you recover the original uncompressed data.
+
+ Lzip uses the same well-defined exit status values used by bzip2, which
+makes it safer than compressors returning ambiguous warning values (like
+gzip) when it is used as a back end for other programs like tar or zutils.
+
+ Lzip will automatically use for each file the largest dictionary size
+that does not exceed neither the file size nor the limit given. Keep in
+mind that the decompression memory requirement is affected at compression
+time by the choice of dictionary size limit.
+
+ The amount of memory required for compression is about 1 or 2 times the
+dictionary size limit (1 if input file size is less than dictionary size
+limit, else 2) plus 9 times the dictionary size really used. The option
+'-0' is special and only requires about 1.5 MiB at most. The amount of
+memory required for decompression is about 46 kB larger than the dictionary
+size really used.
+
+ When compressing, lzip replaces every file given in the command line
+with a compressed version of itself, with the name "original_name.lz". When
+decompressing, lzip attempts to guess the name for the decompressed file
+from that of the compressed file as follows:
+
+filename.lz becomes filename
+filename.tlz becomes filename.tar
+anyothername becomes anyothername.out
+
+ (De)compressing a file is much like copying or moving it. Therefore lzip
+preserves the access and modification dates, permissions, and, when
+possible, ownership of the file just as 'cp -p' does. (If the user ID or
+the group ID can't be duplicated, the file permission bits S_ISUID and
+S_ISGID are cleared).
+
+ Lzip is able to read from some types of non-regular files if either the
+option '-c' or the option '-o' is specified.
+
+ Lzip will refuse to read compressed data from a terminal or write
+compressed data to a terminal, as this would be entirely incomprehensible
+and might leave the terminal in an abnormal state.
+
+ Lzip will correctly decompress a file which is the concatenation of two
+or more compressed files. The result is the concatenation of the
+corresponding decompressed files. Integrity testing of concatenated
+compressed files is also supported.
+
+ Lzip can produce multimember files, and lziprecover can safely recover
+the undamaged members in case of file damage. Lzip can also split the
+compressed output in volumes of a given size, even when reading from
+standard input. This allows the direct creation of multivolume compressed
+tar archives.
+
+ Lzip is able to compress and decompress streams of unlimited size by
+automatically creating multimember output. The members so created are large,
+about 2 PiB each.
+
+
+File: lzip.info, Node: Output, Next: Invoking lzip, Prev: Introduction, Up: Top
+
+2 Meaning of lzip's output
+**************************
+
+The output of lzip looks like this:
+
+ lzip -v foo
+ foo: 6.676:1, 14.98% ratio, 85.02% saved, 450560 in, 67493 out.
+
+ lzip -tvvv foo.lz
+ foo.lz: 6.676:1, 14.98% ratio, 85.02% saved. 450560 out, 67493 in. ok
+
+ The meaning of each field is as follows:
+
+'N:1'
+ The compression ratio (uncompressed_size / compressed_size), shown as
+ N to 1.
+
+'ratio'
+ The inverse compression ratio (compressed_size / uncompressed_size),
+ shown as a percentage. A decimal ratio is easily obtained by moving the
+ decimal point two places to the left; 14.98% = 0.1498.
+
+'saved'
+ The space saved by compression (1 - ratio), shown as a percentage.
+
+'in'
+ Size of the input data. This is the uncompressed size when
+ compressing, or the compressed size when decompressing or testing.
+ Note that lzip always prints the uncompressed size before the
+ compressed size when compressing, decompressing, testing, or listing.
+
+'out'
+ Size of the output data. This is the compressed size when compressing,
+ or the decompressed size when decompressing or testing.
+
+
+ When decompressing or testing at verbosity level 4 (-vvvv), the
+dictionary size used to compress the file and the CRC32 of the uncompressed
+data are also shown.
+
+ LANGUAGE NOTE: Uncompressed = not compressed = plain data; it may never
+have been compressed. Decompressed is used to refer to data which have
+undergone the process of decompression.
+
+
+File: lzip.info, Node: Invoking lzip, Next: Quality assurance, Prev: Output, Up: Top
+
+3 Invoking lzip
+***************
+
+The format for running lzip is:
+
+ lzip [OPTIONS] [FILES]
+
+If no file names are specified, lzip compresses (or decompresses) from
+standard input to standard output. A hyphen '-' used as a FILE argument
+means standard input. It can be mixed with other FILES and is read just
+once, the first time it appears in the command line.
+
+ lzip supports the following options: *Note Argument syntax:
+(arg_parser)Argument syntax.
+
+'-h'
+'--help'
+ Print an informative help message describing the options and exit.
+
+'-V'
+'--version'
+ Print the version number of lzip on the standard output and exit. This
+ version number should be included in all bug reports.
+
+'-a'
+'--trailing-error'
+ Exit with error status 2 if any remaining input is detected after
+ decompressing the last member. Such remaining input is usually trailing
+ garbage that can be safely ignored. *Note concat-example::.
+
+'-b BYTES'
+'--member-size=BYTES'
+ When compressing, set the member size limit to BYTES. It is advisable
+ to keep members smaller than RAM size so that they can be repaired with
+ lziprecover in case of corruption. A small member size may degrade
+ compression ratio, so use it only when needed. Valid values range from
+ 100 kB to 2 PiB. Defaults to 2 PiB.
+
+'-c'
+'--stdout'
+ Compress or decompress to standard output; keep input files unchanged.
+ If compressing several files, each file is compressed independently.
+ (The output consists of a sequence of independently compressed
+ members). This option (or '-o') is needed when reading from a named
+ pipe (fifo) or from a device. Use it also to recover as much of the
+ decompressed data as possible when decompressing a corrupt file. '-c'
+ overrides '-o' and '-S'. '-c' has no effect when testing or listing.
+
+'-d'
+'--decompress'
+ Decompress the files specified. If a file does not exist, can't be
+ opened, or the destination file already exists and '--force' has not
+ been specified, lzip continues decompressing the rest of the files and
+ exits with error status 1. If a file fails to decompress, or is a
+ terminal, lzip exits immediately with error status 2 without
+ decompressing the rest of the files. A terminal is considered an
+ uncompressed file, and therefore invalid.
+
+'-f'
+'--force'
+ Force overwrite of output files.
+
+'-F'
+'--recompress'
+ When compressing, force re-compression of files whose name already has
+ the '.lz' or '.tlz' suffix.
+
+'-k'
+'--keep'
+ Keep (don't delete) input files during compression or decompression.
+
+'-l'
+'--list'
+ Print the uncompressed size, compressed size, and percentage saved of
+ the files specified. Trailing data are ignored. The values produced
+ are correct even for multimember files. If more than one file is
+ given, a final line containing the cumulative sizes is printed. With
+ '-v', the dictionary size, the number of members in the file, and the
+ amount of trailing data (if any) are also printed. With '-vv', the
+ positions and sizes of each member in multimember files are also
+ printed.
+
+ If any file is damaged, does not exist, can't be opened, or is not
+ regular, the final exit status will be > 0. '-lq' can be used to verify
+ quickly (without decompressing) the structural integrity of the files
+ specified. (Use '--test' to verify the data integrity). '-alq'
+ additionally verifies that none of the files specified contain
+ trailing data.
+
+'-m BYTES'
+'--match-length=BYTES'
+ When compressing, set the match length limit in bytes. After a match
+ this long is found, the search is finished. Valid values range from 5
+ to 273. Larger values usually give better compression ratios but longer
+ compression times.
+
+'-o FILE'
+'--output=FILE'
+ If '-c' has not been also specified, write the (de)compressed output to
+ FILE; keep input files unchanged. If compressing several files, each
+ file is compressed independently. (The output consists of a sequence of
+ independently compressed members). This option (or '-c') is needed when
+ reading from a named pipe (fifo) or from a device. '-o -' is
+ equivalent to '-c'. '-o' has no effect when testing or listing.
+
+ In order to keep backward compatibility with lzip versions prior to
+ 1.22, when compressing from standard input and no other file names are
+ given, the extension '.lz' is appended to FILE unless it already ends
+ in '.lz' or '.tlz'. This feature will be removed in a future version
+ of lzip. Meanwhile, redirection may be used instead of '-o' to write
+ the compressed output to a file without the extension '.lz' in its
+ name: 'lzip < file > foo'.
+
+ When compressing and splitting the output in volumes, FILE is used as
+ a prefix, and several files named 'FILE00001.lz', 'FILE00002.lz', etc,
+ are created. In this case, only one input file is allowed.
+
+'-q'
+'--quiet'
+ Quiet operation. Suppress all messages.
+
+'-s BYTES'
+'--dictionary-size=BYTES'
+ When compressing, set the dictionary size limit in bytes. Lzip will use
+ for each file the largest dictionary size that does not exceed neither
+ the file size nor this limit. Valid values range from 4 KiB to
+ 512 MiB. Values 12 to 29 are interpreted as powers of two, meaning
+ 2^12 to 2^29 bytes. Dictionary sizes are quantized so that they can be
+ coded in just one byte (*note coded-dict-size::). If the size specified
+ does not match one of the valid sizes, it will be rounded upwards by
+ adding up to (BYTES / 8) to it.
+
+ For maximum compression you should use a dictionary size limit as large
+ as possible, but keep in mind that the decompression memory requirement
+ is affected at compression time by the choice of dictionary size limit.
+
+'-S BYTES'
+'--volume-size=BYTES'
+ When compressing, and '-c' has not been also specified, split the
+ compressed output into several volume files with names
+ 'original_name00001.lz', 'original_name00002.lz', etc, and set the
+ volume size limit to BYTES. Input files are kept unchanged. Each
+ volume is a complete, maybe multimember, lzip file. A small volume
+ size may degrade compression ratio, so use it only when needed. Valid
+ values range from 100 kB to 4 EiB.
+
+'-t'
+'--test'
+ Check integrity of the files specified, but don't decompress them. This
+ really performs a trial decompression and throws away the result. Use
+ it together with '-v' to see information about the files. If a file
+ fails the test, does not exist, can't be opened, or is a terminal, lzip
+ continues checking the rest of the files. A final diagnostic is shown
+ at verbosity level 1 or higher if any file fails the test when testing
+ multiple files.
+
+'-v'
+'--verbose'
+ Verbose mode.
+ When compressing, show the compression ratio and size for each file
+ processed.
+ When decompressing or testing, further -v's (up to 4) increase the
+ verbosity level, showing status, compression ratio, dictionary size,
+ trailer contents (CRC, data size, member size), and up to 6 bytes of
+ trailing data (if any) both in hexadecimal and as a string of printable
+ ASCII characters.
+ Two or more '-v' options show the progress of (de)compression.
+
+'-0 .. -9'
+ Compression level. Set the compression parameters (dictionary size and
+ match length limit) as shown in the table below. The default
+ compression level is '-6', equivalent to '-s8MiB -m36'. Note that '-9'
+ can be much slower than '-0'. These options have no effect when
+ decompressing, testing, or listing.
+
+ The bidimensional parameter space of LZMA can't be mapped to a linear
+ scale optimal for all files. If your files are large, very repetitive,
+ etc, you may need to use the options '--dictionary-size' and
+ '--match-length' directly to achieve optimal performance.
+
+ If several compression levels or '-s' or '-m' options are given, the
+ last setting is used. For example '-9 -s64MiB' is equivalent to
+ '-s64MiB -m273'
+
+ Level Dictionary size (-s) Match length limit (-m)
+ -0 64 KiB 16 bytes
+ -1 1 MiB 5 bytes
+ -2 1.5 MiB 6 bytes
+ -3 2 MiB 8 bytes
+ -4 3 MiB 12 bytes
+ -5 4 MiB 20 bytes
+ -6 8 MiB 36 bytes
+ -7 16 MiB 68 bytes
+ -8 24 MiB 132 bytes
+ -9 32 MiB 273 bytes
+
+'--fast'
+'--best'
+ Aliases for GNU gzip compatibility.
+
+'--loose-trailing'
+ When decompressing, testing, or listing, allow trailing data whose
+ first bytes are so similar to the magic bytes of a lzip header that
+ they can be confused with a corrupt header. Use this option if a file
+ triggers a "corrupt header" error and the cause is not indeed a
+ corrupt header.
+
+
+ Numbers given as arguments to options may be followed by a multiplier
+and an optional 'B' for "byte".
+
+ Table of SI and binary prefixes (unit multipliers):
+
+Prefix Value | Prefix Value
+k kilobyte (10^3 = 1000) | Ki kibibyte (2^10 = 1024)
+M megabyte (10^6) | Mi mebibyte (2^20)
+G gigabyte (10^9) | Gi gibibyte (2^30)
+T terabyte (10^12) | Ti tebibyte (2^40)
+P petabyte (10^15) | Pi pebibyte (2^50)
+E exabyte (10^18) | Ei exbibyte (2^60)
+Z zettabyte (10^21) | Zi zebibyte (2^70)
+Y yottabyte (10^24) | Yi yobibyte (2^80)
+
+
+ Exit status: 0 for a normal exit, 1 for environmental problems (file not
+found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or invalid
+input file, 3 for an internal consistency error (e.g., bug) which caused
+lzip to panic.
+
+
+File: lzip.info, Node: Quality assurance, Next: Algorithm, Prev: Invoking lzip, Up: Top
+
+4 Design, development, and testing of lzip
+******************************************
+
+There are two ways of constructing a software design: One way is to make it
+so simple that there are obviously no deficiencies and the other way is to
+make it so complicated that there are no obvious deficiencies. The first
+method is far more difficult.
+-- C.A.R. Hoare
+
+ Lzip is developed by volunteers who lack the resources required for
+extensive testing in all circumstances. It is up to you to test lzip before
+using it in mission-critical applications. However, a compressor like lzip
+is not a toy, and maintaining it is not a hobby. Many people's data depend
+on it. Therefore the lzip file format has been reviewed carefully and is
+believed to be free from negligent design errors.
+
+ Lzip has been designed, written, and tested with great care to replace
+gzip and bzip2 as the standard general-purpose compressed format for
+unix-like systems. This chapter describes the lessons learned from these
+previous formats, and their application to the design of lzip.
+
+
+4.1 Format design
+=================
+
+When gzip was designed in 1992, computers and operating systems were much
+less capable than they are today. The designers of gzip tried to work around
+some of those limitations, like 8.3 file names, with additional fields in
+the file format.
+
+ Today those limitations have mostly disappeared, and the format of gzip
+has proved to be unnecessarily complicated. It includes fields that were
+never used, others that have lost their usefulness, and finally others that
+have become too limited.
+
+ Bzip2 was designed 5 years later, and its format is simpler than the one
+of gzip.
+
+ Probably the worst defect of the gzip format from the point of view of
+data safety is the variable size of its header. If the byte at offset 3
+(flags) of a gzip member gets corrupted, it may become difficult to recover
+the data, even if the compressed blocks are intact, because it can't be
+known with certainty where the compressed blocks begin.
+
+ By contrast, the header of a lzip member has a fixed length of 6. The
+LZMA stream in a lzip member always starts at offset 6, making it trivial to
+recover the data even if the whole header becomes corrupt.
+
+ Bzip2 also provides a header of fixed length and marks the begin and end
+of each compressed block with six magic bytes, making it possible to find
+the compressed blocks even in case of file damage. But bzip2 does not store
+the size of each compressed block, as lzip does.
+
+ Lziprecover is able to provide unique data recovery capabilities because
+the lzip format is extraordinarily safe. The simple and safe design of the
+file format complements the embedded error detection provided by the LZMA
+data stream. Any distance larger than the dictionary size acts as a
+forbidden symbol, allowing the decompressor to detect the approximate
+position of errors, and leaving very little work for the check sequence
+(CRC and data sizes) in the detection of errors. Lzip is usually able to
+detect all possible bit flips in the compressed data without resorting to
+the check sequence. It would be difficult to write an automatic recovery
+tool like lziprecover for the gzip format. And, as far as I know, it has
+never been written.
+
+ Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the
+decompressed data because it provides optimal accuracy in the detection of
+errors up to a compressed size of about 16 GiB, a size larger than that of
+most files. In the case of lzip, the additional detection capability of the
+decompressor reduces the probability of undetected errors several million
+times more, resulting in a combined integrity checking optimally accurate
+for any member size produced by lzip. Preliminary results suggest that the
+lzip format is safe enough to be used in critical safety avionics systems.
+
+ The lzip format is designed for long-term archiving. Therefore it
+excludes any unneeded features that may interfere with the future
+extraction of the decompressed data.
+
+
+4.1.1 Gzip format (mis)features not present in lzip
+---------------------------------------------------
+
+'Multiple algorithms'
+ Gzip provides a CM (Compression Method) field that has never been used
+ because it is a bad idea to begin with. New compression methods may
+ require additional fields, making it impossible to implement new
+ methods and, at the same time, keep the same format. This field does
+ not solve the problem of format proliferation; it just makes the
+ problem less obvious.
+
+'Optional fields in header'
+ Unless special precautions are taken, optional fields are generally a
+ bad idea because they produce a header of variable size. The gzip
+ header has 2 fields that, in addition to being optional, are
+ zero-terminated. This means that if any byte inside the field gets
+ zeroed, or if the terminating zero gets altered, gzip won't be able to
+ find neither the header CRC nor the compressed blocks.
+
+'Optional CRC for the header'
+ Using an optional CRC for the header is not only a bad idea, it is an
+ error; it circumvents the Hamming distance (HD) of the CRC and may
+ prevent the extraction of perfectly good data. For example, if the CRC
+ is used and the bit enabling it is reset by a bit flip, the header
+ will appear to be intact (in spite of being corrupt) while the
+ compressed blocks will appear to be totally unrecoverable (in spite of
+ being intact). Very misleading indeed.
+
+'Metadata'
+ The gzip format stores some metadata, like the modification time of the
+ original file or the operating system on which compression took place.
+ This complicates reproducible compression (obtaining identical
+ compressed output from identical input).
+
+
+4.1.2 Lzip format improvements over gzip and bzip2
+--------------------------------------------------
+
+'64-bit size field'
+ Probably the most frequently reported shortcoming of the gzip format
+ is that it only stores the least significant 32 bits of the
+ uncompressed size. The size of any file larger than 4 GiB gets
+ truncated.
+
+ Bzip2 does not store the uncompressed size of the file.
+
+ The lzip format provides a 64-bit field for the uncompressed size.
+ Additionally, lzip produces multimember output automatically when the
+ size is too large for a single member, allowing for an unlimited
+ uncompressed size.
+
+'Distributed index'
+ The lzip format provides a distributed index that, among other things,
+ helps plzip to decompress several times faster than pigz and helps
+ lziprecover do its job. Neither the gzip format nor the bzip2 format
+ do provide an index.
+
+ A distributed index is safer and more scalable than a monolithic
+ index. The monolithic index introduces a single point of failure in
+ the compressed file and may limit the number of members or the total
+ uncompressed size.
+
+
+4.2 Quality of implementation
+=============================
+
+'Accurate and robust error detection'
+ The lzip format provides 3 factor integrity checking, and the
+ decompressors report mismatches in each factor separately. This method
+ detects most false positives for corruption. If just one byte in one
+ factor fails but the other two factors match the data, it probably
+ means that the data are intact and the corruption just affects the
+ mismatching factor (CRC, data size, or member size) in the member
+ trailer.
+
+'Multiple implementations'
+ Just like the lzip format provides 3 factor protection against
+ undetected data corruption, the development methodology of the lzip
+ family of compressors provides 3 factor protection against undetected
+ programming errors.
+
+ Three related but independent compressor implementations, lzip, clzip,
+ and minilzip/lzlib, are developed concurrently. Every stable release
+ of any of them is tested to verify that it produces identical output
+ to the other two. This guarantees that all three implement the same
+ algorithm, and makes it unlikely that any of them may contain serious
+ undiscovered errors. In fact, no errors have been discovered in lzip
+ since 2009.
+
+ Additionally, the three implementations have been extensively tested
+ with unzcrash, valgrind, and 'american fuzzy lop' without finding a
+ single vulnerability or false negative. *Note Unzcrash:
+ (lziprecover)Unzcrash.
+
+'Dictionary size'
+ Lzip automatically adapts the dictionary size to the size of each file.
+ In addition to reducing the amount of memory required for
+ decompression, this feature also minimizes the probability of being
+ affected by RAM errors during compression.
+
+'Exit status'
+ Returning a warning status of 2 is a design flaw of compress that
+ leaked into the design of gzip. Both bzip2 and lzip are free from this
+ flaw.
+
+
+
+File: lzip.info, Node: Algorithm, Next: File format, Prev: Quality assurance, Up: Top
+
+5 Algorithm
+***********
+
+In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a
+concrete algorithm; it is more like "any algorithm using the LZMA coding
+scheme". LZMA compression consists in describing the uncompressed data as a
+succession of coding sequences from the set shown in Section 'What is
+coded' (*note what-is-coded::), and then encoding them using a range
+encoder. For example, the option '-0' of lzip uses the scheme in almost the
+simplest way possible; issuing the longest match it can find, or a literal
+byte if it can't find a match. Inversely, a much more elaborated way of
+finding coding sequences of minimum size than the one currently used by
+lzip could be developed, and the resulting sequence could also be coded
+using the LZMA coding scheme.
+
+ Lzip currently implements two variants of the LZMA algorithm: fast (used
+by option '-0') and normal (used by all other compression levels).
+
+ The high compression of LZMA comes from combining two basic, well-proven
+compression ideas: sliding dictionaries (LZ77/78) and markov models (the
+thing used by every compression algorithm that uses a range encoder or
+similar order-0 entropy coder as its last stage) with segregation of
+contexts according to what the bits are used for.
+
+ Lzip is a two stage compressor. The first stage is a Lempel-Ziv coder,
+which reduces redundancy by translating chunks of data to their
+corresponding distance-length pairs. The second stage is a range encoder
+that uses a different probability model for each type of data: distances,
+lengths, literal bytes, etc.
+
+ Here is how it works, step by step:
+
+ 1) The member header is written to the output stream.
+
+ 2) The first byte is coded literally, because there are no previous
+bytes to which the match finder can refer to.
+
+ 3) The main encoder advances to the next byte in the input data and
+calls the match finder.
+
+ 4) The match finder fills an array with the minimum distances before the
+current byte where a match of a given length can be found.
+
+ 5) Go back to step 3 until a sequence (formed of pairs, repeated
+distances, and literal bytes) of minimum price has been formed. Where the
+price represents the number of output bits produced.
+
+ 6) The range encoder encodes the sequence produced by the main encoder
+and sends the bytes produced to the output stream.
+
+ 7) Go back to step 3 until the input data are finished or until the
+member or volume size limits are reached.
+
+ 8) The range encoder is flushed.
+
+ 9) The member trailer is written to the output stream.
+
+ 10) If there are more data to compress, go back to step 1.
+
+
+ During compression, lzip reads data in large blocks (one dictionary size
+at a time). Therefore it may block for up to tens of seconds any process
+feeding data to it through a pipe. This is normal. The blocking intervals
+get longer with higher compression levels because dictionary size increases
+(and compression speed decreases) with compression level.
+
+The ideas embodied in lzip are due to (at least) the following people:
+Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for the
+definition of Markov chains), G.N.N. Martin (for the definition of range
+encoding), Igor Pavlov (for putting all the above together in LZMA), and
+Julian Seward (for bzip2's CLI).
+
+
+File: lzip.info, Node: File format, Next: Stream format, Prev: Algorithm, Up: Top
+
+6 File format
+*************
+
+Perfection is reached, not when there is no longer anything to add, but
+when there is no longer anything to take away.
+-- Antoine de Saint-Exupery
+
+
+ In the diagram below, a box like this:
+
++---+
+| | <-- the vertical bars might be missing
++---+
+
+ represents one byte; a box like this:
+
++==============+
+| |
++==============+
+
+ represents a variable number of bytes.
+
+
+ A lzip file consists of a series of independent "members" (compressed
+data sets). The members simply appear one after another in the file, with no
+additional information before, between, or after them. Each member can
+encode in compressed form up to 16 EiB - 1 byte of uncompressed data. The
+size of a multimember file is unlimited.
+
+ Each member has the following structure:
+
++--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+| ID string | VN | DS | LZMA stream | CRC32 | Data size | Member size |
++--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
+
+ All multibyte values are stored in little endian order.
+
+'ID string (the "magic" bytes)'
+ A four byte string, identifying the lzip format, with the value "LZIP"
+ (0x4C, 0x5A, 0x49, 0x50).
+
+'VN (version number, 1 byte)'
+ Just in case something needs to be modified in the future. 1 for now.
+
+'DS (coded dictionary size, 1 byte)'
+ The dictionary size is calculated by taking a power of 2 (the base
+ size) and subtracting from it a fraction between 0/16 and 7/16 of the
+ base size.
+ Bits 4-0 contain the base 2 logarithm of the base size (12 to 29).
+ Bits 7-5 contain the numerator of the fraction (0 to 7) to subtract
+ from the base size to obtain the dictionary size.
+ Example: 0xD3 = 2^19 - 6 * 2^15 = 512 KiB - 6 * 32 KiB = 320 KiB
+ Valid values for dictionary size range from 4 KiB to 512 MiB.
+
+'LZMA stream'
+ The LZMA stream, finished by an "End Of Stream" marker. Uses default
+ values for encoder properties. *Note Stream format::, for a complete
+ description.
+
+'CRC32 (4 bytes)'
+ Cyclic Redundancy Check (CRC) of the original uncompressed data.
+
+'Data size (8 bytes)'
+ Size of the original uncompressed data.
+
+'Member size (8 bytes)'
+ Total size of the member, including header and trailer. This field acts
+ as a distributed index, allows the verification of stream integrity,
+ and facilitates the safe recovery of undamaged members from
+ multimember files. Member size should be limited to 2 PiB to prevent
+ the data size field from overflowing.
+
+
+
+File: lzip.info, Node: Stream format, Next: Trailing data, Prev: File format, Up: Top
+
+7 Format of the LZMA stream in lzip files
+*****************************************
+
+The LZMA algorithm has three parameters, called "special LZMA properties",
+to adjust it for some kinds of binary data. These parameters are:
+'literal_context_bits' (with a default value of 3),
+'literal_pos_state_bits' (with a default value of 0), and 'pos_state_bits'
+(with a default value of 2). As a general purpose compressor, lzip only
+uses the default values for these parameters. In particular
+'literal_pos_state_bits' has been optimized away and does not even appear
+in the code.
+
+ Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker (the
+distance-length pair 0xFFFFFFFFU, 2), which in conjunction with the 'member
+size' field in the member trailer allows the verification of stream
+integrity. The EOS marker is the only marker allowed in lzip files. The
+LZMA stream in lzip files always has these two features (default properties
+and EOS marker) and is referred to in this document as LZMA-302eos. This
+simplified form of the LZMA stream format has been chosen to maximize
+interoperability and safety.
+
+ The second stage of LZMA is a range encoder that uses a different
+probability model for each type of symbol: distances, lengths, literal
+bytes, etc. Range encoding conceptually encodes all the symbols of the
+message into one number. Unlike Huffman coding, which assigns to each
+symbol a bit-pattern and concatenates all the bit-patterns together, range
+encoding can compress one symbol to less than one bit. Therefore the
+compressed data produced by a range encoder can't be split in pieces that
+could be described individually.
+
+ It seems that the only way of describing the LZMA-302eos stream is to
+describe the algorithm that decodes it. And given the many details about
+the range decoder that need to be described accurately, the source code of
+a real decompressor seems the only appropriate reference to use.
+
+ What follows is a description of the decoding algorithm for LZMA-302eos
+streams using as reference the source code of "lzd", an educational
+decompressor for lzip files which can be downloaded from the lzip download
+directory. Lzd is written in C++11 and its source code is included in
+appendix A. *Note Reference source code::.
+
+
+7.1 What is coded
+=================
+
+The LZMA stream includes literals, matches, and repeated matches (matches
+reusing a recently used distance). There are 7 different coding sequences:
+
+Bit sequence Name Description
+-----------------------------------------------------------------------------
+0 + byte literal literal byte
+1 + 0 + len + dis match distance-length pair
+1 + 1 + 0 + 0 shortrep 1 byte match at latest used distance
+1 + 1 + 0 + 1 + len rep0 len bytes match at latest used distance
+1 + 1 + 1 + 0 + len rep1 len bytes match at second latest used
+ distance
+1 + 1 + 1 + 1 + 0 + len rep2 len bytes match at third latest used
+ distance
+1 + 1 + 1 + 1 + 1 + len rep3 len bytes match at fourth latest used
+ distance
+
+
+ In the following tables, multibit sequences are coded in normal order,
+from most significant bit (MSB) to least significant bit (LSB), except
+where noted otherwise.
+
+ Lengths (the 'len' in the table above) are coded as follows:
+
+Bit sequence Description
+----------------------------------------------------------------------------
+0 + 3 bits lengths from 2 to 9
+1 + 0 + 3 bits lengths from 10 to 17
+1 + 1 + 8 bits lengths from 18 to 273
+
+
+ The coding of distances is a little more complicated, so I'll begin by
+explaining a simpler version of the encoding.
+
+ Imagine you need to encode a number from 0 to 2^32 - 1, and you want to
+do it in a way that produces shorter codes for the smaller numbers. You may
+first encode the position of the most significant bit that is set to 1,
+which you may find by making a bit scan from the left (from the MSB). A
+position of 0 means that the number is 0 (no bit is set), 1 means the LSB is
+the first bit set (the number is 1), and 32 means the MSB is set (i.e., the
+number is >= 0x80000000). Then, if the position is >= 2, you encode the
+remaining position - 1 bits. Let's call these bits "direct bits" because
+they are coded directly by value instead of indirectly by position.
+
+ The inconvenient of this simple method is that it needs 6 bits to encode
+the position, but it just uses 33 of the 64 possible values, wasting almost
+half of the codes.
+
+ The intelligent trick of LZMA is that it encodes in what it calls a
+"slot" the position of the most significant bit set, along with the value
+of the next bit, using the same 6 bits that would take to encode the
+position alone. This seems to need 66 slots (twice the number of
+positions), but for positions 0 and 1 there is no next bit, so the number
+of slots needed is 64 (0 to 63).
+
+ The 6 bits representing this "slot number" are then context-coded. If
+the distance is >= 4, the remaining bits are encoded as follows.
+'direct_bits' is the amount of remaining bits (from 1 to 30) needed to form
+a complete distance, and is calculated as (slot >> 1) - 1. If a distance
+needs 6 or more direct_bits, the last 4 bits are encoded separately. The
+last piece (all the direct_bits for distances 4 to 127, or the last 4 bits
+for distances >= 128) is context-coded in reverse order (from LSB to MSB).
+For distances >= 128, the 'direct_bits - 4' part is encoded with fixed 0.5
+probability.
+
+Bit sequence Description
+----------------------------------------------------------------------------
+slot distances from 0 to 3
+slot + direct_bits distances from 4 to 127
+slot + (direct_bits - 4) + 4 bits distances from 128 to 2^32 - 1
+
+
+7.2 The coding contexts
+=======================
+
+These contexts ('Bit_model' in the source), are integers or arrays of
+integers representing the probability of the corresponding bit being 0.
+
+ The indices used in these arrays are:
+
+'state'
+ A state machine ('State' in the source) with 12 states (0 to 11),
+ coding the latest 2 to 4 types of sequences processed. The initial
+ state is 0.
+
+'pos_state'
+ Value of the 2 least significant bits of the current position in the
+ decoded data.
+
+'literal_state'
+ Value of the 3 most significant bits of the latest byte decoded.
+
+'len_state'
+ Coded value of the current match length (length - 2), with a maximum
+ of 3. The resulting value is in the range 0 to 3.
+
+
+ The types of previous sequences corresponding to each state are shown in
+the following table. '!literal' is any sequence except a literal byte.
+'rep' is any one of 'rep0', 'rep1', 'rep2', or 'rep3'. The last type in
+each line is the most recent.
+
+State Types of previous sequences
+------------------------------------------------------
+0 literal, literal, literal
+1 match, literal, literal
+2 rep or (!literal, shortrep), literal, literal
+3 literal, shortrep, literal, literal
+4 match, literal
+5 rep or (!literal, shortrep), literal
+6 literal, shortrep, literal
+7 literal, match
+8 literal, rep
+9 literal, shortrep
+10 !literal, match
+11 !literal, (rep or shortrep)
+
+
+ The contexts for decoding the type of coding sequence are:
+
+Name Indices Used when
+----------------------------------------------------------------------------
+bm_match state, pos_state sequence start
+bm_rep state after sequence 1
+bm_rep0 state after sequence 11
+bm_rep1 state after sequence 111
+bm_rep2 state after sequence 1111
+bm_len state, pos_state after sequence 110
+
+
+ The contexts for decoding distances are:
+
+Name Indices Used when
+----------------------------------------------------------------------------
+bm_dis_slot len_state, bit tree distance start
+bm_dis reverse bit tree after slots 4 to 13
+bm_align reverse bit tree for distances >= 128, after fixed
+ probability bits
+
+
+ There are two separate sets of contexts for lengths ('Len_model' in the
+source). One for normal matches, the other for repeated matches. The
+contexts in each Len_model are (see 'decode_len' in the source):
+
+Name Indices Used when
+---------------------------------------------------------------------------
+choice1 none length start
+choice2 none after sequence 1
+bm_low pos_state, bit tree after sequence 0
+bm_mid pos_state, bit tree after sequence 10
+bm_high bit tree after sequence 11
+
+
+ The context array 'bm_literal' is special. In principle it acts as a
+normal bit tree context, the one selected by 'literal_state'. But if the
+previous decoded byte was not a literal, two other bit tree contexts are
+used depending on the value of each bit in 'match_byte' (the byte at the
+latest used distance), until a bit is decoded that is different from its
+corresponding bit in 'match_byte'. After the first difference is found, the
+rest of the byte is decoded using the normal bit tree context. (See
+'decode_matched' in the source).
+
+
+7.3 The range decoder
+=====================
+
+The LZMA stream is consumed one byte at a time by the range decoder. (See
+'normalize' in the source). Every byte consumed produces a variable number
+of decoded bits, depending on how well these bits agree with their context.
+(See 'decode_bit' in the source).
+
+ The range decoder state consists of two unsigned 32-bit variables:
+'range' (representing the most significant part of the range size not yet
+decoded) and 'code' (representing the current point within 'range').
+'range' is initialized to 2^32 - 1, and 'code' is initialized to 0.
+
+ The range encoder produces a first 0 byte that must be ignored by the
+range decoder. This is done by shifting 5 bytes in the initialization of
+'code' instead of 4. (See the 'Range_decoder' constructor in the source).
+
+
+7.4 Decoding and verifying the LZMA stream
+==========================================
+
+After decoding the member header and obtaining the dictionary size, the
+range decoder is initialized and then the LZMA decoder enters a loop (see
+'decode_member' in the source) where it invokes the range decoder with the
+appropriate contexts to decode the different coding sequences (matches,
+repeated matches, and literal bytes), until the "End Of Stream" marker is
+decoded.
+
+ Once the "End Of Stream" marker has been decoded, the decompressor reads
+and decodes the member trailer, and verifies that the three integrity
+factors stored there (CRC, data size, and member size) match those computed
+from the data.
+
+
+File: lzip.info, Node: Trailing data, Next: Examples, Prev: Stream format, Up: Top
+
+8 Extra data appended to the file
+*********************************
+
+Sometimes extra data are found appended to a lzip file after the last
+member. Such trailing data may be:
+
+ * Padding added to make the file size a multiple of some block size, for
+ example when writing to a tape. It is safe to append any amount of
+ padding zero bytes to a lzip file.
+
+ * Useful data added by the user; a cryptographically secure hash, a
+ description of file contents, etc. It is safe to append any amount of
+ text to a lzip file as long as none of the first four bytes of the text
+ match the corresponding byte in the string "LZIP", and the text does
+ not contain any zero bytes (null characters). Nonzero bytes and zero
+ bytes can't be safely mixed in trailing data.
+
+ * Garbage added by some not totally successful copy operation.
+
+ * Malicious data added to the file in order to make its total size and
+ hash value (for a chosen hash) coincide with those of another file.
+
+ * In rare cases, trailing data could be the corrupt header of another
+ member. In multimember or concatenated files the probability of
+ corruption happening in the magic bytes is 5 times smaller than the
+ probability of getting a false positive caused by the corruption of the
+ integrity information itself. Therefore it can be considered to be
+ below the noise level. Additionally, the test used by lzip to
+ discriminate trailing data from a corrupt header has a Hamming
+ distance (HD) of 3, and the 3 bit flips must happen in different magic
+ bytes for the test to fail. In any case, the option '--trailing-error'
+ guarantees that any corrupt header will be detected.
+
+ Trailing data are in no way part of the lzip file format, but tools
+reading lzip files are expected to behave as correctly and usefully as
+possible in the presence of trailing data.
+
+ Trailing data can be safely ignored in most cases. In some cases, like
+that of user-added data, they are expected to be ignored. In those cases
+where a file containing trailing data must be rejected, the option
+'--trailing-error' can be used. *Note --trailing-error::.
+
+
+File: lzip.info, Node: Examples, Next: Problems, Prev: Trailing data, Up: Top
+
+9 A small tutorial with examples
+********************************
+
+WARNING! Even if lzip is bug-free, other causes may result in a corrupt
+compressed file (bugs in the system libraries, memory errors, etc).
+Therefore, if the data you are going to compress are important, give the
+option '--keep' to lzip and don't remove the original file until you verify
+the compressed file with a command like 'lzip -cd file.lz | cmp file -'.
+Most RAM errors happening during compression can only be detected by
+comparing the compressed file with the original because the corruption
+happens before lzip compresses the RAM contents, resulting in a valid
+compressed file containing wrong data.
+
+
+Example 1: Extract all the files from archive 'foo.tar.lz'.
+
+ tar -xf foo.tar.lz
+ or
+ lzip -cd foo.tar.lz | tar -xf -
+
+
+Example 2: Replace a regular file with its compressed version 'file.lz' and
+show the compression ratio.
+
+ lzip -v file
+
+
+Example 3: Like example 2 but the created 'file.lz' is multimember with a
+member size of 1 MiB. The compression ratio is not shown.
+
+ lzip -b 1MiB file
+
+
+Example 4: Restore a regular file from its compressed version 'file.lz'. If
+the operation is successful, 'file.lz' is removed.
+
+ lzip -d file.lz
+
+
+Example 5: Verify the integrity of the compressed file 'file.lz' and show
+status.
+
+ lzip -tv file.lz
+
+
+Example 6: The right way of concatenating the decompressed output of two or
+more compressed files. *Note Trailing data::.
+
+ Don't do this
+ cat file1.lz file2.lz file3.lz | lzip -d -
+ Do this instead
+ lzip -cd file1.lz file2.lz file3.lz
+
+
+Example 7: Decompress 'file.lz' partially until 10 KiB of decompressed data
+are produced.
+
+ lzip -cd file.lz | dd bs=1024 count=10
+
+
+Example 8: Decompress 'file.lz' partially from decompressed byte at offset
+10000 to decompressed byte at offset 14999 (5000 bytes are produced).
+
+ lzip -cd file.lz | dd bs=1000 skip=10 count=5
+
+
+Example 9: Compress a whole device in /dev/sdc and send the output to
+'file.lz'.
+
+ lzip -c /dev/sdc > file.lz
+ or
+ lzip /dev/sdc -o file.lz
+
+
+Example 10: Create a multivolume compressed tar archive with a volume size
+of 1440 KiB.
+
+ tar -c some_directory | lzip -S 1440KiB -o volume_name -
+
+
+Example 11: Extract a multivolume compressed tar archive.
+
+ lzip -cd volume_name*.lz | tar -xf -
+
+
+Example 12: Create a multivolume compressed backup of a large database file
+with a volume size of 650 MB, where each volume is a multimember file with
+a member size of 32 MiB.
+
+ lzip -b 32MiB -S 650MB big_db
+
+
+File: lzip.info, Node: Problems, Next: Reference source code, Prev: Examples, Up: Top
+
+10 Reporting bugs
+*****************
+
+There are probably bugs in lzip. There are certainly errors and omissions
+in this manual. If you report them, they will get fixed. If you don't, no
+one will ever know about them and they will remain unfixed for all
+eternity, if not longer.
+
+ If you find a bug in lzip, please send electronic mail to
+<lzip-bug@nongnu.org>. Include the version number, which you can find by
+running 'lzip --version'.
+
+
+File: lzip.info, Node: Reference source code, Next: Concept index, Prev: Problems, Up: Top
+
+Appendix A Reference source code
+********************************
+
+/* Lzd - Educational decompressor for the lzip format
+ Copyright (C) 2013-2022 Antonio Diaz Diaz.
+
+ This program is free software. Redistribution and use in source and
+ binary forms, with or without modification, are permitted provided
+ that the following conditions are met:
+
+ 1. Redistributions of source code must retain the above copyright
+ notice, this list of conditions, and the following disclaimer.
+
+ 2. Redistributions in binary form must reproduce the above copyright
+ notice, this list of conditions, and the following disclaimer in the
+ documentation and/or other materials provided with the distribution.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
+*/
+/*
+ Exit status: 0 for a normal exit, 1 for environmental problems
+ (file not found, invalid flags, I/O errors, etc), 2 to indicate a
+ corrupt or invalid input file.
+*/
+
+#include <algorithm>
+#include <cerrno>
+#include <cstdio>
+#include <cstdlib>
+#include <cstring>
+#include <stdint.h>
+#include <unistd.h>
+#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__
+#include <fcntl.h>
+#include <io.h>
+#endif
+
+
+class State
+ {
+ int st;
+
+public:
+ enum { states = 12 };
+ State() : st( 0 ) {}
+ int operator()() const { return st; }
+ bool is_char() const { return st < 7; }
+
+ void set_char()
+ {
+ const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 };
+ st = next[st];
+ }
+ void set_match() { st = ( st < 7 ) ? 7 : 10; }
+ void set_rep() { st = ( st < 7 ) ? 8 : 11; }
+ void set_short_rep() { st = ( st < 7 ) ? 9 : 11; }
+ };
+
+
+enum {
+ min_dictionary_size = 1 << 12,
+ max_dictionary_size = 1 << 29,
+ literal_context_bits = 3,
+ literal_pos_state_bits = 0, // not used
+ pos_state_bits = 2,
+ pos_states = 1 << pos_state_bits,
+ pos_state_mask = pos_states - 1,
+
+ len_states = 4,
+ dis_slot_bits = 6,
+ start_dis_model = 4,
+ end_dis_model = 14,
+ modeled_distances = 1 << ( end_dis_model / 2 ), // 128
+ dis_align_bits = 4,
+ dis_align_size = 1 << dis_align_bits,
+
+ len_low_bits = 3,
+ len_mid_bits = 3,
+ len_high_bits = 8,
+ len_low_symbols = 1 << len_low_bits,
+ len_mid_symbols = 1 << len_mid_bits,
+ len_high_symbols = 1 << len_high_bits,
+ max_len_symbols = len_low_symbols + len_mid_symbols + len_high_symbols,
+
+ min_match_len = 2, // must be 2
+
+ bit_model_move_bits = 5,
+ bit_model_total_bits = 11,
+ bit_model_total = 1 << bit_model_total_bits };
+
+struct Bit_model
+ {
+ int probability;
+ Bit_model() : probability( bit_model_total / 2 ) {}
+ };
+
+struct Len_model
+ {
+ Bit_model choice1;
+ Bit_model choice2;
+ Bit_model bm_low[pos_states][len_low_symbols];
+ Bit_model bm_mid[pos_states][len_mid_symbols];
+ Bit_model bm_high[len_high_symbols];
+ };
+
+
+class CRC32
+ {
+ uint32_t data[256]; // Table of CRCs of all 8-bit messages.
+
+public:
+ CRC32()
+ {
+ for( unsigned n = 0; n < 256; ++n )
+ {
+ unsigned c = n;
+ for( int k = 0; k < 8; ++k )
+ { if( c & 1 ) c = 0xEDB88320U ^ ( c >> 1 ); else c >>= 1; }
+ data[n] = c;
+ }
+ }
+
+ void update_buf( uint32_t & crc, const uint8_t * const buffer,
+ const int size ) const
+ {
+ for( int i = 0; i < size; ++i )
+ crc = data[(crc^buffer[i])&0xFF] ^ ( crc >> 8 );
+ }
+ };
+
+const CRC32 crc32;
+
+
+typedef uint8_t Lzip_header[6]; // 0-3 magic bytes
+ // 4 version
+ // 5 coded dictionary size
+typedef uint8_t Lzip_trailer[20];
+ // 0-3 CRC32 of the uncompressed data
+ // 4-11 size of the uncompressed data
+ // 12-19 member size including header and trailer
+
+class Range_decoder
+ {
+ unsigned long long member_pos;
+ uint32_t code;
+ uint32_t range;
+
+public:
+ Range_decoder() : member_pos( 6 ), code( 0 ), range( 0xFFFFFFFFU )
+ {
+ for( int i = 0; i < 5; ++i ) code = ( code << 8 ) | get_byte();
+ }
+
+ uint8_t get_byte() { ++member_pos; return std::getc( stdin ); }
+ unsigned long long member_position() const { return member_pos; }
+
+ unsigned decode( const int num_bits )
+ {
+ unsigned symbol = 0;
+ for( int i = num_bits; i > 0; --i )
+ {
+ range >>= 1;
+ symbol <<= 1;
+ if( code >= range ) { code -= range; symbol |= 1; }
+ if( range <= 0x00FFFFFFU ) // normalize
+ { range <<= 8; code = ( code << 8 ) | get_byte(); }
+ }
+ return symbol;
+ }
+
+ unsigned decode_bit( Bit_model & bm )
+ {
+ unsigned symbol;
+ const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability;
+ if( code < bound )
+ {
+ range = bound;
+ bm.probability +=
+ ( bit_model_total - bm.probability ) >> bit_model_move_bits;
+ symbol = 0;
+ }
+ else
+ {
+ range -= bound;
+ code -= bound;
+ bm.probability -= bm.probability >> bit_model_move_bits;
+ symbol = 1;
+ }
+ if( range <= 0x00FFFFFFU ) // normalize
+ { range <<= 8; code = ( code << 8 ) | get_byte(); }
+ return symbol;
+ }
+
+ unsigned decode_tree( Bit_model bm[], const int num_bits )
+ {
+ unsigned symbol = 1;
+ for( int i = 0; i < num_bits; ++i )
+ symbol = ( symbol << 1 ) | decode_bit( bm[symbol] );
+ return symbol - ( 1 << num_bits );
+ }
+
+ unsigned decode_tree_reversed( Bit_model bm[], const int num_bits )
+ {
+ unsigned symbol = decode_tree( bm, num_bits );
+ unsigned reversed_symbol = 0;
+ for( int i = 0; i < num_bits; ++i )
+ {
+ reversed_symbol = ( reversed_symbol << 1 ) | ( symbol & 1 );
+ symbol >>= 1;
+ }
+ return reversed_symbol;
+ }
+
+ unsigned decode_matched( Bit_model bm[], const unsigned match_byte )
+ {
+ unsigned symbol = 1;
+ for( int i = 7; i >= 0; --i )
+ {
+ const unsigned match_bit = ( match_byte >> i ) & 1;
+ const unsigned bit = decode_bit( bm[symbol+(match_bit<<8)+0x100] );
+ symbol = ( symbol << 1 ) | bit;
+ if( match_bit != bit )
+ {
+ while( symbol < 0x100 )
+ symbol = ( symbol << 1 ) | decode_bit( bm[symbol] );
+ break;
+ }
+ }
+ return symbol & 0xFF;
+ }
+
+ unsigned decode_len( Len_model & lm, const int pos_state )
+ {
+ if( decode_bit( lm.choice1 ) == 0 )
+ return decode_tree( lm.bm_low[pos_state], len_low_bits );
+ if( decode_bit( lm.choice2 ) == 0 )
+ return len_low_symbols +
+ decode_tree( lm.bm_mid[pos_state], len_mid_bits );
+ return len_low_symbols + len_mid_symbols +
+ decode_tree( lm.bm_high, len_high_bits );
+ }
+ };
+
+
+class LZ_decoder
+ {
+ unsigned long long partial_data_pos;
+ Range_decoder rdec;
+ const unsigned dictionary_size;
+ uint8_t * const buffer; // output buffer
+ unsigned pos; // current pos in buffer
+ unsigned stream_pos; // first byte not yet written to stdout
+ uint32_t crc_;
+ bool pos_wrapped;
+
+ void flush_data();
+
+ uint8_t peek( const unsigned distance ) const
+ {
+ if( pos > distance ) return buffer[pos - distance - 1];
+ if( pos_wrapped ) return buffer[dictionary_size + pos - distance - 1];
+ return 0; // prev_byte of first byte
+ }
+
+ void put_byte( const uint8_t b )
+ {
+ buffer[pos] = b;
+ if( ++pos >= dictionary_size ) flush_data();
+ }
+
+public:
+ explicit LZ_decoder( const unsigned dict_size )
+ :
+ partial_data_pos( 0 ),
+ dictionary_size( dict_size ),
+ buffer( new uint8_t[dictionary_size] ),
+ pos( 0 ),
+ stream_pos( 0 ),
+ crc_( 0xFFFFFFFFU ),
+ pos_wrapped( false )
+ {}
+
+ ~LZ_decoder() { delete[] buffer; }
+
+ unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; }
+ unsigned long long data_position() const
+ { return partial_data_pos + pos; }
+ uint8_t get_byte() { return rdec.get_byte(); }
+ unsigned long long member_position() const
+ { return rdec.member_position(); }
+
+ bool decode_member();
+ };
+
+
+void LZ_decoder::flush_data()
+ {
+ if( pos > stream_pos )
+ {
+ const unsigned size = pos - stream_pos;
+ crc32.update_buf( crc_, buffer + stream_pos, size );
+ if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size )
+ { std::fprintf( stderr, "Write error: %s\n", std::strerror( errno ) );
+ std::exit( 1 ); }
+ if( pos >= dictionary_size )
+ { partial_data_pos += pos; pos = 0; pos_wrapped = true; }
+ stream_pos = pos;
+ }
+ }
+
+
+bool LZ_decoder::decode_member() // Returns false if error
+ {
+ Bit_model bm_literal[1<<literal_context_bits][0x300];
+ Bit_model bm_match[State::states][pos_states];
+ Bit_model bm_rep[State::states];
+ Bit_model bm_rep0[State::states];
+ Bit_model bm_rep1[State::states];
+ Bit_model bm_rep2[State::states];
+ Bit_model bm_len[State::states][pos_states];
+ Bit_model bm_dis_slot[len_states][1<<dis_slot_bits];
+ Bit_model bm_dis[modeled_distances-end_dis_model+1];
+ Bit_model bm_align[dis_align_size];
+ Len_model match_len_model;
+ Len_model rep_len_model;
+ unsigned rep0 = 0; // rep[0-3] latest four distances
+ unsigned rep1 = 0; // used for efficient coding of
+ unsigned rep2 = 0; // repeated distances
+ unsigned rep3 = 0;
+ State state;
+
+ while( !std::feof( stdin ) && !std::ferror( stdin ) )
+ {
+ const int pos_state = data_position() & pos_state_mask;
+ if( rdec.decode_bit( bm_match[state()][pos_state] ) == 0 ) // 1st bit
+ {
+ // literal byte
+ const uint8_t prev_byte = peek( 0 );
+ const int literal_state = prev_byte >> ( 8 - literal_context_bits );
+ Bit_model * const bm = bm_literal[literal_state];
+ if( state.is_char() )
+ put_byte( rdec.decode_tree( bm, 8 ) );
+ else
+ put_byte( rdec.decode_matched( bm, peek( rep0 ) ) );
+ state.set_char();
+ continue;
+ }
+ // match or repeated match
+ int len;
+ if( rdec.decode_bit( bm_rep[state()] ) != 0 ) // 2nd bit
+ {
+ if( rdec.decode_bit( bm_rep0[state()] ) == 0 ) // 3rd bit
+ {
+ if( rdec.decode_bit( bm_len[state()][pos_state] ) == 0 ) // 4th bit
+ { state.set_short_rep(); put_byte( peek( rep0 ) ); continue; }
+ }
+ else
+ {
+ unsigned distance;
+ if( rdec.decode_bit( bm_rep1[state()] ) == 0 ) // 4th bit
+ distance = rep1;
+ else
+ {
+ if( rdec.decode_bit( bm_rep2[state()] ) == 0 ) // 5th bit
+ distance = rep2;
+ else
+ { distance = rep3; rep3 = rep2; }
+ rep2 = rep1;
+ }
+ rep1 = rep0;
+ rep0 = distance;
+ }
+ state.set_rep();
+ len = min_match_len + rdec.decode_len( rep_len_model, pos_state );
+ }
+ else // match
+ {
+ rep3 = rep2; rep2 = rep1; rep1 = rep0;
+ len = min_match_len + rdec.decode_len( match_len_model, pos_state );
+ const int len_state = std::min( len - min_match_len, len_states - 1 );
+ rep0 = rdec.decode_tree( bm_dis_slot[len_state], dis_slot_bits );
+ if( rep0 >= start_dis_model )
+ {
+ const unsigned dis_slot = rep0;
+ const int direct_bits = ( dis_slot >> 1 ) - 1;
+ rep0 = ( 2 | ( dis_slot & 1 ) ) << direct_bits;
+ if( dis_slot < end_dis_model )
+ rep0 += rdec.decode_tree_reversed( bm_dis + ( rep0 - dis_slot ),
+ direct_bits );
+ else
+ {
+ rep0 +=
+ rdec.decode( direct_bits - dis_align_bits ) << dis_align_bits;
+ rep0 += rdec.decode_tree_reversed( bm_align, dis_align_bits );
+ if( rep0 == 0xFFFFFFFFU ) // marker found
+ {
+ flush_data();
+ return ( len == min_match_len ); // End Of Stream marker
+ }
+ }
+ }
+ state.set_match();
+ if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) )
+ { flush_data(); return false; }
+ }
+ for( int i = 0; i < len; ++i ) put_byte( peek( rep0 ) );
+ }
+ flush_data();
+ return false;
+ }
+
+
+int main( const int argc, const char * const argv[] )
+ {
+ if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) )
+ {
+ std::printf(
+ "Lzd %s - Educational decompressor for the lzip format.\n"
+ "Study the source to learn how a lzip decompressor works.\n"
+ "See the lzip manual for an explanation of the code.\n"
+ "\nUsage: %s [-d] < file.lz > file\n"
+ "Lzd decompresses from standard input to standard output.\n"
+ "\nCopyright (C) 2022 Antonio Diaz Diaz.\n"
+ "License 2-clause BSD.\n"
+ "This is free software: you are free to change and redistribute it.\n"
+ "There is NO WARRANTY, to the extent permitted by law.\n"
+ "Report bugs to lzip-bug@nongnu.org\n"
+ "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n",
+ PROGVERSION, argv[0] );
+ return 0;
+ }
+
+#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__
+ setmode( STDIN_FILENO, O_BINARY );
+ setmode( STDOUT_FILENO, O_BINARY );
+#endif
+
+ for( bool first_member = true; ; first_member = false )
+ {
+ Lzip_header header; // verify header
+ for( int i = 0; i < 6; ++i ) header[i] = std::getc( stdin );
+ if( std::feof( stdin ) || std::memcmp( header, "LZIP\x01", 5 ) != 0 )
+ {
+ if( first_member )
+ { std::fputs( "Bad magic number (file not in lzip format).\n",
+ stderr ); return 2; }
+ break; // ignore trailing data
+ }
+ unsigned dict_size = 1 << ( header[5] & 0x1F );
+ dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 );
+ if( dict_size < min_dictionary_size || dict_size > max_dictionary_size )
+ { std::fputs( "Invalid dictionary size in member header.\n", stderr );
+ return 2; }
+
+ LZ_decoder decoder( dict_size ); // decode LZMA stream
+ if( !decoder.decode_member() )
+ { std::fputs( "Data error\n", stderr ); return 2; }
+
+ Lzip_trailer trailer; // verify trailer
+ for( int i = 0; i < 20; ++i ) trailer[i] = decoder.get_byte();
+ int retval = 0;
+ unsigned crc = 0;
+ for( int i = 3; i >= 0; --i ) crc = ( crc << 8 ) + trailer[i];
+ if( crc != decoder.crc() )
+ { std::fputs( "CRC mismatch\n", stderr ); retval = 2; }
+
+ unsigned long long data_size = 0;
+ for( int i = 11; i >= 4; --i )
+ data_size = ( data_size << 8 ) + trailer[i];
+ if( data_size != decoder.data_position() )
+ { std::fputs( "Data size mismatch\n", stderr ); retval = 2; }
+
+ unsigned long long member_size = 0;
+ for( int i = 19; i >= 12; --i )
+ member_size = ( member_size << 8 ) + trailer[i];
+ if( member_size != decoder.member_position() )
+ { std::fputs( "Member size mismatch\n", stderr ); retval = 2; }
+ if( retval ) return retval;
+ }
+
+ if( std::fclose( stdout ) != 0 )
+ { std::fprintf( stderr, "Error closing stdout: %s\n",
+ std::strerror( errno ) ); return 1; }
+ return 0;
+ }
+
+
+File: lzip.info, Node: Concept index, Prev: Reference source code, Up: Top
+
+Concept index
+*************
+
+
+* Menu:
+
+* algorithm: Algorithm. (line 6)
+* bugs: Problems. (line 6)
+* examples: Examples. (line 6)
+* file format: File format. (line 6)
+* format of the LZMA stream: Stream format. (line 6)
+* getting help: Problems. (line 6)
+* introduction: Introduction. (line 6)
+* invoking: Invoking lzip. (line 6)
+* options: Invoking lzip. (line 6)
+* output: Output. (line 6)
+* quality assurance: Quality assurance. (line 6)
+* reference source code: Reference source code. (line 6)
+* trailing data: Trailing data. (line 6)
+* usage: Invoking lzip. (line 6)
+* version: Invoking lzip. (line 6)
+
+
+
+Tag Table:
+Node: Top203
+Node: Introduction1198
+Node: Output6972
+Node: Invoking lzip8567
+Ref: --trailing-error9356
+Node: Quality assurance18682
+Node: Algorithm27705
+Node: File format31109
+Ref: coded-dict-size32538
+Node: Stream format33773
+Ref: what-is-coded36169
+Node: Trailing data45097
+Node: Examples47358
+Ref: concat-example48800
+Node: Problems50021
+Node: Reference source code50553
+Node: Concept index65411
+
+End Tag Table
+
+
+Local Variables:
+coding: iso-8859-15
+End: