summaryrefslogtreecommitdiffstats
path: root/doc/lzip.texinfo
diff options
context:
space:
mode:
authorDaniel Baumann <mail@daniel-baumann.ch>2015-11-07 10:02:58 +0000
committerDaniel Baumann <mail@daniel-baumann.ch>2015-11-07 10:02:58 +0000
commit9761fd14b899e8dd9a7f7d5fa10f21edc7f2679d (patch)
tree364cb7aab4247b6894ef58c85317737945e21a0c /doc/lzip.texinfo
parentAdding debian version 1.15-1. (diff)
downloadlzip-9761fd14b899e8dd9a7f7d5fa10f21edc7f2679d.tar.xz
lzip-9761fd14b899e8dd9a7f7d5fa10f21edc7f2679d.zip
Merging upstream version 1.16~pre1.
Signed-off-by: Daniel Baumann <mail@daniel-baumann.ch>
Diffstat (limited to 'doc/lzip.texinfo')
-rw-r--r--doc/lzip.texinfo1246
1 files changed, 0 insertions, 1246 deletions
diff --git a/doc/lzip.texinfo b/doc/lzip.texinfo
deleted file mode 100644
index cfc9138..0000000
--- a/doc/lzip.texinfo
+++ /dev/null
@@ -1,1246 +0,0 @@
-\input texinfo @c -*-texinfo-*-
-@c %**start of header
-@setfilename lzip.info
-@documentencoding ISO-8859-15
-@settitle Lzip Manual
-@finalout
-@c %**end of header
-
-@set UPDATED 20 September 2013
-@set VERSION 1.15
-
-@dircategory Data Compression
-@direntry
-* Lzip: (lzip). LZMA lossless data compressor
-@end direntry
-
-
-@ifnothtml
-@titlepage
-@title Lzip
-@subtitle LZMA lossless data compressor
-@subtitle for Lzip version @value{VERSION}, @value{UPDATED}
-@author by Antonio Diaz Diaz
-
-@page
-@vskip 0pt plus 1filll
-@end titlepage
-
-@contents
-@end ifnothtml
-
-@node Top
-@top
-
-This manual is for Lzip (version @value{VERSION}, @value{UPDATED}).
-
-@menu
-* Introduction:: Purpose and features of lzip
-* Algorithm:: How lzip compresses the data
-* Invoking lzip:: Command line interface
-* File format:: Detailed format of the compressed file
-* Stream format:: Format of the LZMA stream in lzip files
-* Examples:: A small tutorial with examples
-* Problems:: Reporting bugs
-* Reference source code:: Source code illustrating stream format
-* Concept index:: Index of concepts
-@end menu
-
-@sp 1
-Copyright @copyright{} 2008, 2009, 2010, 2011, 2012, 2013
-Antonio Diaz Diaz.
-
-This manual is free documentation: you have unlimited permission
-to copy, distribute and modify it.
-
-
-@node Introduction
-@chapter Introduction
-@cindex introduction
-
-Lzip is a lossless data compressor with a user interface similar to the
-one of gzip or bzip2. Lzip decompresses almost as fast as gzip and
-compresses more than bzip2, which makes it well suited for software
-distribution and data archiving. Lzip is a clean implementation of the
-LZMA algorithm.
-
-The lzip file format is designed for long-term data archiving and
-provides very safe integrity checking. The member trailer stores the
-32-bit CRC of the original data, the size of the original data and the
-size of the member. These values, together with the value remaining in
-the range decoder and the end-of-stream marker, provide a 4 factor
-integrity checking which guarantees that the decompressed version of the
-data is identical to the original. This guards against corruption of the
-compressed data, and against undetected bugs in lzip (hopefully very
-unlikely). The chances of data corruption going undetected are
-microscopic. Be aware, though, that the check occurs upon decompression,
-so it can only tell you that something is wrong. It can't help you
-recover the original uncompressed data.
-
-If you ever need to recover data from a damaged lzip file, try the
-lziprecover program. Lziprecover makes lzip files resistant to bit-flip
-(one of the most common forms of data corruption), and provides data
-recovery capabilities, including error-checked merging of damaged copies
-of a file.
-
-Lzip uses the same well-defined exit status values used by bzip2, which
-makes it safer than compressors returning ambiguous warning values (like
-gzip) when it is used as a back end for tar or zutils.
-
-Lzip replaces every file given in the command line with a compressed
-version of itself, with the name "original_name.lz". Each compressed
-file has the same modification date, permissions, and, when possible,
-ownership as the corresponding original, so that these properties can be
-correctly restored at decompression time. Lzip is able to read from some
-types of non regular files if the @samp{--stdout} option is specified.
-
-If no file names are specified, lzip compresses (or decompresses) from
-standard input to standard output. In this case, lzip will decline to
-write compressed output to a terminal, as this would be entirely
-incomprehensible and therefore pointless.
-
-Lzip will correctly decompress a file which is the concatenation of two
-or more compressed files. The result is the concatenation of the
-corresponding uncompressed files. Integrity testing of concatenated
-compressed files is also supported.
-
-Lzip can produce multi-member files and safely recover, with
-lziprecover, the undamaged members in case of file damage. Lzip can also
-split the compressed output in volumes of a given size, even when
-reading from standard input. This allows the direct creation of
-multivolume compressed tar archives.
-
-Lzip is able to compress and decompress streams of unlimited size by
-automatically creating multi-member output. The members so created are
-large, about 64 PiB each.
-
-The amount of memory required for compression is about 1 or 2 times the
-dictionary size limit (1 if input file size is less than dictionary size
-limit, else 2) plus 9 times the dictionary size really used. The option
-@samp{-0} is special and only requires about 1.5 MiB at most. The amount
-of memory required for decompression is only a few tens of KiB larger
-than the dictionary size really used.
-
-Lzip will automatically use the smallest possible dictionary size
-without exceeding the given limit. Keep in mind that the decompression
-memory requirement is affected at compression time by the choice of
-dictionary size limit.
-
-When decompressing, lzip attempts to guess the name for the decompressed
-file from that of the compressed file as follows:
-
-@multitable {anyothername} {becomes} {anyothername.out}
-@item filename.lz @tab becomes @tab filename
-@item filename.tlz @tab becomes @tab filename.tar
-@item anyothername @tab becomes @tab anyothername.out
-@end multitable
-
-
-@node Algorithm
-@chapter Algorithm
-@cindex algorithm
-
-Lzip implements a simplified version of the LZMA (Lempel-Ziv-Markov
-chain-Algorithm) algorithm. The high compression of LZMA comes from
-combining two basic, well-proven compression ideas: sliding dictionaries
-(LZ77/78) and markov models (the thing used by every compression
-algorithm that uses a range encoder or similar order-0 entropy coder as
-its last stage) with segregation of contexts according to what the bits
-are used for.
-
-Lzip is a two stage compressor. The first stage is a Lempel-Ziv coder,
-which reduces redundancy by translating chunks of data to their
-corresponding distance-length pairs. The second stage is a range encoder
-that uses a different probability model for each type of data;
-distances, lengths, literal bytes, etc.
-
-The match finder, part of the LZ coder, is the most important piece of
-the LZMA algorithm, as it is in many Lempel-Ziv based algorithms. Most
-of lzip's execution time is spent in the match finder, and it has the
-greatest influence on the compression ratio.
-
-Here is how it works, step by step:
-
-1) The member header is written to the output stream.
-
-2) The first byte is coded literally, because there are no previous
-bytes to which the match finder can refer to.
-
-3) The main encoder advances to the next byte in the input data and
-calls the match finder.
-
-4) The match finder fills an array with the minimum distances before the
-current byte where a match of a given length can be found.
-
-5) Go back to step 3 until a sequence (formed of pairs, repeated
-distances and literal bytes) of minimum price has been formed. Where the
-price represents the number of output bits produced.
-
-6) The range encoder encodes the sequence produced by the main encoder
-and sends the produced bytes to the output stream.
-
-7) Go back to step 3 until the input data is finished or until the
-member or volume size limits are reached.
-
-8) The range encoder is flushed.
-
-9) The member trailer is written to the output stream.
-
-10) If there are more data to compress, go back to step 1.
-
-@sp 1
-@noindent
-The ideas embodied in lzip are due to (at least) the following people:
-Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for
-the definition of Markov chains), G.N.N. Martin (for the definition of
-range encoding), Igor Pavlov (for putting all the above together in
-LZMA), and Julian Seward (for bzip2's CLI).
-
-
-@node Invoking lzip
-@chapter Invoking lzip
-@cindex invoking
-@cindex options
-@cindex usage
-@cindex version
-
-The format for running lzip is:
-
-@example
-lzip [@var{options}] [@var{files}]
-@end example
-
-Lzip supports the following options:
-
-@table @samp
-@item -h
-@itemx --help
-Print an informative help message describing the options and exit.
-
-@item -V
-@itemx --version
-Print the version number of lzip on the standard output and exit.
-
-@item -b @var{bytes}
-@itemx --member-size=@var{bytes}
-Set the member size limit to @var{bytes}. A small member size may
-degrade compression ratio, so use it only when needed. Valid values
-range from 100 kB to 64 PiB. Defaults to 64 PiB.
-
-@item -c
-@itemx --stdout
-Compress or decompress to standard output. Needed when reading from a
-named pipe (fifo) or from a device. Use it to recover as much of the
-uncompressed data as possible when decompressing a corrupt file.
-
-@item -d
-@itemx --decompress
-Decompress.
-
-@item -f
-@itemx --force
-Force overwrite of output files.
-
-@item -F
-@itemx --recompress
-Force recompression of files whose name already has the @samp{.lz} or
-@samp{.tlz} suffix.
-
-@item -k
-@itemx --keep
-Keep (don't delete) input files during compression or decompression.
-
-@item -m @var{bytes}
-@itemx --match-length=@var{bytes}
-Set the match length limit in bytes. After a match this long is found,
-the search is finished. Valid values range from 5 to 273. Larger values
-usually give better compression ratios but longer compression times.
-
-@item -o @var{file}
-@itemx --output=@var{file}
-When reading from standard input and @samp{--stdout} has not been
-specified, use @samp{@var{file}} as the virtual name of the uncompressed
-file. This produces a file named @samp{@var{file}} when decompressing, a
-file named @samp{@var{file}.lz} when compressing, and several files
-named @samp{@var{file}00001.lz}, @samp{@var{file}00002.lz}, etc, when
-compressing and splitting the output in volumes.
-
-@item -q
-@itemx --quiet
-Quiet operation. Suppress all messages.
-
-@item -s @var{bytes}
-@itemx --dictionary-size=@var{bytes}
-Set the dictionary size limit in bytes. Valid values range from 4 KiB to
-512 MiB. Lzip will use the smallest possible dictionary size for each
-member without exceeding this limit. Note that dictionary sizes are
-quantized. If the specified size does not match one of the valid sizes,
-it will be rounded upwards by adding up to (@var{bytes} / 16) to it.
-
-For maximum compression you should use a dictionary size limit as large
-as possible, but keep in mind that the decompression memory requirement
-is affected at compression time by the choice of dictionary size limit.
-
-@item -S @var{bytes}
-@itemx --volume-size=@var{bytes}
-Split the compressed output into several volume files with names
-@samp{original_name00001.lz}, @samp{original_name00002.lz}, etc, and set
-the volume size limit to @var{bytes}. Each volume is a complete, maybe
-multi-member, lzip file. A small volume size may degrade compression
-ratio, so use it only when needed. Valid values range from 100 kB to 4
-EiB.
-
-@item -t
-@itemx --test
-Check integrity of the specified file(s), but don't decompress them.
-This really performs a trial decompression and throws away the result.
-Use it together with @samp{-v} to see information about the file.
-
-@item -v
-@itemx --verbose
-Verbose mode.@*
-When compressing, show the compression ratio for each file processed. A
-second @samp{-v} shows the progress of compression.@*
-When decompressing or testing, further -v's (up to 4) increase the
-verbosity level, showing status, compression ratio, dictionary size,
-trailer contents (CRC, data size, member size), and up to 6 bytes of
-trailing garbage (if any).
-
-@item -0 .. -9
-Set the compression parameters (dictionary size and match length limit)
-as shown in the table below. Note that @samp{-9} can be much slower than
-@samp{-0}. These options have no effect when decompressing.
-
-The bidimensional parameter space of LZMA can't be mapped to a linear
-scale optimal for all files. If your files are large, very repetitive,
-etc, you may need to use the @samp{--match-length} and
-@samp{--dictionary-size} options directly to achieve optimal
-performance. For example, @samp{-9m64} usually compresses executables
-more (and faster) than @samp{-9}.
-
-@multitable {Level} {Dictionary size} {Match length limit}
-@item Level @tab Dictionary size @tab Match length limit
-@item -0 @tab 64 KiB @tab 16 bytes
-@item -1 @tab 1 MiB @tab 5 bytes
-@item -2 @tab 1.5 MiB @tab 6 bytes
-@item -3 @tab 2 MiB @tab 8 bytes
-@item -4 @tab 3 MiB @tab 12 bytes
-@item -5 @tab 4 MiB @tab 20 bytes
-@item -6 @tab 8 MiB @tab 36 bytes
-@item -7 @tab 16 MiB @tab 68 bytes
-@item -8 @tab 24 MiB @tab 132 bytes
-@item -9 @tab 32 MiB @tab 273 bytes
-@end multitable
-
-@item --fast
-@itemx --best
-Aliases for GNU gzip compatibility.
-
-@end table
-
-Numbers given as arguments to options may be followed by a multiplier
-and an optional @samp{B} for "byte".
-
-Table of SI and binary prefixes (unit multipliers):
-
-@multitable {Prefix} {kilobyte (10^3 = 1000)} {|} {Prefix} {kibibyte (2^10 = 1024)}
-@item Prefix @tab Value @tab | @tab Prefix @tab Value
-@item k @tab kilobyte (10^3 = 1000) @tab | @tab Ki @tab kibibyte (2^10 = 1024)
-@item M @tab megabyte (10^6) @tab | @tab Mi @tab mebibyte (2^20)
-@item G @tab gigabyte (10^9) @tab | @tab Gi @tab gibibyte (2^30)
-@item T @tab terabyte (10^12) @tab | @tab Ti @tab tebibyte (2^40)
-@item P @tab petabyte (10^15) @tab | @tab Pi @tab pebibyte (2^50)
-@item E @tab exabyte (10^18) @tab | @tab Ei @tab exbibyte (2^60)
-@item Z @tab zettabyte (10^21) @tab | @tab Zi @tab zebibyte (2^70)
-@item Y @tab yottabyte (10^24) @tab | @tab Yi @tab yobibyte (2^80)
-@end multitable
-
-@sp 1
-Exit status: 0 for a normal exit, 1 for environmental problems (file not
-found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or
-invalid input file, 3 for an internal consistency error (eg, bug) which
-caused lzip to panic.
-
-
-@node File format
-@chapter File format
-@cindex file format
-
-Perfection is reached, not when there is no longer anything to add, but
-when there is no longer anything to take away.@*
---- Antoine de Saint-Exupery
-
-@sp 1
-In the diagram below, a box like this:
-@verbatim
-+---+
-| | <-- the vertical bars might be missing
-+---+
-@end verbatim
-
-represents one byte; a box like this:
-@verbatim
-+==============+
-| |
-+==============+
-@end verbatim
-
-represents a variable number of bytes.
-
-@sp 1
-A lzip file consists of a series of "members" (compressed data sets).
-The members simply appear one after another in the file, with no
-additional information before, between, or after them.
-
-Each member has the following structure:
-@verbatim
-+--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
-| ID string | VN | DS | Lzma stream | CRC32 | Data size | Member size |
-+--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
-@end verbatim
-
-All multibyte values are stored in little endian order.
-
-@table @samp
-@item ID string
-A four byte string, identifying the lzip format, with the value "LZIP"
-(0x4C, 0x5A, 0x49, 0x50).
-
-@item VN (version number, 1 byte)
-Just in case something needs to be modified in the future. 1 for now.
-
-@item DS (coded dictionary size, 1 byte)
-Lzip divides the distance between any two powers of 2 into 8 equally
-spaced intervals, named "wedges". The dictionary size is calculated by
-taking a power of 2 (the base size) and substracting from it a number of
-wedges between 0 and 7. The size of a wedge is (base_size / 16).@*
-Bits 4-0 contain the base 2 logarithm of the base size (12 to 29).@*
-Bits 7-5 contain the number of wedges (0 to 7) to substract from the
-base size to obtain the dictionary size.@*
-Example: 0xD3 = 2^19 - 6 * 2^15 = 512 KiB - 6 * 32 KiB = 320 KiB@*
-Valid values for dictionary size range from 4 KiB to 512 MiB.
-
-@item Lzma stream
-The lzma stream, finished by an end of stream marker. Uses default
-values for encoder properties. See the chapter @samp{Stream format}
-(@pxref{Stream format}) for a complete description.
-
-@item CRC32 (4 bytes)
-CRC of the uncompressed original data.
-
-@item Data size (8 bytes)
-Size of the uncompressed original data.
-
-@item Member size (8 bytes)
-Total size of the member, including header and trailer. This field acts
-as a distributed index, allows the verification of stream integrity, and
-facilitates safe recovery of undamaged members from multi-member files.
-
-@end table
-
-
-@node Stream format
-@chapter Format of the LZMA stream in lzip files
-@cindex format of the LZMA stream
-
-The LZMA algorithm has three parameters, called "special LZMA
-properties", to adjust it for some kinds of binary data. These
-parameters are; @samp{literal_context_bits} (with a default value of 3),
-@samp{literal_pos_state_bits} (with a default value of 0), and
-@samp{pos_state_bits} (with a default value of 2). As a general purpose
-compressor, lzip only uses the default values for these parameters.
-
-Lzip also finishes the LZMA stream with an "End Of Stream" marker (the
-distance-length pair 0xFFFFFFFFU, 2), which in conjunction with the
-"member size" field in the member trailer allows the verification of
-stream integrity. The LZMA stream in lzip files always has these two
-features (default properties and EOS marker) and is referred to in this
-document as LZMA-302eos or LZMA-lzip.
-
-The second stage of LZMA is a range encoder that uses a different
-probability model for each type of symbol; distances, lengths, literal
-bytes, etc. Range encoding conceptually encodes all the symbols of the
-message into one number. Unlike Huffman coding, which assigns to each
-symbol a bit-pattern and concatenates all the bit-patterns together,
-range encoding can compress one symbol to less than one bit. Therefore
-the compressed data produced by a range encoder can't be split in pieces
-that could be individually described.
-
-It seems that the only way of describing the LZMA-302eos stream is
-describing the algorithm that decodes it. And given the many details
-about the range decoder that need to be described accurately, the source
-code of a real decoder seems the only appropiate reference to use.
-
-What follows is a description of the decoding algorithm for LZMA-302eos
-streams using as reference the source code of "lzd", an educational
-decompressor for lzip files which can be downloaded from the lzip
-download directory. The source code of lzd is included in appendix A.
-@ref{Reference source code}
-
-@sp 1
-@section What is coded
-
-The LZMA stream includes literals, matches and repeated matches (matches
-reusing a recently used distance). There are 7 different coding sequences:
-
-@multitable @columnfractions .35 .14 .51
-@headitem Bit sequence @tab Name @tab Description
-@item 0 + byte @tab literal @tab literal byte
-@item 1 + 0 + len + dis @tab match @tab distance-length pair
-@item 1 + 1 + 0 + 0 @tab shortrep @tab 1 byte match at latest used
-distance
-@item 1 + 1 + 0 + 1 + len @tab rep0 @tab len bytes match at latest used
-distance
-@item 1 + 1 + 1 + 0 + len @tab rep1 @tab len bytes match at second
-latest used distance
-@item 1 + 1 + 1 + 1 + 0 + len @tab rep2 @tab len bytes match at third
-latest used distance
-@item 1 + 1 + 1 + 1 + 1 + len @tab rep3 @tab len bytes match at fourth
-latest used distance
-@end multitable
-
-@sp 1
-In the following tables, multi-bit sequences are coded in normal order,
-from MSB to LSB, except where noted otherwise.
-
-Lengths (the @samp{len} in the table above) are coded as follows:
-
-@multitable @columnfractions .5 .5
-@headitem Bit sequence @tab Description
-@item 0 + 3 bits @tab lengths from 2 to 9
-@item 1 + 0 + 3 bits @tab lengths from 10 to 17
-@item 1 + 1 + 8 bits @tab lengths from 18 to 273
-@end multitable
-
-@sp 1
-The coding of distances is a little more complicated. LZMA divides the
-interval between any two powers of 2 into 2 halves, named slots. As
-possible distances range from 0 to (2^32 - 1), there are 64 slots (0 to
-63). The slot number is context-coded in 6 bits. @samp{direct_bits} are
-the remaining bits (from 0 to 30) needed to form a complete distance,
-and are calculated as (slot >> 1) - 1. If a distance needs 6 or more
-direct_bits, the last 4 bits are coded separately. The last piece
-(direct_bits for distances 4 to 127 or the last 4 bits for distances >=
-128) is context-coded in reverse order (from LSB to MSB). For distances
->= 128, the @samp{direct_bits - 4} part is coded with fixed 0.5
-probability.
-
-@multitable @columnfractions .5 .5
-@headitem Bit sequence @tab Description
-@item slot @tab distances from 0 to 3
-@item slot + direct_bits @tab distances from 4 to 127
-@item slot + (direct_bits - 4) + 4 bits @tab distances from 128 to
-2^32 - 1
-@end multitable
-
-@sp 1
-@section The coding contexts
-
-These contexts (@samp{Bit_model} in the source), are integers or arrays
-of integers representing the probability of the corresponding bit being 0.
-
-The indices used in these arrays are:
-
-@table @samp
-@item state
-A state machine (@samp{State} in the source) with 12 states (0 to 11),
-coding the latest 2 to 4 types of sequences processed. The initial state
-is 0.
-
-@item pos_state
-Value of the 2 least significant bits of the current position in the
-decoded data.
-
-@item literal_state
-Value of the 3 most significant bits of the latest byte decoded.
-
-@item len_state
-Coded value of length (length - 2), with a maximum of 3. The resulting
-value is in the range 0 to 3.
-
-@end table
-
-
-In the following table, @samp{!literal} is any sequence except a literal
-byte. @samp{rep} is any one of @samp{rep0}, @samp{rep1}, @samp{rep2} or
-@samp{rep3}. The types of previous sequences corresponding to each state
-are:
-
-@multitable {State} {literal, shortrep, literal, literal}
-@headitem State @tab Types of previous sequences
-@item 0 @tab literal, literal, literal
-@item 1 @tab match, literal, literal
-@item 2 @tab (rep or shortrep), literal, literal
-@item 3 @tab literal, shortrep, literal, literal
-@item 4 @tab match, literal
-@item 5 @tab (rep or shortrep), literal
-@item 6 @tab literal, shortrep, literal
-@item 7 @tab literal, match
-@item 8 @tab literal, rep
-@item 9 @tab literal, shortrep
-@item 10 @tab !literal, match
-@item 11 @tab !literal, (rep or shortrep)
-@end multitable
-
-@sp 1
-The contexts for decoding the type of coding sequence are:
-
-@multitable @columnfractions .2 .4 .4
-@headitem Name @tab Indices @tab Used when
-@item bm_match @tab state, pos_state @tab sequence start
-@item bm_rep @tab state @tab after sequence 1
-@item bm_rep0 @tab state @tab after sequence 11
-@item bm_rep1 @tab state @tab after sequence 111
-@item bm_rep2 @tab state @tab after sequence 1111
-@item bm_len @tab state, pos_state @tab after sequence 110
-@end multitable
-
-@sp 1
-The contexts for decoding distances are:
-
-@multitable @columnfractions .2 .4 .4
-@headitem Name @tab Indices @tab Used when
-@item bm_dis_slot @tab len_state, bit tree @tab distance start
-@item bm_dis @tab reverse bit tree @tab after slots 4 to 13
-@item bm_align @tab reverse bit tree @tab for distances >= 128, after
-fixed probability bits
-@end multitable
-
-@sp 1
-There are two separate sets of contexts for lengths (@samp{Len_model} in
-the source). One for normal matches, the other for repeated matches. The
-contexts in each Len_model are (see @samp{decode_len} in the source):
-
-@multitable @columnfractions .2 .4 .4
-@headitem Name @tab Indices @tab Used when
-@item choice1 @tab none @tab length start
-@item choice2 @tab none @tab after sequence 1
-@item bm_low @tab pos_state, bit tree @tab after sequence 0
-@item bm_mid @tab pos_state, bit tree @tab after sequence 10
-@item bm_high @tab bit tree @tab after sequence 11
-@end multitable
-
-@sp 1
-The context array @samp{bm_literal} is special. In principle it acts as
-a normal bit tree context, the one selected by @samp{literal_state}. But
-if the previous decoded byte was not a literal, two other bit tree
-contexts are used depending on the value of each bit in
-@samp{match_byte} (the byte at the latest used distance), until a bit is
-decoded that is different from its corresponding bit in
-@samp{match_byte}. After the first difference is found, the rest of the
-byte is decoded using the normal bit tree context. (See
-@samp{decode_matched} in the source).
-
-@sp 1
-@section The range decoder
-
-The LZMA stream is consumed one byte at a time by the range decoder.
-(See @samp{normalize} in the source). Every byte consumed produces a
-variable number of decoded bits, depending on how well these bits agree
-with their context. (See @samp{decode_bit} in the source).
-
-The range decoder state consists of two unsigned 32-bit variables,
-@code{range} (representing the most significant part of the range size
-not yet decoded), and @code{code} (representing the current point within
-@code{range}). @code{range} is initialized to (2^32 - 1), and
-@code{code} is initialized to 0.
-
-The range encoder produces a first 0 byte that must be ignored by the
-range decoder. This is done by shifting 5 bytes in the initialization of
-@code{code} instead of 4. (See the @samp{Range_decoder} constructor in
-the source).
-
-@sp 1
-@section Decoding the LZMA stream
-
-After decoding the member header and obtaining the dictionary size, the
-range decoder is initialized and then the LZMA decoder enters a loop
-(See @samp{decode_member} in the source) where it invokes the range
-decoder with the appropiate contexts to decode the different coding
-sequences (matches, repeated matches, and literal bytes), until the "End
-Of Stream" marker is decoded.
-
-
-@node Examples
-@chapter A small tutorial with examples
-@cindex examples
-
-WARNING! Even if lzip is bug-free, other causes may result in a corrupt
-compressed file (bugs in the system libraries, memory errors, etc).
-Therefore, if the data you are going to compress is important, give the
-@samp{--keep} option to lzip and do not remove the original file until
-you verify the compressed file with a command like
-@w{@samp{lzip -cd file.lz | cmp file -}}.
-
-@sp 1
-@noindent
-Example 1: Replace a regular file with its compressed version
-@samp{file.lz} and show the compression ratio.
-
-@example
-lzip -v file
-@end example
-
-@sp 1
-@noindent
-Example 2: Like example 1 but the created @samp{file.lz} is multi-member
-with a member size of 1 MiB. The compression ratio is not shown.
-
-@example
-lzip -b 1MiB file
-@end example
-
-@sp 1
-@noindent
-Example 3: Restore a regular file from its compressed version
-@samp{file.lz}. If the operation is successful, @samp{file.lz} is
-removed.
-
-@example
-lzip -d file.lz
-@end example
-
-@sp 1
-@noindent
-Example 4: Verify the integrity of the compressed file @samp{file.lz}
-and show status.
-
-@example
-lzip -tv file.lz
-@end example
-
-@sp 1
-@noindent
-Example 5: Compress a whole floppy in /dev/fd0 and send the output to
-@samp{file.lz}.
-
-@example
-lzip -c /dev/fd0 > file.lz
-@end example
-
-@sp 1
-@noindent
-Example 6: Decompress @samp{file.lz} partially until 10 KiB of
-decompressed data are produced.
-
-@example
-lzip -cd file.lz | dd bs=1024 count=10
-@end example
-
-@sp 1
-@noindent
-Example 7: Decompress @samp{file.lz} partially from decompressed byte
-10000 to decompressed byte 15000 (5000 bytes are produced).
-
-@example
-lzip -cd file.lz | dd bs=1000 skip=10 count=5
-@end example
-
-@sp 1
-@noindent
-Example 8: Create a multivolume compressed tar archive with a volume
-size of 1440 KiB.
-
-@example
-tar -c some_directory | lzip -S 1440KiB -o volume_name
-@end example
-
-@sp 1
-@noindent
-Example 9: Extract a multivolume compressed tar archive.
-
-@example
-lzip -cd volume_name*.lz | tar -xf -
-@end example
-
-@sp 1
-@noindent
-Example 10: Create a multivolume compressed backup of a large database
-file with a volume size of 650 MB, where each volume is a multi-member
-file with a member size of 32 MiB.
-
-@example
-lzip -b 32MiB -S 650MB big_db
-@end example
-
-
-@node Problems
-@chapter Reporting bugs
-@cindex bugs
-@cindex getting help
-
-There are probably bugs in lzip. There are certainly errors and
-omissions in this manual. If you report them, they will get fixed. If
-you don't, no one will ever know about them and they will remain unfixed
-for all eternity, if not longer.
-
-If you find a bug in lzip, please send electronic mail to
-@email{lzip-bug@@nongnu.org}. Include the version number, which you can
-find by running @w{@samp{lzip --version}}.
-
-
-@node Reference source code
-@appendix Reference source code
-@cindex reference source code
-
-@verbatim
-/* Lzd - Educational decompressor for lzip files
- Copyright (C) 2013 Antonio Diaz Diaz.
-
- This program is free software: you have unlimited permission
- to copy, distribute and modify it.
-
- This program is distributed in the hope that it will be useful,
- but WITHOUT ANY WARRANTY; without even the implied warranty of
- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
-*/
-/*
- Exit status: 0 for a normal exit, 1 for environmental problems
- (file not found, invalid flags, I/O errors, etc), 2 to indicate a
- corrupt or invalid input file.
-*/
-
-#include <algorithm>
-#include <cerrno>
-#include <cstdio>
-#include <cstdlib>
-#include <cstring>
-#include <stdint.h>
-#include <unistd.h>
-
-
-class State
- {
- int st;
-
-public:
- enum { states = 12 };
- State() : st( 0 ) {}
- int operator()() const { return st; }
- bool is_char() const { return st < 7; }
-
- void set_char()
- {
- static const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 };
- st = next[st];
- }
- void set_match() { st = ( st < 7 ) ? 7 : 10; }
- void set_rep() { st = ( st < 7 ) ? 8 : 11; }
- void set_short_rep() { st = ( st < 7 ) ? 9 : 11; }
- };
-
-
-enum {
- min_dictionary_size = 1 << 12,
- max_dictionary_size = 1 << 29,
- literal_context_bits = 3,
- pos_state_bits = 2,
- pos_states = 1 << pos_state_bits,
- pos_state_mask = pos_states - 1,
-
- len_states = 4,
- dis_slot_bits = 6,
- start_dis_model = 4,
- end_dis_model = 14,
- modeled_distances = 1 << (end_dis_model / 2), // 128
- dis_align_bits = 4,
- dis_align_size = 1 << dis_align_bits,
-
- len_low_bits = 3,
- len_mid_bits = 3,
- len_high_bits = 8,
- len_low_symbols = 1 << len_low_bits,
- len_mid_symbols = 1 << len_mid_bits,
- len_high_symbols = 1 << len_high_bits,
- max_len_symbols = len_low_symbols + len_mid_symbols + len_high_symbols,
-
- min_match_len = 2, // must be 2
-
- bit_model_move_bits = 5,
- bit_model_total_bits = 11,
- bit_model_total = 1 << bit_model_total_bits };
-
-struct Bit_model
- {
- int probability;
- Bit_model() : probability( bit_model_total / 2 ) {}
- };
-
-struct Len_model
- {
- Bit_model choice1;
- Bit_model choice2;
- Bit_model bm_low[pos_states][len_low_symbols];
- Bit_model bm_mid[pos_states][len_mid_symbols];
- Bit_model bm_high[len_high_symbols];
- };
-
-
-class CRC32
- {
- uint32_t data[256]; // Table of CRCs of all 8-bit messages.
-
-public:
- CRC32()
- {
- for( unsigned n = 0; n < 256; ++n )
- {
- unsigned c = n;
- for( int k = 0; k < 8; ++k )
- { if( c & 1 ) c = 0xEDB88320U ^ ( c >> 1 ); else c >>= 1; }
- data[n] = c;
- }
- }
-
- void update_buf( uint32_t & crc, const uint8_t * const buffer,
- const int size ) const
- {
- for( int i = 0; i < size; ++i )
- crc = data[(crc^buffer[i])&0xFF] ^ ( crc >> 8 );
- }
- };
-
-const CRC32 crc32;
-
-
-typedef uint8_t File_header[6]; // 0-3 magic, 4 version, 5 coded_dict_size
-
-typedef uint8_t File_trailer[20];
- // 0-3 CRC32 of the uncompressed data
- // 4-11 size of the uncompressed data
- // 12-19 member size including header and trailer
-
-class Range_decoder
- {
- uint32_t code;
- uint32_t range;
-
-public:
- Range_decoder() : code( 0 ), range( 0xFFFFFFFFU )
- {
- for( int i = 0; i < 5; ++i ) code = (code << 8) | get_byte();
- }
-
- uint8_t get_byte() { return std::getc( stdin ); }
-
- int decode( const int num_bits )
- {
- int symbol = 0;
- for( int i = 0; i < num_bits; ++i )
- {
- range >>= 1;
- symbol <<= 1;
- if( code >= range ) { code -= range; symbol |= 1; }
- if( range <= 0x00FFFFFFU ) // normalize
- { range <<= 8; code = (code << 8) | get_byte(); }
- }
- return symbol;
- }
-
- int decode_bit( Bit_model & bm )
- {
- int symbol;
- const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability;
- if( code < bound )
- {
- range = bound;
- bm.probability += (bit_model_total - bm.probability) >> bit_model_move_bits;
- symbol = 0;
- }
- else
- {
- range -= bound;
- code -= bound;
- bm.probability -= bm.probability >> bit_model_move_bits;
- symbol = 1;
- }
- if( range <= 0x00FFFFFFU ) // normalize
- { range <<= 8; code = (code << 8) | get_byte(); }
- return symbol;
- }
-
- int decode_tree( Bit_model bm[], const int num_bits )
- {
- int symbol = 1;
- for( int i = 0; i < num_bits; ++i )
- symbol = ( symbol << 1 ) | decode_bit( bm[symbol] );
- return symbol - (1 << num_bits);
- }
-
- int decode_tree_reversed( Bit_model bm[], const int num_bits )
- {
- int symbol = decode_tree( bm, num_bits );
- int reversed_symbol = 0;
- for( int i = 0; i < num_bits; ++i )
- {
- reversed_symbol = ( reversed_symbol << 1 ) | ( symbol & 1 );
- symbol >>= 1;
- }
- return reversed_symbol;
- }
-
- int decode_matched( Bit_model bm[], const int match_byte )
- {
- Bit_model * const bm1 = bm + 0x100;
- int symbol = 1;
- for( int i = 7; i >= 0; --i )
- {
- const int match_bit = ( match_byte >> i ) & 1;
- const int bit = decode_bit( bm1[(match_bit<<8)+symbol] );
- symbol = ( symbol << 1 ) | bit;
- if( match_bit != bit )
- {
- while( symbol < 0x100 )
- symbol = ( symbol << 1 ) | decode_bit( bm[symbol] );
- break;
- }
- }
- return symbol - 0x100;
- }
-
- int decode_len( Len_model & lm, const int pos_state )
- {
- if( decode_bit( lm.choice1 ) == 0 )
- return decode_tree( lm.bm_low[pos_state], len_low_bits );
- if( decode_bit( lm.choice2 ) == 0 )
- return len_low_symbols +
- decode_tree( lm.bm_mid[pos_state], len_mid_bits );
- return len_low_symbols + len_mid_symbols +
- decode_tree( lm.bm_high, len_high_bits );
- }
- };
-
-
-class LZ_decoder
- {
- unsigned long long partial_data_pos;
- Range_decoder rdec;
- const unsigned dictionary_size;
- uint8_t * const buffer; // output buffer
- unsigned pos; // current pos in buffer
- unsigned stream_pos; // first byte not yet written to stdout
- uint32_t crc_;
-
- void flush_data();
-
- uint8_t get_byte( const unsigned distance ) const
- {
- unsigned i = pos - distance - 1;
- if( pos <= distance ) i += dictionary_size;
- return buffer[i];
- }
-
- void put_byte( const uint8_t b )
- {
- buffer[pos] = b;
- if( ++pos >= dictionary_size ) flush_data();
- }
-
-public:
- LZ_decoder( const unsigned dict_size )
- :
- partial_data_pos( 0 ),
- dictionary_size( dict_size ),
- buffer( new uint8_t[dictionary_size] ),
- pos( 0 ),
- stream_pos( 0 ),
- crc_( 0xFFFFFFFFU )
- { buffer[dictionary_size-1] = 0; } // prev_byte of first_byte
-
- ~LZ_decoder() { delete[] buffer; }
-
- unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; }
- unsigned long long data_position() const { return partial_data_pos + pos; }
-
- bool decode_member();
- };
-
-
-void LZ_decoder::flush_data()
- {
- if( pos > stream_pos )
- {
- const unsigned size = pos - stream_pos;
- crc32.update_buf( crc_, buffer + stream_pos, size );
- errno = 0;
- if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size )
- { std::fprintf( stderr, "Write error: %s\n", std::strerror( errno ) );
- std::exit( 1 ); }
- if( pos >= dictionary_size ) { partial_data_pos += pos; pos = 0; }
- stream_pos = pos;
- }
- }
-
-
-bool LZ_decoder::decode_member() // Returns false if error
- {
- Bit_model bm_literal[1<<literal_context_bits][0x300];
- Bit_model bm_match[State::states][pos_states];
- Bit_model bm_rep[State::states];
- Bit_model bm_rep0[State::states];
- Bit_model bm_rep1[State::states];
- Bit_model bm_rep2[State::states];
- Bit_model bm_len[State::states][pos_states];
- Bit_model bm_dis_slot[len_states][1<<dis_slot_bits];
- Bit_model bm_dis[modeled_distances-end_dis_model];
- Bit_model bm_align[dis_align_size];
- Len_model match_len_model;
- Len_model rep_len_model;
- unsigned rep0 = 0; // rep[0-3] latest four distances
- unsigned rep1 = 0; // used for efficient coding of
- unsigned rep2 = 0; // repeated distances
- unsigned rep3 = 0;
- State state;
-
- while( !std::feof( stdin ) && !std::ferror( stdin ) )
- {
- const int pos_state = data_position() & pos_state_mask;
- if( rdec.decode_bit( bm_match[state()][pos_state] ) == 0 ) // 1st bit
- {
- const uint8_t prev_byte = get_byte( 0 );
- const int literal_state = prev_byte >> ( 8 - literal_context_bits );
- Bit_model * const bm = bm_literal[literal_state];
- if( state.is_char() )
- put_byte( rdec.decode_tree( bm, 8 ) );
- else
- put_byte( rdec.decode_matched( bm, get_byte( rep0 ) ) );
- state.set_char();
- }
- else
- {
- int len;
- if( rdec.decode_bit( bm_rep[state()] ) != 0 ) // 2nd bit
- {
- if( rdec.decode_bit( bm_rep0[state()] ) != 0 ) // 3rd bit
- {
- unsigned distance;
- if( rdec.decode_bit( bm_rep1[state()] ) == 0 ) // 4th bit
- distance = rep1;
- else
- {
- if( rdec.decode_bit( bm_rep2[state()] ) == 0 ) // 5th bit
- distance = rep2;
- else
- { distance = rep3; rep3 = rep2; }
- rep2 = rep1;
- }
- rep1 = rep0;
- rep0 = distance;
- }
- else
- {
- if( rdec.decode_bit( bm_len[state()][pos_state] ) == 0 ) // 4th bit
- { state.set_short_rep(); put_byte( get_byte( rep0 ) ); continue; }
- }
- state.set_rep();
- len = min_match_len + rdec.decode_len( rep_len_model, pos_state );
- }
- else
- {
- rep3 = rep2; rep2 = rep1; rep1 = rep0;
- len = min_match_len + rdec.decode_len( match_len_model, pos_state );
- const int len_state = std::min( len - min_match_len, len_states - 1 );
- const int dis_slot =
- rdec.decode_tree( bm_dis_slot[len_state], dis_slot_bits );
- if( dis_slot < start_dis_model ) rep0 = dis_slot;
- else
- {
- const int direct_bits = ( dis_slot >> 1 ) - 1;
- rep0 = ( 2 | ( dis_slot & 1 ) ) << direct_bits;
- if( dis_slot < end_dis_model )
- rep0 += rdec.decode_tree_reversed( bm_dis + rep0 - dis_slot - 1,
- direct_bits );
- else
- {
- rep0 += rdec.decode( direct_bits - dis_align_bits ) << dis_align_bits;
- rep0 += rdec.decode_tree_reversed( bm_align, dis_align_bits );
- if( rep0 == 0xFFFFFFFFU ) // Marker found
- {
- flush_data();
- return ( len == min_match_len ); // End Of Stream marker
- }
- }
- }
- state.set_match();
- if( rep0 >= dictionary_size || ( rep0 >= pos && !partial_data_pos ) )
- return false;
- }
- for( int i = 0; i < len; ++i )
- put_byte( get_byte( rep0 ) );
- }
- }
- return false;
- }
-
-
-int main( const int argc, const char * const argv[] )
- {
- if( argc > 1 )
- {
- std::printf( "Lzd %s - Educational decompressor for lzip files.\n",
- PROGVERSION );
- std::printf( "Study the source to learn how a lzip decompressor works.\n"
- "See the lzip manual for an explanation of the code.\n"
- "It is not safe to use lzd for any real work.\n"
- "\nUsage: %s < file.lz > file\n", argv[0] );
- std::printf( "Lzd decompresses from standard input to standard output.\n"
- "\nCopyright (C) 2013 Antonio Diaz Diaz.\n"
- "This is free software: you are free to change and redistribute it.\n"
- "There is NO WARRANTY, to the extent permitted by law.\n"
- "Report bugs to lzip-bug@nongnu.org\n"
- "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n" );
- return 0;
- }
-
- for( bool first_member = true; ; first_member = false )
- {
- File_header header;
- for( int i = 0; i < 6; ++i )
- header[i] = std::getc( stdin );
- if( std::feof( stdin ) || std::memcmp( header, "LZIP", 4 ) != 0 )
- {
- if( first_member )
- { std::fprintf( stderr, "Bad magic number (file not in lzip format)\n" );
- return 2; }
- break;
- }
- if( header[4] != 1 )
- {
- std::fprintf( stderr, "Version %d member format not supported.\n",
- header[4] );
- return 2;
- }
- unsigned dict_size = 1 << ( header[5] & 0x1F );
- dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 );
- if( dict_size < min_dictionary_size || dict_size > max_dictionary_size )
- { std::fprintf( stderr, "Invalid dictionary size in member header\n" );
- return 2; }
-
- LZ_decoder decoder( dict_size );
- if( !decoder.decode_member() )
- { std::fprintf( stderr, "Data error\n" ); return 2; }
-
- File_trailer trailer;
- for( int i = 0; i < 20; ++i ) trailer[i] = std::getc( stdin );
- unsigned crc = 0;
- for( int i = 3; i >= 0; --i ) { crc <<= 8; crc += trailer[i]; }
- unsigned long long data_size = 0;
- for( int i = 11; i >= 4; --i ) { data_size <<= 8; data_size += trailer[i]; }
- if( crc != decoder.crc() || data_size != decoder.data_position() )
- { std::fprintf( stderr, "CRC error\n" ); return 2; }
- }
-
- if( std::fclose( stdout ) != 0 )
- { std::fprintf( stderr, "Can't close stdout: %s\n", std::strerror( errno ) );
- return 1; }
- return 0;
- }
-@end verbatim
-
-
-@node Concept index
-@unnumbered Concept index
-
-@printindex cp
-
-@bye