Diffstat (limited to 'doc/lzip.texi')
-rw-r--r-- | doc/lzip.texi | 137 |
1 files changed, 110 insertions, 27 deletions
diff --git a/doc/lzip.texi b/doc/lzip.texi
index 69f44ae..845cb42 100644
--- a/doc/lzip.texi
+++ b/doc/lzip.texi
@@ -6,8 +6,8 @@
 @finalout
 @c %**end of header
 
-@set UPDATED 12 July 2015
-@set VERSION 1.17
+@set UPDATED 13 August 2015
+@set VERSION 1.18-pre1
 
 @dircategory Data Compression
 @direntry
@@ -41,6 +41,7 @@ This manual is for Lzip (version @value{VERSION}, @value{UPDATED}).
 * File format:: Detailed format of the compressed file
 * Algorithm:: How lzip compresses the data
 * Stream format:: Format of the LZMA stream in lzip files
+* Trailing data:: Extra data appended to the file
 * Examples:: A small tutorial with examples
 * Problems:: Reporting bugs
 * Reference source code:: Source code illustrating stream format
@@ -76,7 +77,7 @@ program can repair bit-flip errors (one of the most common forms of data
 corruption) in lzip files, and provides data recovery capabilities,
 including error-checked merging of damaged copies of a file.
 @ifnothtml
-@ref{Data safety,,,lziprecover}.
+@xref{Data safety,,,lziprecover}.
 @end ifnothtml
 
 @item
@@ -190,6 +191,13 @@ Print an informative help message describing the options and exit.
 @itemx --version
 Print the version number of lzip on the standard output and exit.
 
+@anchor{--trailing-error}
+@item -a
+@itemx --trailing-error
+Exit with error status 2 if any remaining input is detected after
+decompressing the last member. Such remaining input is usually trailing
+garbage that can be safely ignored. @xref{concat-example}.
+
 @item -b @var{bytes}
 @itemx --member-size=@var{bytes}
 Set the member size limit to @var{bytes}. A small member size may
@@ -204,7 +212,8 @@ uncompressed data as possible when decompressing a corrupt file.
 
 @item -d
 @itemx --decompress
-Decompress.
+Decompress the specified file(s). If a file fails to decompress, lzip
+exits immediately without decompressing the rest of the files.
 
 @item -f
 @itemx --force
@@ -263,7 +272,8 @@ EiB.
 @itemx --test
 Check integrity of the specified file(s), but don't decompress them.
 This really performs a trial decompression and throws away the result.
-Use it together with @samp{-v} to see information about the file.
+Use it together with @samp{-v} to see information about the file(s). If
+a file fails the test, lzip continues checking the rest of the files.
 
 @item -v
 @itemx --verbose
@@ -273,7 +283,7 @@ second @samp{-v} shows the progress of compression.@*
 When decompressing or testing, further -v's (up to 4) increase the
 verbosity level, showing status, compression ratio, dictionary size,
 trailer contents (CRC, data size, member size), and up to 6 bytes of
-trailing garbage (if any).
+trailing data (if any).
 
 @item -0 .. -9
 Set the compression parameters (dictionary size and match length limit)
@@ -334,9 +344,10 @@ caused lzip to panic.
 @chapter Design, development and testing of lzip
 @cindex quality assurance
 
-There are two ways of constructing a software design. One way is to make
-it so simple that there are obviously no deficiencies and the other is
-to make it so complicated that there are no obvious deficiencies.@*
+There are two ways of constructing a software design: One way is to make
+it so simple that there are obviously no deficiencies and the other way
+is to make it so complicated that there are no obvious deficiencies. The
+first method is far more difficult.@*
 --- C.A.R. Hoare
 
 Lzip has been designed, written and tested with great care to be the
@@ -521,7 +532,7 @@ Each member has the following structure:
 All multibyte values are stored in little endian order.
 
 @table @samp
-@item ID string
+@item ID string (the "magic" bytes)
 A four byte string, identifying the lzip format, with the value "LZIP"
 (0x4C, 0x5A, 0x49, 0x50).
 
@@ -659,7 +670,7 @@ What follows is a description of the decoding algorithm for LZMA-302eos
 streams using as reference the source code of "lzd", an educational
 decompressor for lzip files which can be downloaded from the lzip
 download directory. The source code of lzd is included in appendix A.
-@ref{Reference source code}
+@xref{Reference source code}.
 
 @sp 1
 @section What is coded
@@ -697,17 +708,38 @@ Lengths (the @samp{len} in the table above) are coded as follows:
 @end multitable
 
 @sp 1
-The coding of distances is a little more complicated. LZMA divides the
-interval between any two powers of 2 into 2 halves, named slots. As
-possible distances range from 0 to (2^32 - 1), there are 64 slots (0 to
-63). The slot number is context-coded in 6 bits. @samp{direct_bits} are
-the remaining bits (from 0 to 30) needed to form a complete distance,
-and are calculated as (slot >> 1) - 1. If a distance needs 6 or more
-direct_bits, the last 4 bits are coded separately. The last piece
-(direct_bits for distances 4 to 127 or the last 4 bits for distances >=
-128) is context-coded in reverse order (from LSB to MSB). For distances
->= 128, the @samp{direct_bits - 4} part is coded with fixed 0.5
-probability.
+The coding of distances is a little more complicated, so I'll begin
+explaining a simpler version of the encoding.
+
+Imagine you need to code a number from 0 to 2^32 - 1, and you want to do
+it in a way that produces shorter codes for the smaller numbers. You may
+first send the position of the most significant bit that is set to 1,
+which you may find by making a bit scan from the left (from the MSB). A
+position of 0 means that the number is 0 (no bit is set), 1 means the
+LSB is the first bit set (the number is 1), and 32 means the MSB is set
+(the number is >= 0x80000000). Lets call this bit position a "slot".
+Then, if slot is > 1, you send the remaining slot - 1 bits. Lets call
+these bits "direct_bits" because they are coded directly by value
+instead of indirectly by position.
+
+The inconvenient of this simple method is that it needs 6 bits to code
+the slot, but it just uses 33 of the 64 possible values, wasting almost
+half of the codes.
+
+The intelligent trick of LZMA is that it encodes the position of the
+most significant bit set, along with the value of the next bit, in the
+same 6 bits that would take to encode the position alone. This seems to
+need 66 slots (2 * position + next_bit), but for slots 0 and 1 there is
+no next bit, so the number of needed slots is 64 (0 to 63).
+
+The slot number is context-coded in 6 bits. @samp{direct_bits} is the
+amount of remaining bits (from 0 to 30) needed to form a complete
+distance, and is calculated as (slot >> 1) - 1. If a distance needs 6 or
+more direct_bits, the last 4 bits are coded separately. The last piece
+(all the direct_bits for distances 4 to 127 or the last 4 bits for
+distances >= 128) is context-coded in reverse order (from LSB to MSB).
+For distances >= 128, the @samp{direct_bits - 4} part is coded with
+fixed 0.5 probability.
 
 @multitable @columnfractions .5 .5
 @headitem Bit sequence @tab Description
@@ -845,6 +877,44 @@ sequences (matches, repeated matches, and literal bytes), until the "End
 Of Stream" marker is decoded.
 
 
+@node Trailing data
+@chapter Extra data appended to the file
+@cindex trailing data
+
+Sometimes extra data is found appended to a lzip file after the last
+member. Such trailing data may be:
+
+@itemize @bullet
+@item
+Padding added to make the file size a multiple of some block size, for
+example when writing to a tape.
+
+@item
+Garbage added by some not totally successful copy operation.
+
+@item
+Useful data added by the user; a cryptographically secure hash, a
+description of file contents, etc.
+
+@item
+Malicious data added to the file in order to make its total size and
+hash value (for a chosen hash) coincide with those of another file.
+
+@item
+In very rare cases, trailing data could be the corrupt header of another
+member. In multi-member or concatenated files the probability of
+corruption happening in the magic bytes is 5 times smaller than the
+probability of getting a false positive caused by the corruption of the
+integrity information itself. Therefore it can be considered to be below
+the noise level.
+@end itemize
+
+Trailing data can be safely ignored in most cases. In some cases, like
+user-added data, it is expected to be ignored. In those cases where a
+file containing trailing data must be rejected, the option
+@samp{--trailing-error} can be used. @xref{--trailing-error}.
+
+
 @node Examples
 @chapter A small tutorial with examples
 @cindex examples
@@ -903,8 +973,21 @@ lzip -c /dev/fd0 > file.lz
 @end example
 
 @sp 1
+@anchor{concat-example}
+@noindent
+Example 6: The right way of concatenating compressed files.
+@xref{Trailing data}.
+
+@example
+Don't do this
+  cat file1.lz file2.lz file3.lz | lzip -d
+Do this instead
+  lzip -cd file1.lz file2.lz file3.lz
+@end example
+
+@sp 1
 @noindent
-Example 6: Decompress @samp{file.lz} partially until 10 KiB of
+Example 7: Decompress @samp{file.lz} partially until 10 KiB of
 decompressed data are produced.
 
 @example
@@ -913,7 +996,7 @@ lzip -cd file.lz | dd bs=1024 count=10
 
 @sp 1
 @noindent
-Example 7: Decompress @samp{file.lz} partially from decompressed byte
+Example 8: Decompress @samp{file.lz} partially from decompressed byte
 10000 to decompressed byte 15000 (5000 bytes are produced).
 
 @example
@@ -922,7 +1005,7 @@ lzip -cd file.lz | dd bs=1000 skip=10 count=5
 
 @sp 1
 @noindent
-Example 8: Create a multivolume compressed tar archive with a volume
+Example 9: Create a multivolume compressed tar archive with a volume
 size of 1440 KiB.
 
 @example
@@ -931,7 +1014,7 @@ tar -c some_directory | lzip -S 1440KiB -o volume_name
 
 @sp 1
 @noindent
-Example 9: Extract a multivolume compressed tar archive.
+Example 10: Extract a multivolume compressed tar archive.
 
 @example
 lzip -cd volume_name*.lz | tar -xf -
@@ -939,7 +1022,7 @@ lzip -cd volume_name*.lz | tar -xf -
 
 @sp 1
 @noindent
-Example 10: Create a multivolume compressed backup of a large database
+Example 11: Create a multivolume compressed backup of a large database
 file with a volume size of 650 MB, where each volume is a multi-member
 file with a member size of 32 MiB.
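
Note on the distance coding described in the -697,17 hunk: the slot and direct_bits arithmetic can be sketched in a few lines of Python. This is only an illustration under assumptions. The function names distance_to_slot and slot_to_distance are hypothetical and do not come from the lzip or lzd sources, and the sketch covers only how a distance splits into a slot plus direct bits, not the range coding or the context modelling.

```python
def distance_to_slot(distance):
    """Split a distance (0 .. 2**32 - 1) into (slot, direct_bits, low).

    Hypothetical helper for illustration, not taken from lzip.
    slot        : 0..63, encodes the MSB position plus the next bit
    direct_bits : number of remaining bits, (slot >> 1) - 1
    low         : value of those remaining bits
    """
    if distance < 2:
        # Slots 0 and 1 have no "next bit" and no direct bits.
        return distance, 0, 0
    msb = distance.bit_length() - 1            # 0-based position of the top set bit
    next_bit = (distance >> (msb - 1)) & 1     # the bit just below the top bit
    slot = 2 * msb + next_bit
    direct_bits = (slot >> 1) - 1
    low = distance & ((1 << direct_bits) - 1)
    return slot, direct_bits, low


def slot_to_distance(slot, low):
    """Inverse of distance_to_slot: rebuild the distance from slot and low bits."""
    if slot < 4:
        # Distances 0 to 3 are fully determined by the slot.
        return slot
    direct_bits = (slot >> 1) - 1
    # The two top bits are "1" followed by the next bit stored in the slot.
    return ((2 | (slot & 1)) << direct_bits) + low


if __name__ == "__main__":
    for d in (0, 1, 2, 3, 4, 127, 128, 2**32 - 1):
        slot, direct_bits, low = distance_to_slot(d)
        assert slot_to_distance(slot, low) == d
        print(d, slot, direct_bits, low)
```

With this sketch, distance 127 maps to slot 13 with 5 direct bits and distance 128 maps to slot 14 with 6 direct bits, which lines up with the patch text saying that distances >= 128 are the ones needing 6 or more direct_bits.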