Diffstat (limited to 'doc/lzip.info')
-rw-r--r-- | doc/lzip.info | 210 |
1 files changed, 130 insertions, 80 deletions
diff --git a/doc/lzip.info b/doc/lzip.info index 71d8f8e..0210f9e 100644 --- a/doc/lzip.info +++ b/doc/lzip.info @@ -11,7 +11,7 @@ File: lzip.info, Node: Top, Next: Introduction, Up: (dir) Lzip Manual *********** -This manual is for Lzip (version 1.18-pre1, 13 August 2015). +This manual is for Lzip (version 1.18, 14 May 2016). * Menu: @@ -28,7 +28,7 @@ This manual is for Lzip (version 1.18-pre1, 13 August 2015). * Concept index:: Index of concepts - Copyright (C) 2008-2015 Antonio Diaz Diaz. + Copyright (C) 2008-2016 Antonio Diaz Diaz. This manual is free documentation: you have unlimited permission to copy, distribute and modify it. @@ -72,15 +72,14 @@ corrupt byte near the beginning is a thing of the past. The member trailer stores the 32-bit CRC of the original data, the size of the original data and the size of the member. These values, -together with the value remaining in the range decoder and the -end-of-stream marker, provide a 4 factor integrity checking which -guarantees that the decompressed version of the data is identical to -the original. This guards against corruption of the compressed data, -and against undetected bugs in lzip (hopefully very unlikely). The -chances of data corruption going undetected are microscopic. Be aware, -though, that the check occurs upon decompression, so it can only tell -you that something is wrong. It can't help you recover the original -uncompressed data. +together with the end-of-stream marker, provide a 3 factor integrity +checking which guarantees that the decompressed version of the data is +identical to the original. This guards against corruption of the +compressed data, and against undetected bugs in lzip (hopefully very +unlikely). The chances of data corruption going undetected are +microscopic. Be aware, though, that the check occurs upon +decompression, so it can only tell you that something is wrong. It +can't help you recover the original uncompressed data. 
Lzip uses the same well-defined exit status values used by bzip2, which makes it safer than compressors returning ambiguous warning @@ -127,14 +126,14 @@ two or more compressed files. The result is the concatenation of the corresponding uncompressed files. Integrity testing of concatenated compressed files is also supported. - Lzip can produce multi-member files and safely recover, with + Lzip can produce multimember files and safely recover, with lziprecover, the undamaged members in case of file damage. Lzip can also split the compressed output in volumes of a given size, even when reading from standard input. This allows the direct creation of multivolume compressed tar archives. Lzip is able to compress and decompress streams of unlimited size by -automatically creating multi-member output. The members so created are +automatically creating multimember output. The members so created are large, about 2 PiB each. @@ -147,6 +146,10 @@ The format for running lzip is: lzip [OPTIONS] [FILES] +'-' used as a FILE argument means standard input. It can be mixed with +other FILES and is read just once, the first time it appears in the +command line. + Lzip supports the following options: '-h' @@ -172,15 +175,19 @@ The format for running lzip is: '-c' '--stdout' - Compress or decompress to standard output. Needed when reading - from a named pipe (fifo) or from a device. Use it to recover as - much of the uncompressed data as possible when decompressing a - corrupt file. + Compress or decompress to standard output; keep input files + unchanged. If compressing several files, each file is compressed + independently. This option is needed when reading from a named + pipe (fifo) or from a device. Use it also to recover as much of + the uncompressed data as possible when decompressing a corrupt + file. '-d' '--decompress' - Decompress the specified file(s). If a file fails to decompress, - lzip exits immediately without decompressing the rest of the files. 
+ Decompress the specified file(s). If a file does not exist or + can't be opened, lzip continues decompressing the rest of the + files. If a file fails to decompress, lzip exits immediately + without decompressing the rest of the files. '-f' '--force' @@ -218,12 +225,13 @@ The format for running lzip is: '-s BYTES' '--dictionary-size=BYTES' - Set the dictionary size limit in bytes. Valid values range from 4 - KiB to 512 MiB. Lzip will use the smallest possible dictionary - size for each file without exceeding this limit. Note that - dictionary sizes are quantized. If the specified size does not - match one of the valid sizes, it will be rounded upwards by adding - up to (BYTES / 16) to it. + Set the dictionary size limit in bytes. Lzip will use the smallest + possible dictionary size for each file without exceeding this + limit. Valid values range from 4 KiB to 512 MiB. Values 12 to 29 + are interpreted as powers of two, meaning 2^12 to 2^29 bytes. Note + that dictionary sizes are quantized. If the specified size does + not match one of the valid sizes, it will be rounded upwards by + adding up to (BYTES / 8) to it. For maximum compression you should use a dictionary size limit as large as possible, but keep in mind that the decompression memory @@ -235,9 +243,9 @@ The format for running lzip is: Split the compressed output into several volume files with names 'original_name00001.lz', 'original_name00002.lz', etc, and set the volume size limit to BYTES. Each volume is a complete, maybe - multi-member, lzip file. A small volume size may degrade - compression ratio, so use it only when needed. Valid values range - from 100 kB to 4 EiB. + multimember, lzip file. A small volume size may degrade compression + ratio, so use it only when needed. Valid values range from 100 kB + to 4 EiB. '-t' '--test' @@ -259,14 +267,14 @@ The format for running lzip is: '-0 .. -9' Set the compression parameters (dictionary size and match length - limit) as shown in the table below. 
Note that '-9' can be much - slower than '-0'. These options have no effect when decompressing. + limit) as shown in the table below. The default compression level + is '-6'. Note that '-9' can be much slower than '-0'. These + options have no effect when decompressing. The bidimensional parameter space of LZMA can't be mapped to a linear scale optimal for all files. If your files are large, very - repetitive, etc, you may need to use the '--match-length' and - '--dictionary-size' options directly to achieve optimal - performance. + repetitive, etc, you may need to use the '--dictionary-size' and + '--match-length' options directly to achieve optimal performance. Level Dictionary size Match length limit -0 64 KiB 16 bytes @@ -334,21 +342,21 @@ file format. Today those limitations have mostly disappeared, and the format of gzip has proved to be unnecessarily complicated. It includes fields -that were never used, others that have lost its usefulness, and finally -others that have become too limited. +that were never used, others that have lost their usefulness, and +finally others that have become too limited. Bzip2 was designed 5 years later, and its format is simpler than the one of gzip. Probably the worst defect of the gzip format from the point of view of data safety is the variable size of its header. If the byte at -offset 3 (flags) of a gzip member gets corrupted, it mat become very +offset 3 (flags) of a gzip member gets corrupted, it may become very difficult to recover the data, even if the compressed blocks are intact, because it can't be known with certainty where the compressed blocks begin. By contrast, the header of a lzip member has a fixed length of 6. The -lzma stream in a lzip member always starts at offset 6, making it +LZMA stream in a lzip member always starts at offset 6, making it trivial to recover the data even if the whole header becomes corrupt. 
Bzip2 also provides a header of fixed length and marks the begin and @@ -358,9 +366,24 @@ not store the size of each compressed block, as lzip does. Lzip provides better data recovery capabilities than any other gzip-like compressor because its format has been designed from the -beginning to be simple and safe. It would be very difficult to write an +beginning to be simple and safe. It also helps that the LZMA data +stream as used by lzip is extraordinarily safe. It provides embedded +error detection. Any distance larger than the dictionary size acts as a +forbidden symbol, allowing the decompressor to detect the approximate +position of errors, and leaving very little work for the check sequence +(CRC and data sizes) in the detection of errors. Lzip is usually able +to detect all posible bit-flips in the compressed data without +resorting to the check sequence. It would be very difficult to write an automatic recovery tool like lziprecover for the gzip format. And, as -far as I know, it has never been writen. +far as I know, it has never been written. + + Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the +decompressed data because it provides more accurate error detection than +CRC64 up to a compressed size of about 16 GiB, a size larger than that +of most files. In the case of lzip, the additional detection capability +of the decompressor reduces the probability of undetected errors more +than a million times, making CRC32 more accurate than CRC64 up to about +20 PiB of compressed size. The lzip format is designed for long-term archiving. Therefore it excludes any unneeded features that may interfere with the future @@ -409,7 +432,7 @@ extraction of the uncompressed data. Bzip2 does not store the uncompressed size of the file. The lzip format provides a 64-bit field for the uncompressed size. 
- Additionaly, lzip produces multi-member output automatically when + Additionaly, lzip produces multimember output automatically when the size is too large for a single member, allowing for an unlimited uncompressed size. @@ -428,8 +451,16 @@ extraction of the uncompressed data. 3.2 Quality of implementation ============================= +'Accurate and robust error detection' + The lzip format provides 3 factor integrity checking and the + decompressors report mismatches in each factor separately. This + way if just one byte in one factor fails but the other two factors + match the data, it probably means that the data are intact and the + corruption just affects the mismatching factor (CRC or data size) + in the check sequence. + 'Multiple implementations' - Just like the lzip format provides 4 factor protection against + Just like the lzip format provides 3 factor protection against undetected data corruption, the development methodology of the lzip family of compressors provides 3 factor protection against undetected programming errors. @@ -443,6 +474,11 @@ extraction of the uncompressed data. serious undiscovered errors. In fact, no errors have been discovered in lzip since 2009. + Additionally, the three implementations have been extensively + tested with unzcrash, valgrind and 'american fuzzy lop' without + finding a single vulnerability or false negative. *Note Unzcrash: + (lziprecover)Unzcrash. + 'Dictionary size' Lzip automatically uses the smallest possible dictionary size for each file. In addition to reducing the amount of memory required @@ -485,7 +521,7 @@ additional information before, between, or after them. 
Each member has the following structure: +--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ -| ID string | VN | DS | Lzma stream | CRC32 | Data size | Member size | +| ID string | VN | DS | LZMA stream | CRC32 | Data size | Member size | +--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ All multibyte values are stored in little endian order. @@ -508,8 +544,8 @@ additional information before, between, or after them. Example: 0xD3 = 2^19 - 6 * 2^15 = 512 KiB - 6 * 32 KiB = 320 KiB Valid values for dictionary size range from 4 KiB to 512 MiB. -'Lzma stream' - The lzma stream, finished by an end of stream marker. Uses default +'LZMA stream' + The LZMA stream, finished by an end of stream marker. Uses default values for encoder properties. *Note Stream format::, for a complete description. @@ -523,7 +559,7 @@ additional information before, between, or after them. Total size of the member, including header and trailer. This field acts as a distributed index, allows the verification of stream integrity, and facilitates safe recovery of undamaged members from - multi-member files. + multimember files. @@ -603,7 +639,9 @@ properties", to adjust it for some kinds of binary data. These parameters are; 'literal_context_bits' (with a default value of 3), 'literal_pos_state_bits' (with a default value of 0), and 'pos_state_bits' (with a default value of 2). As a general purpose -compressor, lzip only uses the default values for these parameters. +compressor, lzip only uses the default values for these parameters. In +particular 'literal_pos_state_bits' has been optimized away and does +not even appear in the code. 
Lzip also finishes the LZMA stream with an "End Of Stream" marker (the distance-length pair 0xFFFFFFFFU, 2), which in conjunction with the @@ -655,7 +693,7 @@ Bit sequence Name Description used distance - In the following tables, multi-bit sequences are coded in normal + In the following tables, multibit sequences are coded in normal order, from MSB to LSB, except where noted otherwise. Lengths (the 'len' in the table above) are coded as follows: @@ -676,10 +714,10 @@ You may first send the position of the most significant bit that is set to 1, which you may find by making a bit scan from the left (from the MSB). A position of 0 means that the number is 0 (no bit is set), 1 means the LSB is the first bit set (the number is 1), and 32 means the -MSB is set (the number is >= 0x80000000). Lets call this bit position a -"slot". Then, if slot is > 1, you send the remaining slot - 1 bits. -Lets call these bits "direct_bits" because they are coded directly by -value instead of indirectly by position. +MSB is set (i.e., the number is >= 0x80000000). Lets call this bit +position a "slot". Then, if slot is > 1, you send the remaining slot - +1 bits. Lets call these bits "direct_bits" because they are coded +directly by value instead of indirectly by position. The inconvenient of this simple method is that it needs 6 bits to code the slot, but it just uses 33 of the 64 possible values, wasting @@ -849,15 +887,15 @@ member. Such trailing data may be: file. * In very rare cases, trailing data could be the corrupt header of - another member. In multi-member or concatenated files the + another member. In multimember or concatenated files the probability of corruption happening in the magic bytes is 5 times smaller than the probability of getting a false positive caused by the corruption of the integrity information itself. Therefore it can be considered to be below the noise level. Trailing data can be safely ignored in most cases. 
In some cases, -like user-added data, it is expected to be ignored. In those cases -where a file containing trailing data must be rejected, the option +like that of user-added data, it is expected to be ignored. In those +cases where a file containing trailing data must be rejected, the option '--trailing-error' can be used. *Note --trailing-error::. @@ -869,7 +907,7 @@ File: lzip.info, Node: Examples, Next: Problems, Prev: Trailing data, Up: To WARNING! Even if lzip is bug-free, other causes may result in a corrupt compressed file (bugs in the system libraries, memory errors, etc). Therefore, if the data you are going to compress are important, give the -'--keep' option to lzip and do not remove the original file until you +'--keep' option to lzip and don't remove the original file until you verify the compressed file with a command like 'lzip -cd file.lz | cmp file -'. @@ -880,8 +918,8 @@ and show the compression ratio. lzip -v file -Example 2: Like example 1 but the created 'file.lz' is multi-member -with a member size of 1 MiB. The compression ratio is not shown. +Example 2: Like example 1 but the created 'file.lz' is multimember with +a member size of 1 MiB. The compression ratio is not shown. lzip -b 1MiB file @@ -898,10 +936,10 @@ show status. lzip -tv file.lz -Example 5: Compress a whole floppy in /dev/fd0 and send the output to +Example 5: Compress a whole device in /dev/sdc and send the output to 'file.lz'. - lzip -c /dev/fd0 > file.lz + lzip -c /dev/sdc > file.lz Example 6: The right way of concatenating compressed files. *Note @@ -937,7 +975,7 @@ Example 10: Extract a multivolume compressed tar archive. Example 11: Create a multivolume compressed backup of a large database -file with a volume size of 650 MB, where each volume is a multi-member +file with a volume size of 650 MB, where each volume is a multimember file with a member size of 32 MiB. 
lzip -b 32MiB -S 650MB big_db @@ -964,10 +1002,18 @@ Appendix A Reference source code ******************************** /* Lzd - Educational decompressor for the lzip format - Copyright (C) 2013-2015 Antonio Diaz Diaz. + Copyright (C) 2013-2016 Antonio Diaz Diaz. + + This program is free software. Redistribution and use in source and + binary forms, with or without modification, are permitted provided + that the following conditions are met: + + 1. Redistributions of source code must retain the above copyright + notice, this list of conditions and the following disclaimer. - This program is free software: you have unlimited permission - to copy, distribute and modify it. + 2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions and the following disclaimer in the + documentation and/or other materials provided with the distribution. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of @@ -1017,6 +1063,7 @@ enum { min_dictionary_size = 1 << 12, max_dictionary_size = 1 << 29, literal_context_bits = 3, + literal_pos_state_bits = 0, // not used pos_state_bits = 2, pos_states = 1 << pos_state_bits, pos_state_mask = pos_states - 1, @@ -1203,6 +1250,7 @@ class LZ_decoder unsigned pos; // current pos in buffer unsigned stream_pos; // first byte not yet written to stdout uint32_t crc_; + bool pos_wrapped; void flush_data(); @@ -1227,7 +1275,8 @@ public: buffer( new uint8_t[dictionary_size] ), pos( 0 ), stream_pos( 0 ), - crc_( 0xFFFFFFFFU ) + crc_( 0xFFFFFFFFU ), + pos_wrapped( false ) { buffer[dictionary_size-1] = 0; } // prev_byte of first byte ~LZ_decoder() { delete[] buffer; } @@ -1249,7 +1298,8 @@ void LZ_decoder::flush_data() if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size ) { std::fprintf( stderr, "Write error: %s\n", std::strerror( errno ) ); std::exit( 1 ); } - if( pos >= dictionary_size ) { partial_data_pos += pos; pos = 0; } + 
if( pos >= dictionary_size ) + { partial_data_pos += pos; pos = 0; pos_wrapped = true; } stream_pos = pos; } } @@ -1345,7 +1395,7 @@ bool LZ_decoder::decode_member() // Returns false if error } } state.set_match(); - if( rep0 >= dictionary_size || rep0 >= data_position() ) + if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) ) { flush_data(); return false; } } for( int i = 0; i < len; ++i ) put_byte( peek( rep0 ) ); @@ -1367,7 +1417,7 @@ int main( const int argc, const char * const argv[] ) "It is not safe to use lzd for any real work.\n" "\nUsage: %s < file.lz > file\n", argv[0] ); std::printf( "Lzd decompresses from standard input to standard output.\n" - "\nCopyright (C) 2015 Antonio Diaz Diaz.\n" + "\nCopyright (C) 2016 Antonio Diaz Diaz.\n" "This is free software: you are free to change and redistribute it.\n" "There is NO WARRANTY, to the extent permitted by law.\n" "Report bugs to lzip-bug@nongnu.org\n" @@ -1445,19 +1495,19 @@ Concept index Tag Table: Node: Top208 -Node: Introduction1153 -Node: Invoking lzip6126 -Ref: --trailing-error6536 -Node: Quality assurance12171 -Node: File format18728 -Node: Algorithm21133 -Node: Stream format23959 -Node: Trailing data34502 -Node: Examples35873 -Ref: concat-example37048 -Node: Problems38049 -Node: Reference source code38579 -Node: Concept index52232 +Node: Introduction1145 +Node: Invoking lzip6071 +Ref: --trailing-error6635 +Node: Quality assurance12628 +Node: File format20782 +Node: Algorithm23186 +Node: Stream format26012 +Node: Trailing data36660 +Node: Examples38038 +Ref: concat-example39211 +Node: Problems40211 +Node: Reference source code40741 +Node: Concept index54957 End Tag Table |
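Note on the dictionary-size hunks above: the patch lets '-s' accept the values 12 to 29 as powers of two, and the File format node gives the worked example 0xD3 = 2^19 - 6 * 2^15 = 320 KiB for the DS header byte. The following is a minimal sketch of both rules, not code taken from lzip. The function names are hypothetical. The bit layout of the DS byte (bits 4-0 hold the base-2 logarithm of the base size, bits 7-5 hold the number of sixteenths to subtract) is assumed from the unchanged part of the manual, which this diff does not show.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an lzip function): decode the DS header byte.
// Assumed layout: bits 4-0 = log2 of the base size, bits 7-5 = how many
// sixteenths of the base size to subtract from it.
unsigned decode_dictionary_size( const uint8_t ds )
  {
  unsigned size = 1U << ( ds & 0x1F );
  size -= ( size / 16 ) * ( ( ds >> 5 ) & 7 );
  return size;
  }

// Hypothetical helper: interpret a '-s' argument as the patched manual
// describes it. Values 12 to 29 mean 2^12 to 2^29 bytes; any other value
// is taken as a byte count and returned unchanged (no range check here).
unsigned long long interpret_size_argument( const unsigned long long value )
  {
  if( value >= 12 && value <= 29 ) return 1ULL << value;
  return value;
  }

// Re-checks the manual's worked example; call it from a test or from main().
void check_manual_example()
  { assert( decode_dictionary_size( 0xD3 ) == 320 * 1024 ); }
```

The sketch deliberately leaves out the 4 KiB to 512 MiB validity check and the upward rounding of '-s' values to a valid quantized size, since the diff states those rules only in prose.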