Diffstat
-rw-r--r-- | doc/clzip.info | 150 |
1 files changed, 74 insertions, 76 deletions
diff --git a/doc/clzip.info b/doc/clzip.info
index adcc658..170c2e3 100644
--- a/doc/clzip.info
+++ b/doc/clzip.info
@@ -11,7 +11,7 @@ File: clzip.info, Node: Top, Next: Introduction, Up: (dir)
 Clzip Manual
 ************
 
-This manual is for Clzip (version 1.15-rc1, 23 November 2024).
+This manual is for Clzip (version 1.15, 10 January 2025).
 
 * Menu:
 
@@ -30,7 +30,7 @@ This manual is for Clzip (version 1.15-rc1, 23 November 2024).
 * Concept index::          Index of concepts
 
 
-   Copyright (C) 2010-2024 Antonio Diaz Diaz.
+   Copyright (C) 2010-2025 Antonio Diaz Diaz.
 
    This manual is free documentation: you have unlimited permission to copy,
 distribute, and modify it.
@@ -68,12 +68,10 @@ alignment between tar members and lzip members. *Note tarlz manual:
    The lzip file format is designed for data sharing and long-term archiving,
 taking into account both data integrity and decoder availability:
 
-   * The lzip format provides very safe integrity checking and some data
-     recovery means. The program lziprecover can repair bit flip errors
-     (one of the most common forms of data corruption) in lzip files, and
-     provides data recovery capabilities, including error-checked merging
-     of damaged copies of a file. *Note Data safety: (lziprecover)Data
-     safety.
+   * The program lziprecover can repair bit flip errors (one of the most
+     common forms of data corruption) in lzip files, and provides data
+     recovery capabilities, including error-checked merging of damaged
+     copies of a file. *Note Data safety: (lziprecover)Data safety.
 
    * The lzip format is as simple as possible (but not simpler). The lzip
      manual provides the source code of a simple decompressor along with a
@@ -92,13 +90,12 @@ byte near the beginning is a thing of the past.
 
    The member trailer stores the 32-bit CRC of the original data, the size of
 the original data, and the size of the member. These values, together
-with the 'End Of Stream' marker, provide a 3-factor integrity checking which
-guarantees that the decompressed version of the data is identical to the
-original. This guards against corruption of the compressed data, and against
-undetected bugs in clzip (hopefully very unlikely). The chances of data
-corruption going undetected are microscopic. Be aware, though, that the
-check occurs upon decompression, so it can only tell you that something is
-wrong. It can't help you recover the original uncompressed data.
+with the 'End Of Stream' marker, provide a 3-factor integrity checking that
+guards against corruption of the compressed data and against undetected bugs
+in clzip (hopefully very unlikely). The chances of data corruption going
+undetected are microscopic. Be aware, though, that the check occurs upon
+decompression, so it can only tell you that something is wrong. It can't
+help you recover the original uncompressed data.
 
    Clzip uses the same well-defined exit status values used by bzip2, which
 makes it safer than compressors returning ambiguous warning values (like
@@ -300,7 +297,8 @@ clzip supports the following options: *Note Argument syntax::.
      When compressing, set the match length limit in bytes. After a match
      this long is found, the search is finished. Valid values range from 5
      to 273. Larger values usually give better compression ratios but
-     longer compression times.
+     longer compression times. A match is a Lempel-Ziv back-reference coded
+     as a distance-length pair.
 
 '-o FILE'
 '--output=FILE'
@@ -569,8 +567,8 @@ The LZMA algorithm has three parameters, called 'special LZMA properties',
 to adjust it for some kinds of binary data. These parameters are:
 'literal_context_bits' (with a default value of 3),
 'literal_pos_state_bits' (with a default value of 0), and 'pos_state_bits'
-(with a default value of 2). As a general purpose compressor, lzip only
-uses the default values for these parameters. In particular
+(with a default value of 2). As a general purpose compressed format, lzip
+only uses the default values for these parameters. In particular
 'literal_pos_state_bits' has been optimized away and does not even appear in
 the code.
 
@@ -615,7 +613,7 @@ reusing a recently used distance). There are 7 different coding sequences:
 Bit sequence               Name        Description
 -----------------------------------------------------------------------------
 0 + byte                   literal     literal byte
-1 + 0 + len + dis          match       distance-length pair
+1 + 0 + len + dis          match       LZ distance-length pair
 1 + 1 + 0 + 0              shortrep    1 byte match at latest used distance
 1 + 1 + 0 + 1 + len        rep0        len bytes match at latest used distance
 1 + 1 + 1 + 0 + len        rep1        len bytes match at second latest used
@@ -670,7 +668,8 @@ a complete distance, and is calculated as (slot >> 1) - 1. If a distance
 needs 6 or more direct_bits, the last 4 bits are encoded separately. The
 last piece (all the direct_bits for distances 4 to 127 (slots 4 to 13), or
 the last 4 bits for distances >= 128 (slot >= 14)) is context-coded in
-reverse order (from LSB to MSB). For distances >= 128, the
+reverse order (from LSB to MSB) because between distances the LSB tends to
+correlate better than more significant bits. For distances >= 128, the
 'direct_bits - 4' part is encoded with fixed 0.5 probability.
 
 Bit sequence               Description
@@ -689,9 +688,8 @@ integers representing the probability of the corresponding bit being 0.
    The indices used in these arrays are:
 
 'state'
-     A state machine ('State' in the source) with 12 states (0 to 11),
-     coding the latest 2 to 4 types of sequences processed. The initial
-     state is 0.
+     A state machine ('State' in the source) with 12 states (0 to 11) coding
+     the latest 2 to 4 types of sequences processed. The initial state is 0.
 
 'pos_state'
      Value of the 2 least significant bits of the current position in the
@@ -819,10 +817,10 @@ been reviewed carefully and is believed to be free from design errors.
 7.1 Format design
 =================
 
-When gzip was designed in 1992, computers and operating systems were much
-less capable than they are today. The designers of gzip tried to work around
-some of those limitations, like 8.3 file names, with additional fields in
-the file format.
+When gzip was designed in 1992, computers and operating systems were less
+capable than they are today. The designers of gzip tried to work around some
+of those limitations, like 8.3 file names, with additional fields in the
+file format.
 
    Today those limitations have mostly disappeared, and the format of gzip
 has proved to be unnecessarily complicated. It includes fields that were
@@ -830,7 +828,8 @@ never used, others that have lost their usefulness, and finally others that
 have become too limited.
 
    Bzip2 was designed 5 years later, and its format is simpler than the one
-of gzip.
+of gzip. Both gzip and bzip2 lack the fields required to implement a
+reliable and efficient '--list' operation.
 
    Probably the worst defect of the gzip format from the point of view of
 data safety is the variable size of its header. If the byte at offset 3
@@ -852,21 +851,23 @@ the lzip format is extraordinarily safe.
    The simple and safe design of the file format complements the embedded
 error detection provided by the LZMA data stream. Any distance larger than
 the dictionary size acts as a forbidden symbol, allowing the decompressor
 to detect the approximate
-position of errors, and leaving very little work for the check sequence
-(CRC and data sizes) in the detection of errors. Lzip is usually able to
-detect all possible bit flips in the compressed data without resorting to
-the check sequence. It would be difficult to write an automatic recovery
-tool like lziprecover for the gzip format. And, as far as I know, it has
-never been written.
+position of errors, and leaving little work for the check sequence (CRC and
+data sizes) in the detection of errors. Lzip is usually able to detect all
+possible bit flips in the compressed data without resorting to the check
+sequence. It would be difficult to write an automatic recovery tool like
+lziprecover for the gzip format. And, as far as I know, it has never been
+written.
 
    Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the
 decompressed data because it provides optimal accuracy in the detection of
 errors up to a compressed size of about 16 GiB, a size larger than that of
 most files. In the case of lzip, the additional detection capability of the
-decompressor reduces the probability of undetected errors several million
+decompressor reduces the probability of undetected errors about 50 million
 times more, resulting in a combined integrity checking optimally accurate
-for any member size produced by lzip. Preliminary results suggest that the
-lzip format is safe enough to be used in critical safety avionics systems.
+for any member size produced by lzip. Moreover, a CRC is better than a hash
+of the same size for detection of errors in lzip files because the
+decompressor catches almost all the large errors, while the CRC guarantees
+the detection of the small errors (which the hash does not).
 
    The lzip format is designed for long-term archiving. Therefore it
 excludes any unneeded features that may interfere with the future
@@ -877,11 +878,9 @@ extraction of the decompressed data.
 
 'Multiple algorithms'
      Gzip provides a CM (Compression Method) field that has never been used
-     because it is a bad idea to begin with. New compression methods may
-     require additional fields, making it impossible to implement new
-     methods and, at the same time, keep the same format. This field does
-     not solve the problem of format proliferation; it just makes the
-     problem less obvious.
+     because it is too limiting. New compression methods may require
+     additional fields, making it impossible to implement new methods and,
+     at the same time, keep the same format.
 
 'Optional fields in header'
      Unless special precautions are taken, optional fields are generally a
@@ -892,13 +891,12 @@ extraction of the decompressed data.
      find neither the header CRC nor the compressed blocks.
 
 'Optional CRC for the header'
-     Using an optional CRC for the header is not only a bad idea, it is an
-     error; it circumvents the Hamming distance (HD) of the CRC and may
-     prevent the extraction of perfectly good data. For example, if the CRC
-     is used and the bit enabling it is reset by a bit flip, then the
-     header seems to be intact (in spite of being corrupt) while the
-     compressed blocks seem to be totally unrecoverable (in spite of being
-     intact). Very misleading indeed.
+     Using an optional CRC for the header circumvents the Hamming distance
+     (HD) of the CRC and may prevent the extraction of good data. For
+     example, if the CRC is used and the bit enabling it is reset by a bit
+     flip, then the header seems to be intact (in spite of being corrupt)
+     while the compressed blocks seem to be unrecoverable (in spite of
+     being intact).
 
 'Metadata'
      The gzip format stores some metadata, like the modification time of the
@@ -925,9 +923,9 @@ extraction of the decompressed data.
 
 'Distributed index'
      The lzip format provides a distributed index that, among other things,
-     helps plzip to decompress several times faster than pigz and helps
-     lziprecover do its job. Neither the gzip format nor the bzip2 format
-     do provide an index.
+     helps plzip to decompress faster than pigz and helps lziprecover do
+     its job. Neither the gzip format nor the bzip2 format do provide an
+     index.
 
      A distributed index is safer and more scalable than a monolithic
      index. The monolithic index introduces a single point of failure in
@@ -960,7 +958,7 @@ software.
      Three related but independent compressor implementations, lzip, clzip,
      and minilzip/lzlib, are developed concurrently. Every stable release of
      any of them is tested to check that it produces identical output to
-     the other two. This guarantees that all three implement the same
+     the other two. This corroborates that all three implement the same
      algorithm, and makes it unlikely that any of them may contain serious
      undiscovered errors. In fact, no errors have been discovered in lzip
      since 2009.
@@ -1207,7 +1205,7 @@ Appendix A Reference source code
 ********************************
 
 /* Lzd - Educational decompressor for the lzip format
-   Copyright (C) 2013-2024 Antonio Diaz Diaz.
+   Copyright (C) 2013-2025 Antonio Diaz Diaz.
 
    This program is free software. Redistribution and use in source and
    binary forms, with or without modification, are permitted provided
@@ -1258,9 +1256,9 @@ public:
     const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 };
     st = next[st];
     }
-  void set_match()     { st = ( st < 7 ) ? 7 : 10; }
-  void set_rep()       { st = ( st < 7 ) ? 8 : 11; }
-  void set_short_rep() { st = ( st < 7 ) ? 9 : 11; }
+  void set_match()    { st = ( st < 7 ) ? 7 : 10; }
+  void set_rep()      { st = ( st < 7 ) ? 8 : 11; }
+  void set_shortrep() { st = ( st < 7 ) ? 9 : 11; }
   };
 
 
@@ -1564,7 +1562,7 @@ bool LZ_decoder::decode_member() // Return false if error
         if( rdec.decode_bit( bm_rep0[state()] ) == 0 ) // 3rd bit
           {
           if( rdec.decode_bit( bm_len[state()][pos_state] ) == 0 ) // 4th bit
-            { state.set_short_rep(); put_byte( peek( rep0 ) ); continue; }
+            { state.set_shortrep(); put_byte( peek( rep0 ) ); continue; }
           }
         else
           {
@@ -1631,7 +1629,7 @@ int main( const int argc, const char * const argv[] )
           "See the lzip manual for an explanation of the code.\n"
           "\nUsage: %s [-d] < file.lz > file\n"
           "Lzd decompresses from standard input to standard output.\n"
-          "\nCopyright (C) 2024 Antonio Diaz Diaz.\n"
+          "\nCopyright (C) 2025 Antonio Diaz Diaz.\n"
           "License 2-clause BSD.\n"
           "This is free software: you are free to change and redistribute "
           "it.\nThere is NO WARRANTY, to the extent permitted by law.\n"
@@ -1729,23 +1727,23 @@ Concept index
 
 Tag Table:
 Node: Top205
-Node: Introduction1282
-Node: Output7168
-Node: Invoking clzip8771
-Ref: --trailing-error9617
-Node: Argument syntax19833
-Node: File format21597
-Ref: coded-dict-size23096
-Node: Stream format24328
-Ref: what-is-coded26853
-Node: Quality assurance35583
-Node: Algorithm44382
-Node: Trailing data47784
-Node: Examples50118
-Ref: concat-example51564
-Node: Problems52788
-Node: Reference source code53324
-Node: Concept index68636
+Node: Introduction1277
+Node: Output6979
+Node: Invoking clzip8582
+Ref: --trailing-error9428
+Node: Argument syntax19721
+Node: File format21485
+Ref: coded-dict-size22984
+Node: Stream format24216
+Ref: what-is-coded26748
+Node: Quality assurance35562
+Node: Algorithm44357
+Node: Trailing data47759
+Node: Examples50093
+Ref: concat-example51539
+Node: Problems52763
+Node: Reference source code53299
+Node: Concept index68607
 
 End Tag Table
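
Note on the State hunk: the rename from set_short_rep to set_shortrep (plus the realignment of the two neighbouring setters) should leave the 12-state machine's behaviour unchanged. The following is a minimal standalone sketch for checking the transitions by hand, not the Lzd source itself. The three setters and the 'next' table are copied from the patched lines, and operator() mirrors the state() calls seen in decode_member. The constructor, the method name set_char, the bounds assert, and the replay() helper with its one-letter event codes are assumptions added only for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Sketch of the state machine from the diff: 12 states (0 to 11), initial
// state 0, as described in the manual text quoted in the patch.
class State
  {
  int st;
public:
  enum { states = 12 };
  State() : st( 0 ) {}                    // assumed: manual says initial state is 0
  int operator()() const { return st; }   // mirrors the state() calls in the diff
  void set_char()                         // name assumed; body taken from the diff context
    {
    const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 };
    assert( st >= 0 && st < states );     // illustrative guard, not in the diff
    st = next[st];
    }
  void set_match()    { st = ( st < 7 ) ? 7 : 10; }
  void set_rep()      { st = ( st < 7 ) ? 8 : 11; }
  void set_shortrep() { st = ( st < 7 ) ? 9 : 11; }
  };

// Hypothetical helper: replay a string of event codes and return the state
// reached after each one. 'c' = literal, 'm' = match, 'r' = rep,
// 's' = shortrep. Unknown codes are skipped.
std::vector<int> replay( const std::string & events )
  {
  State s;
  std::vector<int> visited;
  for( char e : events )
    {
    switch( e )
      {
      case 'c': s.set_char(); break;
      case 'm': s.set_match(); break;
      case 'r': s.set_rep(); break;
      case 's': s.set_shortrep(); break;
      default: continue;                  // skip codes this sketch does not know
      }
    visited.push_back( s() );
    }
  return visited;
  }
```

For example, replay( "msccr" ) visits states 7, 11, 5, 2, 8: a match from a literal state goes to 7, a shortrep from a non-literal state goes to 11, two literals step down through the 'next' table to 5 and then 2, and a rep from a literal state goes to 8.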