diff options
-rw-r--r-- | AUTHORS | 7 | ||||
-rw-r--r-- | COPYING | 338 | ||||
-rw-r--r-- | ChangeLog | 334 | ||||
-rw-r--r-- | INSTALL | 81 | ||||
-rw-r--r-- | Makefile.in | 134 | ||||
-rw-r--r-- | NEWS | 11 | ||||
-rw-r--r-- | README | 135 | ||||
-rw-r--r-- | arg_parser.cc | 197 | ||||
-rw-r--r-- | arg_parser.h | 110 | ||||
-rwxr-xr-x | configure | 226 | ||||
-rw-r--r-- | decoder.cc | 284 | ||||
-rw-r--r-- | decoder.h | 344 | ||||
-rw-r--r-- | doc/lzip.1 | 130 | ||||
-rw-r--r-- | doc/lzip.info | 1708 | ||||
-rw-r--r-- | doc/lzip.texi | 1783 | ||||
-rw-r--r-- | encoder.cc | 594 | ||||
-rw-r--r-- | encoder.h | 290 | ||||
-rw-r--r-- | encoder_base.cc | 193 | ||||
-rw-r--r-- | encoder_base.h | 483 | ||||
-rw-r--r-- | fast_encoder.cc | 184 | ||||
-rw-r--r-- | fast_encoder.h | 61 | ||||
-rw-r--r-- | list.cc | 113 | ||||
-rw-r--r-- | lzip.h | 358 | ||||
-rw-r--r-- | lzip_index.cc | 214 | ||||
-rw-r--r-- | lzip_index.h | 91 | ||||
-rw-r--r-- | main.cc | 1071 | ||||
-rwxr-xr-x | testsuite/check.sh | 441 | ||||
-rw-r--r-- | testsuite/fox.lz | bin | 0 -> 80 bytes | |||
-rw-r--r-- | testsuite/fox_bcrc.lz | bin | 0 -> 80 bytes | |||
-rw-r--r-- | testsuite/fox_crc0.lz | bin | 0 -> 80 bytes | |||
-rw-r--r-- | testsuite/fox_das46.lz | bin | 0 -> 80 bytes | |||
-rw-r--r-- | testsuite/fox_de20.lz | bin | 0 -> 80 bytes | |||
-rw-r--r-- | testsuite/fox_mes81.lz | bin | 0 -> 80 bytes | |||
-rw-r--r-- | testsuite/fox_s11.lz | bin | 0 -> 80 bytes | |||
-rw-r--r-- | testsuite/fox_v2.lz | bin | 0 -> 80 bytes | |||
-rw-r--r-- | testsuite/test.txt | 676 | ||||
-rw-r--r-- | testsuite/test.txt.lz | bin | 0 -> 7376 bytes | |||
-rw-r--r-- | testsuite/test_em.txt.lz | bin | 0 -> 14024 bytes |
38 files changed, 10591 insertions, 0 deletions
@@ -0,0 +1,7 @@ +Lzip was written by Antonio Diaz Diaz. + +The ideas embodied in lzip are due to (at least) the following people: +Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for the +definition of Markov chains), G.N.N. Martin (for the definition of range +encoding), Igor Pavlov (for putting all the above together in LZMA), and +Julian Seward (for bzip2's CLI). @@ -0,0 +1,338 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + <one line to give the program's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) <year> <name of author> + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + <signature of Ty Coon>, 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. diff --git a/ChangeLog b/ChangeLog new file mode 100644 index 0000000..7dfae8d --- /dev/null +++ b/ChangeLog @@ -0,0 +1,334 @@ +2022-01-24 Antonio Diaz Diaz <antonio@gnu.org> + + * Version 1.23 released. + * Decompression time has been reduced by 5-12% depending on the file. + * main.cc (getnum): Show option name and valid range if error. + * Improve several descriptions in manual, '--help', and man page. + * lzip.texi: Change GNU Texinfo category to 'Compression'. + (Reported by Alfred M. Szmidt). + +2021-01-04 Antonio Diaz Diaz <antonio@gnu.org> + + * Version 1.22 released. + * main.cc (main): Report an error if a file name is empty. + Make '-o' behave like '-c', but writing to file instead of stdout. + Make '-c' and '-o' check whether the output is a terminal only once. + Do not open output if input is a terminal. + * configure: Build, check, and install without 'make'. + * Replace 'decompressed', 'compressed' with 'out', 'in' in output. + * lzip_index.cc: Improve messages for corruption in last header. + * main.cc: Set a valid invocation_name even if argc == 0. + * Document extraction from tar.lz in manual, '--help', and man page. + * lzip.texi (Introduction): Mention plzip and tarlz as alternatives. + * lzip.texi: Several fixes and improvements. + * testsuite: Add 9 new test files. + +2019-01-03 Antonio Diaz Diaz <antonio@gnu.org> + + * Version 1.21 released. + * Rename File_* to Lzip_*. + * lzip.h (Lzip_trailer): New function 'verify_consistency'. + * lzip_index.cc: Detect some kinds of corrupt trailers. + * main.cc (main): Check return value of close( infd ). + * main.cc: Compile on DOS with DJGPP. + * Fix a GCC warning about catching std::bad_alloc by value. + * lzip.texi: Improve description of '-0..-9', '-m', and '-s'. + * configure: Accept appending to CXXFLAGS; 'CXXFLAGS+=OPTIONS'. + * INSTALL: Document use of CXXFLAGS+='-D __USE_MINGW_ANSI_STDIO'. + +2018-02-11 Antonio Diaz Diaz <antonio@gnu.org> + + * Version 1.20 released. + * New option '--loose-trailing'. + * Improve corrupt header detection to HD=3. + * main.cc: Show corrupt or truncated header in multimember file. + * main.cc (main): Option '-S, --volume-size' now keeps input files. + * encoder_base.*: Adjust dictionary size for each member. + * Replace 'bits/byte' with inverse compression ratio in output. + * Show progress of decompression at verbosity level 2 (-vv). + * Show progress of (de)compression only if stderr is a terminal. + * main.cc: Show final diagnostic when testing multiple files. + * main.cc: Do not add a second extension '.lz' to the arg of '-o'. + * decoder.cc (verify_trailer): Show stored sizes also in hex. + Show dictionary size at verbosity level 4 (-vvvv). + * lzip.texi: New chapter 'Meaning of lzip's output'. + +2017-04-13 Antonio Diaz Diaz <antonio@gnu.org> + + * Version 1.19 released. + * The option '-l, --list' has been ported from lziprecover. + * Don't allow mixing different operations (-d, -l or -t). + * Compression time of option '-0' has been slightly reduced. + * Decompression time has been reduced by 2%. + * main.cc: Continue testing if any input file is a terminal. + * main.cc: Show trailing data in both hexadecimal and ASCII. + * encoder.cc (Matchfinder_base): Verify the size passed to new. + * lzip_index.cc: Improve detection of bad dict and trailing data. + * lzip.h: Unify messages for bad magic, trailing data, etc. + +2016-05-14 Antonio Diaz Diaz <antonio@gnu.org> + + * Version 1.18 released. + * New option '-a, --trailing-error'. + * Decompression time has been reduced by 2%. + * decoder.cc (verify_trailer): Remove test of final code. + * main.cc (main): Delete '--output' file if infd is a terminal. + * main.cc (main): Don't use stdin more than once. + * Remove decompression support for version 0 files. + * lzip.texi: New chapter 'Trailing data'. + * configure: Avoid warning on some shells when testing for g++. + * Makefile.in: Detect the existence of install-info. + * check.sh: A POSIX shell is required to run the tests. + * check.sh: Don't check error messages. + +2015-07-12 Antonio Diaz Diaz <antonio@gnu.org> + + * Version 1.17 released. + * Reorganization of the compression code. + * lzip.texi: New chapter 'Quality assurance'. + * Makefile.in: New targets 'install*-compress'. + +2014-08-26 Antonio Diaz Diaz <antonio@gnu.org> + + * Version 1.16 released. + * Compression ratio of option '-9' has been slightly increased. + * Compression time has been reduced by 4%. + * Compression time of option '-0' has been reduced by 2%. + * main.cc (close_and_set_permissions): Behave like 'cp -p'. + * Minor improvements. + * lzip.texinfo: Rename to lzip.texi. + * Change license to GPL version 2 or later. + +2013-09-20 Antonio Diaz Diaz <antonio@gnu.org> + + * Version 1.15 released. + * Show progress of compression at verbosity level 2 (-vv). + * main.cc (show_header): Don't show header version. + * Ignore option '-n, --threads' for compatibility with plzip. + * configure: Options now accept a separate argument. + * lzip.texinfo: New chapter 'Stream format' and appendix + 'Reference source code'. + +2013-02-17 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.14 released. + * Multi-step trials have been implemented. + * Compression ratio has been slightly increased. + * Compression time has been reduced by 5%. + * Decompression time has been reduced by 12%. + * Makefile.in: New target 'install-bin'. + * main.cc: Use 'setmode' instead of '_setmode' on Windows and OS/2. + * main.cc: Define 'strtoull' to 'std::strtoul' on Windows. + +2012-02-24 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.13 released. + * Lziprecover has been moved to its own package. + * main.cc (close_and_set_permissions): Inability to change output + file attributes has been downgraded from error to warning. + * Compression time of option '-0' has been reduced by 2%. + * Reorganization of the compression code. + * Small change in '--help' output and man page. + * Change quote characters in messages as advised by GNU Standards. + * configure: Rename 'datadir' to 'datarootdir'. + * 'unzcrash.cc' has been moved to package 'lziprecover'. + +2011-04-30 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.12 released. + * New option '-F, --recompress'. + * encoder.h (update_prices): Update high length symbol prices + independently of the value of 'pos_state'. This gives better + compression for large values of '--match-length' without being + slower. + * encoder.h, encoder.cc: Optimize pair price calculations, reducing + compression time for large values of '--match-length' by up to 6%. + * Compression time of option '-0' has been reduced by 2%. + * main.cc (decompress): Print only one status line for each + multimember file when only one '-v' is specified. + * main.cc (decompress): Print up to 6 bytes of trailing data + when '-vvvv' is specified. + * main.cc (open_instream): Don't show the message + " and '--stdout' was not specified" for directories, etc. + * lziprecover.cc: If '-v' is not specified show errors only. + * unzcrash.cc: Use Arg_parser. + * unzcrash.cc: New options '-b, --bits', '-p, --position', and + '-s, --size'. + +2010-09-16 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.11 released. + * New option '-0', which produces a compression speed and ratio + comparable to those of 'gzip -9'. + * fast_encoder.h, fast_encoder.cc: New files. + * main.cc: Match length limit set by options -1 to -8 has been + reduced to extend range of use towards gzip. Lower numbers now + compress less but faster. (-1 now takes 43% less time for only 20% + larger compressed size). + Exit with status 1 if any output file exists and is skipped. + * Compression ratio of option '-9' has been slightly increased. + * lziprecover.cc: New option '-m, --merge', which tries to produce a + correct file by merging the good parts of two or more damaged copies. + * lziprecover.cc: New option '-R, --repair' for repairing a + 1-byte error in single-member files. + * decoder.cc (decode_member): Detect file errors earlier to improve + efficiency of lziprecover's new repair capability. + This change also prevents (harmless) access to uninitialized + memory when decompressing a corrupt file. + * lziprecover.cc: New options '-f, --force' and '-o, --output'. + * lziprecover.cc: New option '-s, --split' to select the until + now only operation of splitting multimember files. + * lziprecover.cc: If no operation is specified, warn the user and do + nothing. + * main.cc: Fix warning about fchown's return value being ignored. + * decoder.cc: '-tvvvv' now also shows compression ratio. + * main.cc: Set stdin/stdout in binary mode on MSVC and OS2. + * lzip.texinfo: New examples. + * testsuite: Rename 'test1' to 'test.txt'. New tests. + * Matchfinder types HC4 (4 bytes hash-chain) and HT4 (4 bytes + hash-table) have been tested and found no better than the current + BT4. + +2010-04-05 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.10 released. + * decoder.h: Input_buffer integrated in Range_decoder. + * main.cc: File specified with option '-o' is now created with mode + 0666 if umask allows it, deleted if interrupted by user. + * main.cc: New constant 'o_binary'. + * main.cc: Dictionary size for options -2, -3, -4 and -8 has been + changed to improve linearity of compressed sizes. + * lzip.h: Fix warnings produced by over-optimization (-O3). + * Makefile.in: Add quotes to directory names. + +2010-01-17 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.9 released. + * main.cc (main): Return at least 1 if closing stdout fails. + * Makefile.in: Add option '--name' to help2man invocation. + * check.sh: Use 'test1' instead of 'COPYING' for testing. + +2009-09-02 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.8 released. + * Compression time has been reduced by 4%. + * Lzdiff and lzgrep have been moved to the new package zutils. + * Fix warnings on systems where uint32_t != unsigned int. + +2009-06-25 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.7 released. + * decoder.h (copy_block): Fix memcpy overlap introduced in 1.6. + +2009-06-22 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.6 released. + * Decompression time has been reduced by 17%. + * Add decompression support for Sync Flush marker. + * Add support for the extension '.tbz' to lzdiff and lzgrep. + * Add man pages for lzdiff, lzgrep and lziprecover. + * encoder.cc (Matchfinder): Reduce memory use to 9x if input file is + smaller than dictionary size limit. + * decoder.cc: Add extra flush calls to improve partial decompression + of corrupt files. + * '--test' no longer needs '/dev/null'. + * Remove some 'bashisms' from lzdiff and lzgrep. + * Dictionary size for options '-1' to '-4' has been changed. + * main.cc (signal_handler): Declare as 'extern "C"'. + * Makefile.in: Extra files are now installed by default. + * check.sh: Test lziprecover. + * Add 'export LC_ALL=C' to all scripts. + +2009-04-12 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.5 released. + * lzip.h: Implement coded dictionary size in Lzip_header. + * Fix some includes that prevented compilation with GCC 4.4. + * 'member_size' and 'volume_size' are now accurate limits. + * Compression speed has been improved. + * Implement bt4 type matchfinder. + * lzip.texinfo: New chapter 'Algorithm'. + * Lzdiff and lzgrep now accept '-h' for '--help' and + '-V' for '--version'. + * Makefile.in: Man page is now installed by default. + * check.sh: Verify that files are opened in binary mode. + +2009-01-24 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.4 released. + * Implement compression of version 1 files. + * New options '-b, --member-size', '-S, --volume-size', and + '-o, --output'. + * main.cc: Read from non-regular files if '--stdout' is specified. + * Add 'lziprecover', a member recoverer program. + * unzcrash.cc: Test all 1-byte errors. + +2008-12-21 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.3 released. + * This version automatically chooses the smallest possible + dictionary size for each file during compression, saving memory + during decompression. + * Implement decompression of version 1 files. + * check.sh: Replace 'diff -q' with 'cmp'. + +2008-12-10 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.2 released. + * encoder.cc: A 1-byte read outside allocated memory has been fixed. + * lzip.h: Dictionary size limit has been reduced to 512MiB because + setting it to 1GiB causes overflow of a 32 bit integer. + * Add 'lzdiff', a diff/cmp wrapper for gzip, bzip2, lzip and + non-compressed files. + * Add 'lzgrep', a grep wrapper for gzip, bzip2, lzip and + non-compressed files. + * 'make install-info' should now work on Debian and OS X. + +2008-11-17 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.1 released. + * Change short name of option '--dictionary-size' to '-s'. + * Change short name of option '--match-length' to '-m'. + * Change LONG_LONG_MAX to LLONG_MAX. + +2008-10-14 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 1.0 released. + * '-tvv' shows file version and dictionary size. + +2008-09-30 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 0.5 released. + * Decompression is now 1% faster. + +2008-09-23 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 0.4 released. + * Code cleanup for global variable 'verbosity'. + * Regain the compression ratio of 0.2 with 5% faster speed. + * lzip.h: Fix compilation on systems where size_t != unsigned int. + +2008-09-15 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 0.3 released. + * encoder.cc: Compression is now 15% faster, 1% worse. + * main.cc (main): Make option '-t' override '-c'. + * main.cc (decompress): Show 'done' instead of 'ok' when not testing. + * encoder.h: Use trials[] to return the list of pairs. + +2008-09-09 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 0.2 released. + * encoder.cc: Small improvements in compression speed. + * Small documentation changes. + +2008-08-20 Antonio Diaz Diaz <ant_diaz@teleline.es> + + * Version 0.1 released. + + +Copyright (C) 2008-2022 Antonio Diaz Diaz. + +This file is a collection of facts, and thus it is not copyrightable, +but just in case, you have unlimited permission to copy, distribute, and +modify it. @@ -0,0 +1,81 @@ +Requirements +------------ +You will need a C++98 compiler with suport for 'long long'. +(gcc 3.3.6 or newer is recommended). +I use gcc 6.1.0 and 3.3.6, but the code should compile with any standards +compliant compiler. +Gcc is available at http://gcc.gnu.org. + +The operating system must allow signal handlers read access to objects with +static storage duration so that the cleanup handler for Control-C can delete +the partial output file. + + +Procedure +--------- +1. Unpack the archive if you have not done so already: + + tar -xf lzip[version].tar.lz +or + lzip -cd lzip[version].tar.lz | tar -xf - + +This creates the directory ./lzip[version] containing the source from +the main archive. + +2. Change to lzip directory and run configure. + (Try 'configure --help' for usage instructions). + + cd lzip[version] + ./configure + + If you are compiling on MinGW, use: + + ./configure CXXFLAGS+='-D __USE_MINGW_ANSI_STDIO' + +3. Run make. + + make + +4. Optionally, type 'make check' to run the tests that come with lzip. + +5. Type 'make install' to install the program and any data files and + documentation. + + Or type 'make install-compress', which additionally compresses the + info manual and the man page after installation. + (Installing compressed docs may become the default in the future). + + You can install only the program, the info manual, or the man page by + typing 'make install-bin', 'make install-info', or 'make install-man' + respectively. + + +Another way +----------- +You can also compile lzip into a separate directory. +To do this, you must use a version of 'make' that supports the variable +'VPATH', such as GNU 'make'. 'cd' to the directory where you want the +object files and executables to go and run the 'configure' script. +'configure' automatically checks for the source code in '.', in '..', and +in the directory that 'configure' is in. + +'configure' recognizes the option '--srcdir=DIR' to control where to +look for the sources. Usually 'configure' can determine that directory +automatically. + +After running 'configure', you can run 'make' and 'make install' as +explained above. + + +Building without 'make' +----------------------- +If you need to build lzip on a system lacking a 'make' program, you can use +'configure' to build, check, and install the lzip executable like this: + + ./configure --build --check --installdir=/usr/local/bin + + +Copyright (C) 2008-2022 Antonio Diaz Diaz. + +This file is free documentation: you have unlimited permission to copy, +distribute, and modify it. diff --git a/Makefile.in b/Makefile.in new file mode 100644 index 0000000..d07ad5a --- /dev/null +++ b/Makefile.in @@ -0,0 +1,134 @@ + +DISTNAME = $(pkgname)-$(pkgversion) +INSTALL = install +INSTALL_PROGRAM = $(INSTALL) -m 755 +INSTALL_DATA = $(INSTALL) -m 644 +INSTALL_DIR = $(INSTALL) -d -m 755 +SHELL = /bin/sh +CAN_RUN_INSTALLINFO = $(SHELL) -c "install-info --version" > /dev/null 2>&1 + +objs = arg_parser.o lzip_index.o list.o encoder_base.o encoder.o \ + fast_encoder.o decoder.o main.o + + +.PHONY : all install install-bin install-info install-man \ + install-strip install-compress install-strip-compress \ + install-bin-strip install-info-compress install-man-compress \ + uninstall uninstall-bin uninstall-info uninstall-man \ + doc info man check dist clean distclean + +all : $(progname) + +$(progname) : $(objs) + $(CXX) $(CXXFLAGS) $(LDFLAGS) -o $@ $(objs) + +main.o : main.cc + $(CXX) $(CPPFLAGS) $(CXXFLAGS) -DPROGVERSION=\"$(pkgversion)\" -c -o $@ $< + +%.o : %.cc + $(CXX) $(CPPFLAGS) $(CXXFLAGS) -c -o $@ $< + +$(objs) : Makefile +arg_parser.o : arg_parser.h +decoder.o : lzip.h decoder.h +encoder_base.o : lzip.h encoder_base.h +encoder.o : lzip.h encoder_base.h encoder.h +fast_encoder.o : lzip.h encoder_base.h fast_encoder.h +list.o : lzip.h lzip_index.h +lzip_index.o : lzip.h lzip_index.h +main.o : arg_parser.h lzip.h decoder.h encoder_base.h encoder.h fast_encoder.h + + +doc : info man + +info : $(VPATH)/doc/$(pkgname).info + +$(VPATH)/doc/$(pkgname).info : $(VPATH)/doc/$(pkgname).texi + cd $(VPATH)/doc && makeinfo $(pkgname).texi + +man : $(VPATH)/doc/$(progname).1 + +$(VPATH)/doc/$(progname).1 : $(progname) + help2man -n 'reduces the size of files' -o $@ ./$(progname) + +Makefile : $(VPATH)/configure $(VPATH)/Makefile.in + ./config.status + +check : all + @$(VPATH)/testsuite/check.sh $(VPATH)/testsuite $(pkgversion) + +install : install-bin install-info install-man +install-strip : install-bin-strip install-info install-man +install-compress : install-bin install-info-compress install-man-compress +install-strip-compress : install-bin-strip install-info-compress install-man-compress + +install-bin : all + if [ ! -d "$(DESTDIR)$(bindir)" ] ; then $(INSTALL_DIR) "$(DESTDIR)$(bindir)" ; fi + $(INSTALL_PROGRAM) ./$(progname) "$(DESTDIR)$(bindir)/$(progname)" + +install-bin-strip : all + $(MAKE) INSTALL_PROGRAM='$(INSTALL_PROGRAM) -s' install-bin + +install-info : + if [ ! -d "$(DESTDIR)$(infodir)" ] ; then $(INSTALL_DIR) "$(DESTDIR)$(infodir)" ; fi + -rm -f "$(DESTDIR)$(infodir)/$(pkgname).info"* + $(INSTALL_DATA) $(VPATH)/doc/$(pkgname).info "$(DESTDIR)$(infodir)/$(pkgname).info" + -if $(CAN_RUN_INSTALLINFO) ; then \ + install-info --info-dir="$(DESTDIR)$(infodir)" "$(DESTDIR)$(infodir)/$(pkgname).info" ; \ + fi + +install-info-compress : install-info + lzip -v -9 "$(DESTDIR)$(infodir)/$(pkgname).info" + +install-man : + if [ ! -d "$(DESTDIR)$(mandir)/man1" ] ; then $(INSTALL_DIR) "$(DESTDIR)$(mandir)/man1" ; fi + -rm -f "$(DESTDIR)$(mandir)/man1/$(progname).1"* + $(INSTALL_DATA) $(VPATH)/doc/$(progname).1 "$(DESTDIR)$(mandir)/man1/$(progname).1" + +install-man-compress : install-man + lzip -v -9 "$(DESTDIR)$(mandir)/man1/$(progname).1" + +uninstall : uninstall-man uninstall-info uninstall-bin + +uninstall-bin : + -rm -f "$(DESTDIR)$(bindir)/$(progname)" + +uninstall-info : + -if $(CAN_RUN_INSTALLINFO) ; then \ + install-info --info-dir="$(DESTDIR)$(infodir)" --remove "$(DESTDIR)$(infodir)/$(pkgname).info" ; \ + fi + -rm -f "$(DESTDIR)$(infodir)/$(pkgname).info"* + +uninstall-man : + -rm -f "$(DESTDIR)$(mandir)/man1/$(progname).1"* + +dist : doc + ln -sf $(VPATH) $(DISTNAME) + tar -Hustar --owner=root --group=root -cvf $(DISTNAME).tar \ + $(DISTNAME)/AUTHORS \ + $(DISTNAME)/COPYING \ + $(DISTNAME)/ChangeLog \ + $(DISTNAME)/INSTALL \ + $(DISTNAME)/Makefile.in \ + $(DISTNAME)/NEWS \ + $(DISTNAME)/README \ + $(DISTNAME)/configure \ + $(DISTNAME)/doc/$(progname).1 \ + $(DISTNAME)/doc/$(pkgname).info \ + $(DISTNAME)/doc/$(pkgname).texi \ + $(DISTNAME)/*.h \ + $(DISTNAME)/*.cc \ + $(DISTNAME)/testsuite/check.sh \ + $(DISTNAME)/testsuite/test.txt \ + $(DISTNAME)/testsuite/fox.lz \ + $(DISTNAME)/testsuite/fox_*.lz \ + $(DISTNAME)/testsuite/test.txt.lz \ + $(DISTNAME)/testsuite/test_em.txt.lz + rm -f $(DISTNAME) + lzip -v -9 $(DISTNAME).tar + +clean : + -rm -f $(progname) $(objs) + +distclean : clean + -rm -f Makefile config.status *.tar *.tar.lz @@ -0,0 +1,11 @@ +Changes in version 1.23: + +Decompression time has been reduced by 5-12% depending on the file. + +In case of error in a numerical argument to a command line option, lzip +now shows the name of the option and the range of valid values. + +Several descriptions have been improved in manual, '--help', and man page. + +The texinfo category of the manual has been changed from 'Data Compression' +to 'Compression' to match that of gzip. (Reported by Alfred M. Szmidt). @@ -0,0 +1,135 @@ +Description + +Lzip is a lossless data compressor with a user interface similar to the one +of gzip or bzip2. Lzip uses a simplified form of the 'Lempel-Ziv-Markov +chain-Algorithm' (LZMA) stream format and provides a 3 factor integrity +checking to maximize interoperability and optimize safety. Lzip can compress +about as fast as gzip (lzip -0) or compress most files more than bzip2 +(lzip -9). Decompression speed is intermediate between gzip and bzip2. +Lzip is better than gzip and bzip2 from a data recovery perspective. Lzip +has been designed, written, and tested with great care to replace gzip and +bzip2 as the standard general-purpose compressed format for unix-like +systems. + +For compressing/decompressing large files on multiprocessor machines plzip +can be much faster than lzip at the cost of a slightly reduced compression +ratio. + +For creation and manipulation of compressed tar archives tarlz can be more +efficient than using tar and plzip because tarlz is able to keep the +alignment between tar members and lzip members. + +The lzip file format is designed for data sharing and long-term archiving, +taking into account both data integrity and decoder availability: + + * The lzip format provides very safe integrity checking and some data + recovery means. The program lziprecover can repair bit flip errors + (one of the most common forms of data corruption) in lzip files, and + provides data recovery capabilities, including error-checked merging + of damaged copies of a file. + + * The lzip format is as simple as possible (but not simpler). The lzip + manual provides the source code of a simple decompressor along with a + detailed explanation of how it works, so that with the only help of the + lzip manual it would be possible for a digital archaeologist to extract + the data from a lzip file long after quantum computers eventually + render LZMA obsolete. + + * Additionally the lzip reference implementation is copylefted, which + guarantees that it will remain free forever. + +A nice feature of the lzip format is that a corrupt byte is easier to repair +the nearer it is from the beginning of the file. Therefore, with the help of +lziprecover, losing an entire archive just because of a corrupt byte near +the beginning is a thing of the past. + +Lzip uses the same well-defined exit status values used by bzip2, which +makes it safer than compressors returning ambiguous warning values (like +gzip) when it is used as a back end for other programs like tar or zutils. + +Lzip will automatically use for each file the largest dictionary size that +does not exceed neither the file size nor the limit given. Keep in mind that +the decompression memory requirement is affected at compression time by the +choice of dictionary size limit. + +The amount of memory required for compression is about 1 or 2 times the +dictionary size limit (1 if input file size is less than dictionary size +limit, else 2) plus 9 times the dictionary size really used. The option '-0' +is special and only requires about 1.5 MiB at most. The amount of memory +required for decompression is about 46 kB larger than the dictionary size +really used. + +When compressing, lzip replaces every file given in the command line +with a compressed version of itself, with the name "original_name.lz". +When decompressing, lzip attempts to guess the name for the decompressed +file from that of the compressed file as follows: + +filename.lz becomes filename +filename.tlz becomes filename.tar +anyothername becomes anyothername.out + +(De)compressing a file is much like copying or moving it. Therefore lzip +preserves the access and modification dates, permissions, and, when +possible, ownership of the file just as 'cp -p' does. (If the user ID or +the group ID can't be duplicated, the file permission bits S_ISUID and +S_ISGID are cleared). + +Lzip is able to read from some types of non-regular files if either the +option '-c' or the option '-o' is specified. + +If no file names are specified, lzip compresses (or decompresses) from +standard input to standard output. Lzip will refuse to read compressed data +from a terminal or write compressed data to a terminal, as this would be +entirely incomprehensible and might leave the terminal in an abnormal state. + +Lzip will correctly decompress a file which is the concatenation of two or +more compressed files. The result is the concatenation of the corresponding +decompressed files. Integrity testing of concatenated compressed files is +also supported. + +Lzip can produce multimember files, and lziprecover can safely recover the +undamaged members in case of file damage. Lzip can also split the compressed +output in volumes of a given size, even when reading from standard input. +This allows the direct creation of multivolume compressed tar archives. + +Lzip is able to compress and decompress streams of unlimited size by +automatically creating multimember output. The members so created are large, +about 2 PiB each. + +In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a +concrete algorithm; it is more like "any algorithm using the LZMA coding +scheme". For example, the option '-0' of lzip uses the scheme in almost the +simplest way possible; issuing the longest match it can find, or a literal +byte if it can't find a match. Inversely, a much more elaborated way of +finding coding sequences of minimum size than the one currently used by lzip +could be developed, and the resulting sequence could also be coded using the +LZMA coding scheme. + +Lzip currently implements two variants of the LZMA algorithm: fast +(used by option '-0') and normal (used by all other compression levels). + +The high compression of LZMA comes from combining two basic, well-proven +compression ideas: sliding dictionaries (LZ77/78) and markov models (the +thing used by every compression algorithm that uses a range encoder or +similar order-0 entropy coder as its last stage) with segregation of +contexts according to what the bits are used for. + +The ideas embodied in lzip are due to (at least) the following people: +Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for the +definition of Markov chains), G.N.N. Martin (for the definition of range +encoding), Igor Pavlov (for putting all the above together in LZMA), and +Julian Seward (for bzip2's CLI). + +LANGUAGE NOTE: Uncompressed = not compressed = plain data; it may never have +been compressed. Decompressed is used to refer to data which have undergone +the process of decompression. + + +Copyright (C) 2008-2022 Antonio Diaz Diaz. + +This file is free documentation: you have unlimited permission to copy, +distribute, and modify it. + +The file Makefile.in is a data file used by configure to produce the +Makefile. It has the same copyright owner and permissions that configure +itself. diff --git a/arg_parser.cc b/arg_parser.cc new file mode 100644 index 0000000..59998ac --- /dev/null +++ b/arg_parser.cc @@ -0,0 +1,197 @@ +/* Arg_parser - POSIX/GNU command line argument parser. (C++ version) + Copyright (C) 2006-2022 Antonio Diaz Diaz. + + This library is free software. Redistribution and use in source and + binary forms, with or without modification, are permitted provided + that the following conditions are met: + + 1. Redistributions of source code must retain the above copyright + notice, this list of conditions, and the following disclaimer. + + 2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions, and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. +*/ + +#include <cstring> +#include <string> +#include <vector> + +#include "arg_parser.h" + + +bool Arg_parser::parse_long_option( const char * const opt, const char * const arg, + const Option options[], int & argind ) + { + unsigned len; + int index = -1; + bool exact = false, ambig = false; + + for( len = 0; opt[len+2] && opt[len+2] != '='; ++len ) ; + + // Test all long options for either exact match or abbreviated matches. + for( int i = 0; options[i].code != 0; ++i ) + if( options[i].long_name && + std::strncmp( options[i].long_name, &opt[2], len ) == 0 ) + { + if( std::strlen( options[i].long_name ) == len ) // Exact match found + { index = i; exact = true; break; } + else if( index < 0 ) index = i; // First nonexact match found + else if( options[index].code != options[i].code || + options[index].has_arg != options[i].has_arg ) + ambig = true; // Second or later nonexact match found + } + + if( ambig && !exact ) + { + error_ = "option '"; error_ += opt; error_ += "' is ambiguous"; + return false; + } + + if( index < 0 ) // nothing found + { + error_ = "unrecognized option '"; error_ += opt; error_ += '\''; + return false; + } + + ++argind; + data.push_back( Record( options[index].code, options[index].long_name ) ); + + if( opt[len+2] ) // '--<long_option>=<argument>' syntax + { + if( options[index].has_arg == no ) + { + error_ = "option '--"; error_ += options[index].long_name; + error_ += "' doesn't allow an argument"; + return false; + } + if( options[index].has_arg == yes && !opt[len+3] ) + { + error_ = "option '--"; error_ += options[index].long_name; + error_ += "' requires an argument"; + return false; + } + data.back().argument = &opt[len+3]; + return true; + } + + if( options[index].has_arg == yes ) + { + if( !arg || !arg[0] ) + { + error_ = "option '--"; error_ += options[index].long_name; + error_ += "' requires an argument"; + return false; + } + ++argind; data.back().argument = arg; + return true; + } + + return true; + } + + +bool Arg_parser::parse_short_option( const char * const opt, const char * const arg, + const Option options[], int & argind ) + { + int cind = 1; // character index in opt + + while( cind > 0 ) + { + int index = -1; + const unsigned char c = opt[cind]; + + if( c != 0 ) + for( int i = 0; options[i].code; ++i ) + if( c == options[i].code ) + { index = i; break; } + + if( index < 0 ) + { + error_ = "invalid option -- '"; error_ += c; error_ += '\''; + return false; + } + + data.push_back( Record( c ) ); + if( opt[++cind] == 0 ) { ++argind; cind = 0; } // opt finished + + if( options[index].has_arg != no && cind > 0 && opt[cind] ) + { + data.back().argument = &opt[cind]; ++argind; cind = 0; + } + else if( options[index].has_arg == yes ) + { + if( !arg || !arg[0] ) + { + error_ = "option requires an argument -- '"; error_ += c; + error_ += '\''; + return false; + } + data.back().argument = arg; ++argind; cind = 0; + } + } + return true; + } + + +Arg_parser::Arg_parser( const int argc, const char * const argv[], + const Option options[], const bool in_order ) + { + if( argc < 2 || !argv || !options ) return; + + std::vector< const char * > non_options; // skipped non-options + int argind = 1; // index in argv + + while( argind < argc ) + { + const unsigned char ch1 = argv[argind][0]; + const unsigned char ch2 = ch1 ? argv[argind][1] : 0; + + if( ch1 == '-' && ch2 ) // we found an option + { + const char * const opt = argv[argind]; + const char * const arg = ( argind + 1 < argc ) ? argv[argind+1] : 0; + if( ch2 == '-' ) + { + if( !argv[argind][2] ) { ++argind; break; } // we found "--" + else if( !parse_long_option( opt, arg, options, argind ) ) break; + } + else if( !parse_short_option( opt, arg, options, argind ) ) break; + } + else + { + if( in_order ) data.push_back( Record( argv[argind++] ) ); + else non_options.push_back( argv[argind++] ); + } + } + if( !error_.empty() ) data.clear(); + else + { + for( unsigned i = 0; i < non_options.size(); ++i ) + data.push_back( Record( non_options[i] ) ); + while( argind < argc ) + data.push_back( Record( argv[argind++] ) ); + } + } + + +Arg_parser::Arg_parser( const char * const opt, const char * const arg, + const Option options[] ) + { + if( !opt || !opt[0] || !options ) return; + + if( opt[0] == '-' && opt[1] ) // we found an option + { + int argind = 1; // dummy + if( opt[1] == '-' ) + { if( opt[2] ) parse_long_option( opt, arg, options, argind ); } + else + parse_short_option( opt, arg, options, argind ); + if( !error_.empty() ) data.clear(); + } + else data.push_back( Record( opt ) ); + } diff --git a/arg_parser.h b/arg_parser.h new file mode 100644 index 0000000..e854838 --- /dev/null +++ b/arg_parser.h @@ -0,0 +1,110 @@ +/* Arg_parser - POSIX/GNU command line argument parser. (C++ version) + Copyright (C) 2006-2022 Antonio Diaz Diaz. + + This library is free software. Redistribution and use in source and + binary forms, with or without modification, are permitted provided + that the following conditions are met: + + 1. Redistributions of source code must retain the above copyright + notice, this list of conditions, and the following disclaimer. + + 2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions, and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + This library is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. +*/ + +/* Arg_parser reads the arguments in 'argv' and creates a number of + option codes, option arguments, and non-option arguments. + + In case of error, 'error' returns a non-empty error message. + + 'options' is an array of 'struct Option' terminated by an element + containing a code which is zero. A null long_name means a short-only + option. A code value outside the unsigned char range means a long-only + option. + + Arg_parser normally makes it appear as if all the option arguments + were specified before all the non-option arguments for the purposes + of parsing, even if the user of your program intermixed option and + non-option arguments. If you want the arguments in the exact order + the user typed them, call 'Arg_parser' with 'in_order' = true. + + The argument '--' terminates all options; any following arguments are + treated as non-option arguments, even if they begin with a hyphen. + + The syntax for optional option arguments is '-<short_option><argument>' + (without whitespace), or '--<long_option>=<argument>'. +*/ + +class Arg_parser + { +public: + enum Has_arg { no, yes, maybe }; + + struct Option + { + int code; // Short option letter or code ( code != 0 ) + const char * long_name; // Long option name (maybe null) + Has_arg has_arg; + }; + +private: + struct Record + { + int code; + std::string parsed_name; + std::string argument; + explicit Record( const unsigned char c ) + : code( c ), parsed_name( "-" ) { parsed_name += c; } + Record( const int c, const char * const long_name ) + : code( c ), parsed_name( "--" ) { parsed_name += long_name; } + explicit Record( const char * const arg ) : code( 0 ), argument( arg ) {} + }; + + const std::string empty_arg; + std::string error_; + std::vector< Record > data; + + bool parse_long_option( const char * const opt, const char * const arg, + const Option options[], int & argind ); + bool parse_short_option( const char * const opt, const char * const arg, + const Option options[], int & argind ); + +public: + Arg_parser( const int argc, const char * const argv[], + const Option options[], const bool in_order = false ); + + // Restricted constructor. Parses a single token and argument (if any). + Arg_parser( const char * const opt, const char * const arg, + const Option options[] ); + + const std::string & error() const { return error_; } + + // The number of arguments parsed. May be different from argc. + int arguments() const { return data.size(); } + + /* If code( i ) is 0, argument( i ) is a non-option. + Else argument( i ) is the option's argument (or empty). */ + int code( const int i ) const + { + if( i >= 0 && i < arguments() ) return data[i].code; + else return 0; + } + + // Full name of the option parsed (short or long). + const std::string & parsed_name( const int i ) const + { + if( i >= 0 && i < arguments() ) return data[i].parsed_name; + else return empty_arg; + } + + const std::string & argument( const int i ) const + { + if( i >= 0 && i < arguments() ) return data[i].argument; + else return empty_arg; + } + }; diff --git a/configure b/configure new file mode 100755 index 0000000..0ea474c --- /dev/null +++ b/configure @@ -0,0 +1,226 @@ +#! /bin/sh +# configure script for Lzip - LZMA lossless data compressor +# Copyright (C) 2008-2022 Antonio Diaz Diaz. +# +# This configure script is free software: you have unlimited permission +# to copy, distribute, and modify it. + +pkgname=lzip +pkgversion=1.23 +progname=lzip +srctrigger=doc/${pkgname}.texi + +# clear some things potentially inherited from environment. +LC_ALL=C +export LC_ALL +srcdir= +prefix=/usr/local +exec_prefix='$(prefix)' +bindir='$(exec_prefix)/bin' +datarootdir='$(prefix)/share' +infodir='$(datarootdir)/info' +mandir='$(datarootdir)/man' +build=no +check=no +installdir= +CXX=g++ +CPPFLAGS= +CXXFLAGS='-Wall -W -O2' +LDFLAGS= + +# checking whether we are using GNU C++. +/bin/sh -c "${CXX} --version" > /dev/null 2>&1 || { CXX=c++ ; CXXFLAGS=-O2 ; } + +# Loop over all args +args= +no_create= +while [ $# != 0 ] ; do + + # Get the first arg, and shuffle + option=$1 ; arg2=no + shift + + # Add the argument quoted to args + if [ -z "${args}" ] ; then args="\"${option}\"" + else args="${args} \"${option}\"" ; fi + + # Split out the argument for options that take them + case ${option} in + *=*) optarg=`echo "${option}" | sed -e 's,^[^=]*=,,;s,/$,,'` ;; + esac + + # Process the options + case ${option} in + --help | -h) + echo "Usage: $0 [OPTION]... [VAR=VALUE]..." + echo + echo "To assign makefile variables (e.g., CXX, CXXFLAGS...), specify them as" + echo "arguments to configure in the form VAR=VALUE." + echo + echo "Options and variables: [defaults in brackets]" + echo " -h, --help display this help and exit" + echo " -V, --version output version information and exit" + echo " --srcdir=DIR find the sources in DIR [. or ..]" + echo " --prefix=DIR install into DIR [${prefix}]" + echo " --exec-prefix=DIR base directory for arch-dependent files [${exec_prefix}]" + echo " --bindir=DIR user executables directory [${bindir}]" + echo " --datarootdir=DIR base directory for doc and data [${datarootdir}]" + echo " --infodir=DIR info files directory [${infodir}]" + echo " --mandir=DIR man pages directory [${mandir}]" + echo " --build build in one step without using 'make'" + echo " --check check without using 'make', implies --build" + echo " --installdir=BINDIR install without using 'make', implies --build" + echo " CXX=COMPILER C++ compiler to use [${CXX}]" + echo " CPPFLAGS=OPTIONS command line options for the preprocessor [${CPPFLAGS}]" + echo " CXXFLAGS=OPTIONS command line options for the C++ compiler [${CXXFLAGS}]" + echo " CXXFLAGS+=OPTIONS append options to the current value of CXXFLAGS" + echo " LDFLAGS=OPTIONS command line options for the linker [${LDFLAGS}]" + echo + exit 0 ;; + --version | -V) + echo "Configure script for ${pkgname} version ${pkgversion}" + exit 0 ;; + --srcdir) srcdir=$1 ; arg2=yes ;; + --prefix) prefix=$1 ; arg2=yes ;; + --exec-prefix) exec_prefix=$1 ; arg2=yes ;; + --bindir) bindir=$1 ; arg2=yes ;; + --datarootdir) datarootdir=$1 ; arg2=yes ;; + --infodir) infodir=$1 ; arg2=yes ;; + --mandir) mandir=$1 ; arg2=yes ;; + --installdir) installdir=$1 ; arg2=yes ;; + + --srcdir=*) srcdir=${optarg} ;; + --prefix=*) prefix=${optarg} ;; + --exec-prefix=*) exec_prefix=${optarg} ;; + --bindir=*) bindir=${optarg} ;; + --datarootdir=*) datarootdir=${optarg} ;; + --infodir=*) infodir=${optarg} ;; + --mandir=*) mandir=${optarg} ;; + --build) build=yes ;; + --check) check=yes ; build=yes ;; + --installdir=*) installdir=${optarg} ; build=yes ;; + --no-create) no_create=yes ;; + + CXX=*) CXX=${optarg} ;; + CPPFLAGS=*) CPPFLAGS=${optarg} ;; + CXXFLAGS=*) CXXFLAGS=${optarg} ;; + CXXFLAGS+=*) CXXFLAGS="${CXXFLAGS} ${optarg}" ;; + LDFLAGS=*) LDFLAGS=${optarg} ;; + + --*) + echo "configure: WARNING: unrecognized option: '${option}'" 1>&2 ;; + *=* | *-*-*) ;; + *) + echo "configure: unrecognized option: '${option}'" 1>&2 + echo "Try 'configure --help' for more information." 1>&2 + exit 1 ;; + esac + + # Check if the option took a separate argument + if [ "${arg2}" = yes ] ; then + if [ $# != 0 ] ; then args="${args} \"$1\"" ; shift + else echo "configure: Missing argument to '${option}'" 1>&2 + exit 1 + fi + fi +done + +# Find the source files, if location was not specified. +srcdirtext= +if [ -z "${srcdir}" ] ; then + srcdirtext="or . or .." ; srcdir=. + if [ ! -r "${srcdir}/${srctrigger}" ] ; then srcdir=.. ; fi + if [ ! -r "${srcdir}/${srctrigger}" ] ; then + ## the sed command below emulates the dirname command + srcdir=`echo "$0" | sed -e 's,[^/]*$,,;s,/$,,;s,^$,.,'` + fi +fi + +if [ ! -r "${srcdir}/${srctrigger}" ] ; then + echo "configure: Can't find sources in ${srcdir} ${srcdirtext}" 1>&2 + echo "configure: (At least ${srctrigger} is missing)." 1>&2 + exit 1 +fi + +# Set srcdir to . if that's what it is. +if [ "`pwd`" = "`cd "${srcdir}" ; pwd`" ] ; then srcdir=. ; fi + +if [ "${build}" = yes ] ; then + objs=$(sed -e :a -e '/\\$/N; s/\\\n//; ta' "${srcdir}/Makefile.in" | \ + sed -n -e 's/^ *objs *= *//p' | sed -e 's/ \{2,\}/ /g') + for ofile in ${objs} ; do + file="${ofile%.o}.cc" ; pver= + [ "${ofile}" = main.o ] && pver=" -DPROGVERSION=\"${pkgversion}\"" + compile_command="${CXX} ${CPPFLAGS} ${CXXFLAGS}${pver} -c -o ${ofile}" + echo "${compile_command} ${srcdir}/${file}" + ${compile_command} "${srcdir}/${file}" || exit 1 + done + link_command="${CXX} ${LDFLAGS} ${CXXFLAGS} -o ${progname} ${objs}" + echo "${link_command}" ; ${link_command} || exit 1 + if [ "${check}" = yes ] ; then + "${srcdir}/testsuite/check.sh" "${srcdir}/testsuite" ${pkgversion} || exit 1 + fi + if [ -n "${installdir}" ] ; then + echo "installing ${progname} in ${installdir}" + [ -d "${installdir}" ] || mkdir -p "${installdir}" || exit 1 + cp -fp ${progname} "${installdir}/${progname}" || exit 1 + fi + exit 0 +fi + +echo +if [ -z "${no_create}" ] ; then + echo "creating config.status" + rm -f config.status + cat > config.status << EOF +#! /bin/sh +# This file was generated automatically by configure. Don't edit. +# Run this file to recreate the current configuration. +# +# This script is free software: you have unlimited permission +# to copy, distribute, and modify it. + +exec /bin/sh $0 ${args} --no-create +EOF + chmod +x config.status +fi + +echo "creating Makefile" +echo "VPATH = ${srcdir}" +echo "prefix = ${prefix}" +echo "exec_prefix = ${exec_prefix}" +echo "bindir = ${bindir}" +echo "datarootdir = ${datarootdir}" +echo "infodir = ${infodir}" +echo "mandir = ${mandir}" +echo "CXX = ${CXX}" +echo "CPPFLAGS = ${CPPFLAGS}" +echo "CXXFLAGS = ${CXXFLAGS}" +echo "LDFLAGS = ${LDFLAGS}" +rm -f Makefile +cat > Makefile << EOF +# Makefile for Lzip - LZMA lossless data compressor +# Copyright (C) 2008-2022 Antonio Diaz Diaz. +# This file was generated automatically by configure. Don't edit. +# +# This Makefile is free software: you have unlimited permission +# to copy, distribute, and modify it. + +pkgname = ${pkgname} +pkgversion = ${pkgversion} +progname = ${progname} +VPATH = ${srcdir} +prefix = ${prefix} +exec_prefix = ${exec_prefix} +bindir = ${bindir} +datarootdir = ${datarootdir} +infodir = ${infodir} +mandir = ${mandir} +CXX = ${CXX} +CPPFLAGS = ${CPPFLAGS} +CXXFLAGS = ${CXXFLAGS} +LDFLAGS = ${LDFLAGS} +EOF +cat "${srcdir}/Makefile.in" >> Makefile + +echo "OK. Now you can run make." diff --git a/decoder.cc b/decoder.cc new file mode 100644 index 0000000..4cd2819 --- /dev/null +++ b/decoder.cc @@ -0,0 +1,284 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +#define _FILE_OFFSET_BITS 64 + +#include <algorithm> +#include <cerrno> +#include <cstdio> +#include <cstdlib> +#include <cstring> +#include <string> +#include <vector> +#include <stdint.h> +#include <unistd.h> + +#include "lzip.h" +#include "decoder.h" + + +/* Return the number of bytes really read. + If (value returned < size) and (errno == 0), means EOF was reached. +*/ +int readblock( const int fd, uint8_t * const buf, const int size ) + { + int sz = 0; + errno = 0; + while( sz < size ) + { + const int n = read( fd, buf + sz, size - sz ); + if( n > 0 ) sz += n; + else if( n == 0 ) break; // EOF + else if( errno != EINTR ) break; + errno = 0; + } + return sz; + } + + +/* Return the number of bytes really written. + If (value returned < size), it is always an error. +*/ +int writeblock( const int fd, const uint8_t * const buf, const int size ) + { + int sz = 0; + errno = 0; + while( sz < size ) + { + const int n = write( fd, buf + sz, size - sz ); + if( n > 0 ) sz += n; + else if( n < 0 && errno != EINTR ) break; + errno = 0; + } + return sz; + } + + +bool Range_decoder::read_block() + { + if( !at_stream_end ) + { + stream_pos = readblock( infd, buffer, buffer_size ); + if( stream_pos != buffer_size && errno ) throw Error( "Read error" ); + at_stream_end = ( stream_pos < buffer_size ); + partial_member_pos += pos; + pos = 0; + show_dprogress(); + } + return pos < stream_pos; + } + + +void LZ_decoder::flush_data() + { + if( pos > stream_pos ) + { + const int size = pos - stream_pos; + crc32.update_buf( crc_, buffer + stream_pos, size ); + if( outfd >= 0 && writeblock( outfd, buffer + stream_pos, size ) != size ) + throw Error( "Write error" ); + if( pos >= dictionary_size ) + { partial_data_pos += pos; pos = 0; pos_wrapped = true; } + stream_pos = pos; + } + } + + +bool LZ_decoder::verify_trailer( const Pretty_print & pp ) const + { + Lzip_trailer trailer; + int size = rdec.read_data( trailer.data, Lzip_trailer::size ); + const unsigned long long data_size = data_position(); + const unsigned long long member_size = rdec.member_position(); + bool error = false; + + if( size < Lzip_trailer::size ) + { + error = true; + if( verbosity >= 0 ) + { + pp(); + std::fprintf( stderr, "Trailer truncated at trailer position %d;" + " some checks may fail.\n", size ); + } + while( size < Lzip_trailer::size ) trailer.data[size++] = 0; + } + + const unsigned td_crc = trailer.data_crc(); + if( td_crc != crc() ) + { + error = true; + if( verbosity >= 0 ) + { + pp(); + std::fprintf( stderr, "CRC mismatch; stored %08X, computed %08X\n", + td_crc, crc() ); + } + } + const unsigned long long td_size = trailer.data_size(); + if( td_size != data_size ) + { + error = true; + if( verbosity >= 0 ) + { + pp(); + std::fprintf( stderr, "Data size mismatch; stored %llu (0x%llX), computed %llu (0x%llX)\n", + td_size, td_size, data_size, data_size ); + } + } + const unsigned long long tm_size = trailer.member_size(); + if( tm_size != member_size ) + { + error = true; + if( verbosity >= 0 ) + { + pp(); + std::fprintf( stderr, "Member size mismatch; stored %llu (0x%llX), computed %llu (0x%llX)\n", + tm_size, tm_size, member_size, member_size ); + } + } + if( error ) return false; + if( verbosity >= 2 ) + { + if( verbosity >= 4 ) show_header( dictionary_size ); + if( data_size == 0 || member_size == 0 ) + std::fputs( "no data compressed. ", stderr ); + else + std::fprintf( stderr, "%6.3f:1, %5.2f%% ratio, %5.2f%% saved. ", + (double)data_size / member_size, + ( 100.0 * member_size ) / data_size, + 100.0 - ( ( 100.0 * member_size ) / data_size ) ); + if( verbosity >= 4 ) std::fprintf( stderr, "CRC %08X, ", td_crc ); + if( verbosity >= 3 ) + std::fprintf( stderr, "%9llu out, %8llu in. ", data_size, member_size ); + } + return true; + } + + +/* Return value: 0 = OK, 1 = decoder error, 2 = unexpected EOF, + 3 = trailer error, 4 = unknown marker found. */ +int LZ_decoder::decode_member( const Pretty_print & pp ) + { + Bit_model bm_literal[1<<literal_context_bits][0x300]; + Bit_model bm_match[State::states][pos_states]; + Bit_model bm_rep[State::states]; + Bit_model bm_rep0[State::states]; + Bit_model bm_rep1[State::states]; + Bit_model bm_rep2[State::states]; + Bit_model bm_len[State::states][pos_states]; + Bit_model bm_dis_slot[len_states][1<<dis_slot_bits]; + Bit_model bm_dis[modeled_distances-end_dis_model+1]; + Bit_model bm_align[dis_align_size]; + Len_model match_len_model; + Len_model rep_len_model; + unsigned rep0 = 0; // rep[0-3] latest four distances + unsigned rep1 = 0; // used for efficient coding of + unsigned rep2 = 0; // repeated distances + unsigned rep3 = 0; + State state; + + rdec.load(); + while( !rdec.finished() ) + { + const int pos_state = data_position() & pos_state_mask; + if( rdec.decode_bit( bm_match[state()][pos_state] ) == 0 ) // 1st bit + { + // literal byte + Bit_model * const bm = bm_literal[get_lit_state(peek_prev())]; + if( state.is_char_set_char() ) + put_byte( rdec.decode_tree8( bm ) ); + else + put_byte( rdec.decode_matched( bm, peek( rep0 ) ) ); + continue; + } + // match or repeated match + int len; + if( rdec.decode_bit( bm_rep[state()] ) != 0 ) // 2nd bit + { + if( rdec.decode_bit( bm_rep0[state()] ) == 0 ) // 3rd bit + { + if( rdec.decode_bit( bm_len[state()][pos_state] ) == 0 ) // 4th bit + { state.set_short_rep(); put_byte( peek( rep0 ) ); continue; } + } + else + { + unsigned distance; + if( rdec.decode_bit( bm_rep1[state()] ) == 0 ) // 4th bit + distance = rep1; + else + { + if( rdec.decode_bit( bm_rep2[state()] ) == 0 ) // 5th bit + distance = rep2; + else + { distance = rep3; rep3 = rep2; } + rep2 = rep1; + } + rep1 = rep0; + rep0 = distance; + } + state.set_rep(); + len = rdec.decode_len( rep_len_model, pos_state ); + } + else // match + { + len = rdec.decode_len( match_len_model, pos_state ); + unsigned distance = rdec.decode_tree6( bm_dis_slot[get_len_state(len)] ); + if( distance >= start_dis_model ) + { + const unsigned dis_slot = distance; + const int direct_bits = ( dis_slot >> 1 ) - 1; + distance = ( 2 | ( dis_slot & 1 ) ) << direct_bits; + if( dis_slot < end_dis_model ) + distance += rdec.decode_tree_reversed( + bm_dis + ( distance - dis_slot ), direct_bits ); + else + { + distance += + rdec.decode( direct_bits - dis_align_bits ) << dis_align_bits; + distance += rdec.decode_tree_reversed4( bm_align ); + if( distance == 0xFFFFFFFFU ) // marker found + { + rdec.normalize(); + flush_data(); + if( len == min_match_len ) // End Of Stream marker + { + if( verify_trailer( pp ) ) return 0; else return 3; + } + if( len == min_match_len + 1 ) // Sync Flush marker + { + rdec.load(); continue; + } + if( verbosity >= 0 ) + { + pp(); + std::fprintf( stderr, "Unsupported marker code '%d'\n", len ); + } + return 4; + } + } + } + rep3 = rep2; rep2 = rep1; rep1 = rep0; rep0 = distance; + state.set_match(); + if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) ) + { flush_data(); return 1; } + } + copy_block( rep0, len ); + } + flush_data(); + return 2; + } diff --git a/decoder.h b/decoder.h new file mode 100644 index 0000000..29fdef6 --- /dev/null +++ b/decoder.h @@ -0,0 +1,344 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +class Range_decoder + { + enum { buffer_size = 16384 }; + unsigned long long partial_member_pos; + uint8_t * const buffer; // input buffer + int pos; // current pos in buffer + int stream_pos; // when reached, a new block must be read + uint32_t code; + uint32_t range; + const int infd; // input file descriptor + bool at_stream_end; + + bool read_block(); + + Range_decoder( const Range_decoder & ); // declared as private + void operator=( const Range_decoder & ); // declared as private + +public: + explicit Range_decoder( const int ifd ) + : + partial_member_pos( 0 ), + buffer( new uint8_t[buffer_size] ), + pos( 0 ), + stream_pos( 0 ), + code( 0 ), + range( 0xFFFFFFFFU ), + infd( ifd ), + at_stream_end( false ) + {} + + ~Range_decoder() { delete[] buffer; } + + bool finished() { return pos >= stream_pos && !read_block(); } + + unsigned long long member_position() const + { return partial_member_pos + pos; } + + void reset_member_position() + { partial_member_pos = 0; partial_member_pos -= pos; } + + uint8_t get_byte() + { + // 0xFF avoids decoder error if member is truncated at EOS marker + if( finished() ) return 0xFF; + return buffer[pos++]; + } + + int read_data( uint8_t * const outbuf, const int size ) + { + int sz = 0; + while( sz < size && !finished() ) + { + const int rd = std::min( size - sz, stream_pos - pos ); + std::memcpy( outbuf + sz, buffer + pos, rd ); + pos += rd; + sz += rd; + } + return sz; + } + + void load() + { + code = 0; + for( int i = 0; i < 5; ++i ) code = ( code << 8 ) | get_byte(); + range = 0xFFFFFFFFU; + code &= range; // make sure that first byte is discarded + } + + void normalize() + { + if( range <= 0x00FFFFFFU ) + { range <<= 8; code = ( code << 8 ) | get_byte(); } + } + + unsigned decode( const int num_bits ) + { + unsigned symbol = 0; + for( int i = num_bits; i > 0; --i ) + { + normalize(); + range >>= 1; +// symbol <<= 1; +// if( code >= range ) { code -= range; symbol |= 1; } + const bool bit = ( code >= range ); + symbol <<= 1; symbol += bit; + code -= range & ( 0U - bit ); + } + return symbol; + } + + bool decode_bit( Bit_model & bm ) + { + normalize(); + const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability; + if( code < bound ) + { + range = bound; + bm.probability += + ( bit_model_total - bm.probability ) >> bit_model_move_bits; + return 0; + } + else + { + code -= bound; + range -= bound; + bm.probability -= bm.probability >> bit_model_move_bits; + return 1; + } + } + + void decode_symbol_bit( Bit_model & bm, unsigned & symbol ) + { + normalize(); + symbol <<= 1; + const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability; + if( code < bound ) + { + range = bound; + bm.probability += + ( bit_model_total - bm.probability ) >> bit_model_move_bits; + } + else + { + code -= bound; + range -= bound; + bm.probability -= bm.probability >> bit_model_move_bits; + symbol |= 1; + } + } + + void decode_symbol_bit_reversed( Bit_model & bm, unsigned & model, + unsigned & symbol, const int i ) + { + normalize(); + model <<= 1; + const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability; + if( code < bound ) + { + range = bound; + bm.probability += + ( bit_model_total - bm.probability ) >> bit_model_move_bits; + } + else + { + code -= bound; + range -= bound; + bm.probability -= bm.probability >> bit_model_move_bits; + model |= 1; + symbol |= 1 << i; + } + } + + unsigned decode_tree6( Bit_model bm[] ) + { + unsigned symbol = 1; + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + return symbol & 0x3F; + } + + unsigned decode_tree8( Bit_model bm[] ) + { + unsigned symbol = 1; + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + return symbol & 0xFF; + } + + unsigned decode_tree_reversed( Bit_model bm[], const int num_bits ) + { + unsigned model = 1; + unsigned symbol = 0; + for( int i = 0; i < num_bits; ++i ) + decode_symbol_bit_reversed( bm[model], model, symbol, i ); + return symbol; + } + + unsigned decode_tree_reversed4( Bit_model bm[] ) + { + unsigned model = 1; + unsigned symbol = 0; + decode_symbol_bit_reversed( bm[model], model, symbol, 0 ); + decode_symbol_bit_reversed( bm[model], model, symbol, 1 ); + decode_symbol_bit_reversed( bm[model], model, symbol, 2 ); + decode_symbol_bit_reversed( bm[model], model, symbol, 3 ); + return symbol; + } + + unsigned decode_matched( Bit_model bm[], unsigned match_byte ) + { + Bit_model * const bm1 = bm + 0x100; + unsigned symbol = 1; + while( symbol < 0x100 ) + { + const unsigned match_bit = ( match_byte <<= 1 ) & 0x100; + const bool bit = decode_bit( bm1[symbol+match_bit] ); + symbol <<= 1; symbol |= bit; + if( match_bit >> 8 != bit ) + { + while( symbol < 0x100 ) decode_symbol_bit( bm[symbol], symbol ); + break; + } + } + return symbol & 0xFF; + } + + unsigned decode_len( Len_model & lm, const int pos_state ) + { + Bit_model * bm; + unsigned mask, offset, symbol = 1; + + if( decode_bit( lm.choice1 ) == 0 ) + { bm = lm.bm_low[pos_state]; mask = 7; offset = 0; goto len3; } + if( decode_bit( lm.choice2 ) == 0 ) + { bm = lm.bm_mid[pos_state]; mask = 7; offset = len_low_symbols; goto len3; } + bm = lm.bm_high; mask = 0xFF; offset = len_low_symbols + len_mid_symbols; + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); +len3: + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + decode_symbol_bit( bm[symbol], symbol ); + return ( symbol & mask ) + min_match_len + offset; + } + }; + + +class LZ_decoder + { + unsigned long long partial_data_pos; + Range_decoder & rdec; + const unsigned dictionary_size; + uint8_t * const buffer; // output buffer + unsigned pos; // current pos in buffer + unsigned stream_pos; // first byte not yet written to file + uint32_t crc_; + const int outfd; // output file descriptor + bool pos_wrapped; + + void flush_data(); + bool verify_trailer( const Pretty_print & pp ) const; + + uint8_t peek_prev() const + { return buffer[((pos > 0) ? pos : dictionary_size)-1]; } + + uint8_t peek( const unsigned distance ) const + { + const unsigned i = ( ( pos > distance ) ? 0 : dictionary_size ) + + pos - distance - 1; + return buffer[i]; + } + + void put_byte( const uint8_t b ) + { + buffer[pos] = b; + if( ++pos >= dictionary_size ) flush_data(); + } + + void copy_block( const unsigned distance, unsigned len ) + { + unsigned lpos = pos, i = lpos - distance - 1; + bool fast, fast2; + if( lpos > distance ) + { + fast = ( len < dictionary_size - lpos ); + fast2 = ( fast && len <= lpos - i ); + } + else + { + i += dictionary_size; + fast = ( len < dictionary_size - i ); // (i == pos) may happen + fast2 = ( fast && len <= i - lpos ); + } + if( fast ) // no wrap + { + pos += len; + if( fast2 ) // no wrap, no overlap + std::memcpy( buffer + lpos, buffer + i, len ); + else + for( ; len > 0; --len ) buffer[lpos++] = buffer[i++]; + } + else for( ; len > 0; --len ) + { + buffer[pos] = buffer[i]; + if( ++pos >= dictionary_size ) flush_data(); + if( ++i >= dictionary_size ) i = 0; + } + } + + LZ_decoder( const LZ_decoder & ); // declared as private + void operator=( const LZ_decoder & ); // declared as private + +public: + LZ_decoder( Range_decoder & rde, const unsigned dict_size, const int ofd ) + : + partial_data_pos( 0 ), + rdec( rde ), + dictionary_size( dict_size ), + buffer( new uint8_t[dictionary_size] ), + pos( 0 ), + stream_pos( 0 ), + crc_( 0xFFFFFFFFU ), + outfd( ofd ), + pos_wrapped( false ) + // prev_byte of first byte; also for peek( 0 ) on corrupt file + { buffer[dictionary_size-1] = 0; } + + ~LZ_decoder() { delete[] buffer; } + + unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; } + unsigned long long data_position() const { return partial_data_pos + pos; } + + int decode_member( const Pretty_print & pp ); + }; diff --git a/doc/lzip.1 b/doc/lzip.1 new file mode 100644 index 0000000..720c7a0 --- /dev/null +++ b/doc/lzip.1 @@ -0,0 +1,130 @@ +.\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.47.16. +.TH LZIP "1" "January 2022" "lzip 1.23" "User Commands" +.SH NAME +lzip \- reduces the size of files +.SH SYNOPSIS +.B lzip +[\fI\,options\/\fR] [\fI\,files\/\fR] +.SH DESCRIPTION +Lzip is a lossless data compressor with a user interface similar to the one +of gzip or bzip2. Lzip uses a simplified form of the 'Lempel\-Ziv\-Markov +chain\-Algorithm' (LZMA) stream format and provides a 3 factor integrity +checking to maximize interoperability and optimize safety. Lzip can compress +about as fast as gzip (lzip \fB\-0\fR) or compress most files more than bzip2 +(lzip \fB\-9\fR). Decompression speed is intermediate between gzip and bzip2. +Lzip is better than gzip and bzip2 from a data recovery perspective. Lzip +has been designed, written, and tested with great care to replace gzip and +bzip2 as the standard general\-purpose compressed format for unix\-like +systems. +.SH OPTIONS +.TP +\fB\-h\fR, \fB\-\-help\fR +display this help and exit +.TP +\fB\-V\fR, \fB\-\-version\fR +output version information and exit +.TP +\fB\-a\fR, \fB\-\-trailing\-error\fR +exit with error status if trailing data +.TP +\fB\-b\fR, \fB\-\-member\-size=\fR<bytes> +set member size limit in bytes +.TP +\fB\-c\fR, \fB\-\-stdout\fR +write to standard output, keep input files +.TP +\fB\-d\fR, \fB\-\-decompress\fR +decompress +.TP +\fB\-f\fR, \fB\-\-force\fR +overwrite existing output files +.TP +\fB\-F\fR, \fB\-\-recompress\fR +force re\-compression of compressed files +.TP +\fB\-k\fR, \fB\-\-keep\fR +keep (don't delete) input files +.TP +\fB\-l\fR, \fB\-\-list\fR +print (un)compressed file sizes +.TP +\fB\-m\fR, \fB\-\-match\-length=\fR<bytes> +set match length limit in bytes [36] +.TP +\fB\-o\fR, \fB\-\-output=\fR<file> +write to <file>, keep input files +.TP +\fB\-q\fR, \fB\-\-quiet\fR +suppress all messages +.TP +\fB\-s\fR, \fB\-\-dictionary\-size=\fR<bytes> +set dictionary size limit in bytes [8 MiB] +.TP +\fB\-S\fR, \fB\-\-volume\-size=\fR<bytes> +set volume size limit in bytes +.TP +\fB\-t\fR, \fB\-\-test\fR +test compressed file integrity +.TP +\fB\-v\fR, \fB\-\-verbose\fR +be verbose (a 2nd \fB\-v\fR gives more) +.TP +\fB\-0\fR .. \fB\-9\fR +set compression level [default 6] +.TP +\fB\-\-fast\fR +alias for \fB\-0\fR +.TP +\fB\-\-best\fR +alias for \fB\-9\fR +.TP +\fB\-\-loose\-trailing\fR +allow trailing data seeming corrupt header +.PP +If no file names are given, or if a file is '\-', lzip compresses or +decompresses from standard input to standard output. +Numbers may be followed by a multiplier: k = kB = 10^3 = 1000, +Ki = KiB = 2^10 = 1024, M = 10^6, Mi = 2^20, G = 10^9, Gi = 2^30, etc... +Dictionary sizes 12 to 29 are interpreted as powers of two, meaning 2^12 +to 2^29 bytes. +.PP +The bidimensional parameter space of LZMA can't be mapped to a linear +scale optimal for all files. If your files are large, very repetitive, +etc, you may need to use the options \fB\-\-dictionary\-size\fR and \fB\-\-match\-length\fR +directly to achieve optimal performance. +.PP +To extract all the files from archive 'foo.tar.lz', use the commands +\&'tar \fB\-xf\fR foo.tar.lz' or 'lzip \fB\-cd\fR foo.tar.lz | tar \fB\-xf\fR \-'. +.PP +Exit status: 0 for a normal exit, 1 for environmental problems (file +not found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or +invalid input file, 3 for an internal consistency error (e.g., bug) which +caused lzip to panic. +.PP +The ideas embodied in lzip are due to (at least) the following people: +Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for the +definition of Markov chains), G.N.N. Martin (for the definition of range +encoding), Igor Pavlov (for putting all the above together in LZMA), and +Julian Seward (for bzip2's CLI). +.SH "REPORTING BUGS" +Report bugs to lzip\-bug@nongnu.org +.br +Lzip home page: http://www.nongnu.org/lzip/lzip.html +.SH COPYRIGHT +Copyright \(co 2022 Antonio Diaz Diaz. +License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl.html> +.br +This is free software: you are free to change and redistribute it. +There is NO WARRANTY, to the extent permitted by law. +.SH "SEE ALSO" +The full documentation for +.B lzip +is maintained as a Texinfo manual. If the +.B info +and +.B lzip +programs are properly installed at your site, the command +.IP +.B info lzip +.PP +should give you access to the complete manual. diff --git a/doc/lzip.info b/doc/lzip.info new file mode 100644 index 0000000..7c6d812 --- /dev/null +++ b/doc/lzip.info @@ -0,0 +1,1708 @@ +This is lzip.info, produced by makeinfo version 4.13+ from lzip.texi. + +INFO-DIR-SECTION Compression +START-INFO-DIR-ENTRY +* Lzip: (lzip). LZMA lossless data compressor +END-INFO-DIR-ENTRY + + +File: lzip.info, Node: Top, Next: Introduction, Up: (dir) + +Lzip Manual +*********** + +This manual is for Lzip (version 1.23, 24 January 2022). + +* Menu: + +* Introduction:: Purpose and features of lzip +* Output:: Meaning of lzip's output +* Invoking lzip:: Command line interface +* Quality assurance:: Design, development, and testing of lzip +* Algorithm:: How lzip compresses the data +* File format:: Detailed format of the compressed file +* Stream format:: Format of the LZMA stream in lzip files +* Trailing data:: Extra data appended to the file +* Examples:: A small tutorial with examples +* Problems:: Reporting bugs +* Reference source code:: Source code illustrating stream format +* Concept index:: Index of concepts + + + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This manual is free documentation: you have unlimited permission to copy, +distribute, and modify it. + + +File: lzip.info, Node: Introduction, Next: Output, Prev: Top, Up: Top + +1 Introduction +************** + +Lzip is a lossless data compressor with a user interface similar to the one +of gzip or bzip2. Lzip uses a simplified form of the 'Lempel-Ziv-Markov +chain-Algorithm' (LZMA) stream format and provides a 3 factor integrity +checking to maximize interoperability and optimize safety. Lzip can compress +about as fast as gzip (lzip -0) or compress most files more than bzip2 +(lzip -9). Decompression speed is intermediate between gzip and bzip2. Lzip +is better than gzip and bzip2 from a data recovery perspective. Lzip has +been designed, written, and tested with great care to replace gzip and +bzip2 as the standard general-purpose compressed format for unix-like +systems. + + For compressing/decompressing large files on multiprocessor machines +plzip can be much faster than lzip at the cost of a slightly reduced +compression ratio. *Note plzip manual: (plzip)Top. + + For creation and manipulation of compressed tar archives tarlz can be +more efficient than using tar and plzip because tarlz is able to keep the +alignment between tar members and lzip members. *Note tarlz manual: +(tarlz)Top. + + The lzip file format is designed for data sharing and long-term +archiving, taking into account both data integrity and decoder availability: + + * The lzip format provides very safe integrity checking and some data + recovery means. The program lziprecover can repair bit flip errors + (one of the most common forms of data corruption) in lzip files, and + provides data recovery capabilities, including error-checked merging + of damaged copies of a file. *Note Data safety: (lziprecover)Data + safety. + + * The lzip format is as simple as possible (but not simpler). The lzip + manual provides the source code of a simple decompressor along with a + detailed explanation of how it works, so that with the only help of the + lzip manual it would be possible for a digital archaeologist to extract + the data from a lzip file long after quantum computers eventually + render LZMA obsolete. + + * Additionally the lzip reference implementation is copylefted, which + guarantees that it will remain free forever. + + A nice feature of the lzip format is that a corrupt byte is easier to +repair the nearer it is from the beginning of the file. Therefore, with the +help of lziprecover, losing an entire archive just because of a corrupt +byte near the beginning is a thing of the past. + + The member trailer stores the 32-bit CRC of the original data, the size +of the original data, and the size of the member. These values, together +with the "End Of Stream" marker, provide a 3 factor integrity checking +which guarantees that the decompressed version of the data is identical to +the original. This guards against corruption of the compressed data, and +against undetected bugs in lzip (hopefully very unlikely). The chances of +data corruption going undetected are microscopic. Be aware, though, that +the check occurs upon decompression, so it can only tell you that something +is wrong. It can't help you recover the original uncompressed data. + + Lzip uses the same well-defined exit status values used by bzip2, which +makes it safer than compressors returning ambiguous warning values (like +gzip) when it is used as a back end for other programs like tar or zutils. + + Lzip will automatically use for each file the largest dictionary size +that does not exceed neither the file size nor the limit given. Keep in +mind that the decompression memory requirement is affected at compression +time by the choice of dictionary size limit. + + The amount of memory required for compression is about 1 or 2 times the +dictionary size limit (1 if input file size is less than dictionary size +limit, else 2) plus 9 times the dictionary size really used. The option +'-0' is special and only requires about 1.5 MiB at most. The amount of +memory required for decompression is about 46 kB larger than the dictionary +size really used. + + When compressing, lzip replaces every file given in the command line +with a compressed version of itself, with the name "original_name.lz". When +decompressing, lzip attempts to guess the name for the decompressed file +from that of the compressed file as follows: + +filename.lz becomes filename +filename.tlz becomes filename.tar +anyothername becomes anyothername.out + + (De)compressing a file is much like copying or moving it. Therefore lzip +preserves the access and modification dates, permissions, and, when +possible, ownership of the file just as 'cp -p' does. (If the user ID or +the group ID can't be duplicated, the file permission bits S_ISUID and +S_ISGID are cleared). + + Lzip is able to read from some types of non-regular files if either the +option '-c' or the option '-o' is specified. + + Lzip will refuse to read compressed data from a terminal or write +compressed data to a terminal, as this would be entirely incomprehensible +and might leave the terminal in an abnormal state. + + Lzip will correctly decompress a file which is the concatenation of two +or more compressed files. The result is the concatenation of the +corresponding decompressed files. Integrity testing of concatenated +compressed files is also supported. + + Lzip can produce multimember files, and lziprecover can safely recover +the undamaged members in case of file damage. Lzip can also split the +compressed output in volumes of a given size, even when reading from +standard input. This allows the direct creation of multivolume compressed +tar archives. + + Lzip is able to compress and decompress streams of unlimited size by +automatically creating multimember output. The members so created are large, +about 2 PiB each. + + +File: lzip.info, Node: Output, Next: Invoking lzip, Prev: Introduction, Up: Top + +2 Meaning of lzip's output +************************** + +The output of lzip looks like this: + + lzip -v foo + foo: 6.676:1, 14.98% ratio, 85.02% saved, 450560 in, 67493 out. + + lzip -tvvv foo.lz + foo.lz: 6.676:1, 14.98% ratio, 85.02% saved. 450560 out, 67493 in. ok + + The meaning of each field is as follows: + +'N:1' + The compression ratio (uncompressed_size / compressed_size), shown as + N to 1. + +'ratio' + The inverse compression ratio (compressed_size / uncompressed_size), + shown as a percentage. A decimal ratio is easily obtained by moving the + decimal point two places to the left; 14.98% = 0.1498. + +'saved' + The space saved by compression (1 - ratio), shown as a percentage. + +'in' + Size of the input data. This is the uncompressed size when + compressing, or the compressed size when decompressing or testing. + Note that lzip always prints the uncompressed size before the + compressed size when compressing, decompressing, testing, or listing. + +'out' + Size of the output data. This is the compressed size when compressing, + or the decompressed size when decompressing or testing. + + + When decompressing or testing at verbosity level 4 (-vvvv), the +dictionary size used to compress the file and the CRC32 of the uncompressed +data are also shown. + + LANGUAGE NOTE: Uncompressed = not compressed = plain data; it may never +have been compressed. Decompressed is used to refer to data which have +undergone the process of decompression. + + +File: lzip.info, Node: Invoking lzip, Next: Quality assurance, Prev: Output, Up: Top + +3 Invoking lzip +*************** + +The format for running lzip is: + + lzip [OPTIONS] [FILES] + +If no file names are specified, lzip compresses (or decompresses) from +standard input to standard output. A hyphen '-' used as a FILE argument +means standard input. It can be mixed with other FILES and is read just +once, the first time it appears in the command line. + + lzip supports the following options: *Note Argument syntax: +(arg_parser)Argument syntax. + +'-h' +'--help' + Print an informative help message describing the options and exit. + +'-V' +'--version' + Print the version number of lzip on the standard output and exit. This + version number should be included in all bug reports. + +'-a' +'--trailing-error' + Exit with error status 2 if any remaining input is detected after + decompressing the last member. Such remaining input is usually trailing + garbage that can be safely ignored. *Note concat-example::. + +'-b BYTES' +'--member-size=BYTES' + When compressing, set the member size limit to BYTES. It is advisable + to keep members smaller than RAM size so that they can be repaired with + lziprecover in case of corruption. A small member size may degrade + compression ratio, so use it only when needed. Valid values range from + 100 kB to 2 PiB. Defaults to 2 PiB. + +'-c' +'--stdout' + Compress or decompress to standard output; keep input files unchanged. + If compressing several files, each file is compressed independently. + (The output consists of a sequence of independently compressed + members). This option (or '-o') is needed when reading from a named + pipe (fifo) or from a device. Use it also to recover as much of the + decompressed data as possible when decompressing a corrupt file. '-c' + overrides '-o' and '-S'. '-c' has no effect when testing or listing. + +'-d' +'--decompress' + Decompress the files specified. If a file does not exist, can't be + opened, or the destination file already exists and '--force' has not + been specified, lzip continues decompressing the rest of the files and + exits with error status 1. If a file fails to decompress, or is a + terminal, lzip exits immediately with error status 2 without + decompressing the rest of the files. A terminal is considered an + uncompressed file, and therefore invalid. + +'-f' +'--force' + Force overwrite of output files. + +'-F' +'--recompress' + When compressing, force re-compression of files whose name already has + the '.lz' or '.tlz' suffix. + +'-k' +'--keep' + Keep (don't delete) input files during compression or decompression. + +'-l' +'--list' + Print the uncompressed size, compressed size, and percentage saved of + the files specified. Trailing data are ignored. The values produced + are correct even for multimember files. If more than one file is + given, a final line containing the cumulative sizes is printed. With + '-v', the dictionary size, the number of members in the file, and the + amount of trailing data (if any) are also printed. With '-vv', the + positions and sizes of each member in multimember files are also + printed. + + If any file is damaged, does not exist, can't be opened, or is not + regular, the final exit status will be > 0. '-lq' can be used to verify + quickly (without decompressing) the structural integrity of the files + specified. (Use '--test' to verify the data integrity). '-alq' + additionally verifies that none of the files specified contain + trailing data. + +'-m BYTES' +'--match-length=BYTES' + When compressing, set the match length limit in bytes. After a match + this long is found, the search is finished. Valid values range from 5 + to 273. Larger values usually give better compression ratios but longer + compression times. + +'-o FILE' +'--output=FILE' + If '-c' has not been also specified, write the (de)compressed output to + FILE; keep input files unchanged. If compressing several files, each + file is compressed independently. (The output consists of a sequence of + independently compressed members). This option (or '-c') is needed when + reading from a named pipe (fifo) or from a device. '-o -' is + equivalent to '-c'. '-o' has no effect when testing or listing. + + In order to keep backward compatibility with lzip versions prior to + 1.22, when compressing from standard input and no other file names are + given, the extension '.lz' is appended to FILE unless it already ends + in '.lz' or '.tlz'. This feature will be removed in a future version + of lzip. Meanwhile, redirection may be used instead of '-o' to write + the compressed output to a file without the extension '.lz' in its + name: 'lzip < file > foo'. + + When compressing and splitting the output in volumes, FILE is used as + a prefix, and several files named 'FILE00001.lz', 'FILE00002.lz', etc, + are created. In this case, only one input file is allowed. + +'-q' +'--quiet' + Quiet operation. Suppress all messages. + +'-s BYTES' +'--dictionary-size=BYTES' + When compressing, set the dictionary size limit in bytes. Lzip will use + for each file the largest dictionary size that does not exceed neither + the file size nor this limit. Valid values range from 4 KiB to + 512 MiB. Values 12 to 29 are interpreted as powers of two, meaning + 2^12 to 2^29 bytes. Dictionary sizes are quantized so that they can be + coded in just one byte (*note coded-dict-size::). If the size specified + does not match one of the valid sizes, it will be rounded upwards by + adding up to (BYTES / 8) to it. + + For maximum compression you should use a dictionary size limit as large + as possible, but keep in mind that the decompression memory requirement + is affected at compression time by the choice of dictionary size limit. + +'-S BYTES' +'--volume-size=BYTES' + When compressing, and '-c' has not been also specified, split the + compressed output into several volume files with names + 'original_name00001.lz', 'original_name00002.lz', etc, and set the + volume size limit to BYTES. Input files are kept unchanged. Each + volume is a complete, maybe multimember, lzip file. A small volume + size may degrade compression ratio, so use it only when needed. Valid + values range from 100 kB to 4 EiB. + +'-t' +'--test' + Check integrity of the files specified, but don't decompress them. This + really performs a trial decompression and throws away the result. Use + it together with '-v' to see information about the files. If a file + fails the test, does not exist, can't be opened, or is a terminal, lzip + continues checking the rest of the files. A final diagnostic is shown + at verbosity level 1 or higher if any file fails the test when testing + multiple files. + +'-v' +'--verbose' + Verbose mode. + When compressing, show the compression ratio and size for each file + processed. + When decompressing or testing, further -v's (up to 4) increase the + verbosity level, showing status, compression ratio, dictionary size, + trailer contents (CRC, data size, member size), and up to 6 bytes of + trailing data (if any) both in hexadecimal and as a string of printable + ASCII characters. + Two or more '-v' options show the progress of (de)compression. + +'-0 .. -9' + Compression level. Set the compression parameters (dictionary size and + match length limit) as shown in the table below. The default + compression level is '-6', equivalent to '-s8MiB -m36'. Note that '-9' + can be much slower than '-0'. These options have no effect when + decompressing, testing, or listing. + + The bidimensional parameter space of LZMA can't be mapped to a linear + scale optimal for all files. If your files are large, very repetitive, + etc, you may need to use the options '--dictionary-size' and + '--match-length' directly to achieve optimal performance. + + If several compression levels or '-s' or '-m' options are given, the + last setting is used. For example '-9 -s64MiB' is equivalent to + '-s64MiB -m273' + + Level Dictionary size (-s) Match length limit (-m) + -0 64 KiB 16 bytes + -1 1 MiB 5 bytes + -2 1.5 MiB 6 bytes + -3 2 MiB 8 bytes + -4 3 MiB 12 bytes + -5 4 MiB 20 bytes + -6 8 MiB 36 bytes + -7 16 MiB 68 bytes + -8 24 MiB 132 bytes + -9 32 MiB 273 bytes + +'--fast' +'--best' + Aliases for GNU gzip compatibility. + +'--loose-trailing' + When decompressing, testing, or listing, allow trailing data whose + first bytes are so similar to the magic bytes of a lzip header that + they can be confused with a corrupt header. Use this option if a file + triggers a "corrupt header" error and the cause is not indeed a + corrupt header. + + + Numbers given as arguments to options may be followed by a multiplier +and an optional 'B' for "byte". + + Table of SI and binary prefixes (unit multipliers): + +Prefix Value | Prefix Value +k kilobyte (10^3 = 1000) | Ki kibibyte (2^10 = 1024) +M megabyte (10^6) | Mi mebibyte (2^20) +G gigabyte (10^9) | Gi gibibyte (2^30) +T terabyte (10^12) | Ti tebibyte (2^40) +P petabyte (10^15) | Pi pebibyte (2^50) +E exabyte (10^18) | Ei exbibyte (2^60) +Z zettabyte (10^21) | Zi zebibyte (2^70) +Y yottabyte (10^24) | Yi yobibyte (2^80) + + + Exit status: 0 for a normal exit, 1 for environmental problems (file not +found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or invalid +input file, 3 for an internal consistency error (e.g., bug) which caused +lzip to panic. + + +File: lzip.info, Node: Quality assurance, Next: Algorithm, Prev: Invoking lzip, Up: Top + +4 Design, development, and testing of lzip +****************************************** + +There are two ways of constructing a software design: One way is to make it +so simple that there are obviously no deficiencies and the other way is to +make it so complicated that there are no obvious deficiencies. The first +method is far more difficult. +-- C.A.R. Hoare + + Lzip is developed by volunteers who lack the resources required for +extensive testing in all circumstances. It is up to you to test lzip before +using it in mission-critical applications. However, a compressor like lzip +is not a toy, and maintaining it is not a hobby. Many people's data depend +on it. Therefore the lzip file format has been reviewed carefully and is +believed to be free from negligent design errors. + + Lzip has been designed, written, and tested with great care to replace +gzip and bzip2 as the standard general-purpose compressed format for +unix-like systems. This chapter describes the lessons learned from these +previous formats, and their application to the design of lzip. + + +4.1 Format design +================= + +When gzip was designed in 1992, computers and operating systems were much +less capable than they are today. The designers of gzip tried to work around +some of those limitations, like 8.3 file names, with additional fields in +the file format. + + Today those limitations have mostly disappeared, and the format of gzip +has proved to be unnecessarily complicated. It includes fields that were +never used, others that have lost their usefulness, and finally others that +have become too limited. + + Bzip2 was designed 5 years later, and its format is simpler than the one +of gzip. + + Probably the worst defect of the gzip format from the point of view of +data safety is the variable size of its header. If the byte at offset 3 +(flags) of a gzip member gets corrupted, it may become difficult to recover +the data, even if the compressed blocks are intact, because it can't be +known with certainty where the compressed blocks begin. + + By contrast, the header of a lzip member has a fixed length of 6. The +LZMA stream in a lzip member always starts at offset 6, making it trivial to +recover the data even if the whole header becomes corrupt. + + Bzip2 also provides a header of fixed length and marks the begin and end +of each compressed block with six magic bytes, making it possible to find +the compressed blocks even in case of file damage. But bzip2 does not store +the size of each compressed block, as lzip does. + + Lziprecover is able to provide unique data recovery capabilities because +the lzip format is extraordinarily safe. The simple and safe design of the +file format complements the embedded error detection provided by the LZMA +data stream. Any distance larger than the dictionary size acts as a +forbidden symbol, allowing the decompressor to detect the approximate +position of errors, and leaving very little work for the check sequence +(CRC and data sizes) in the detection of errors. Lzip is usually able to +detect all possible bit flips in the compressed data without resorting to +the check sequence. It would be difficult to write an automatic recovery +tool like lziprecover for the gzip format. And, as far as I know, it has +never been written. + + Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the +decompressed data because it provides optimal accuracy in the detection of +errors up to a compressed size of about 16 GiB, a size larger than that of +most files. In the case of lzip, the additional detection capability of the +decompressor reduces the probability of undetected errors several million +times more, resulting in a combined integrity checking optimally accurate +for any member size produced by lzip. Preliminary results suggest that the +lzip format is safe enough to be used in critical safety avionics systems. + + The lzip format is designed for long-term archiving. Therefore it +excludes any unneeded features that may interfere with the future +extraction of the decompressed data. + + +4.1.1 Gzip format (mis)features not present in lzip +--------------------------------------------------- + +'Multiple algorithms' + Gzip provides a CM (Compression Method) field that has never been used + because it is a bad idea to begin with. New compression methods may + require additional fields, making it impossible to implement new + methods and, at the same time, keep the same format. This field does + not solve the problem of format proliferation; it just makes the + problem less obvious. + +'Optional fields in header' + Unless special precautions are taken, optional fields are generally a + bad idea because they produce a header of variable size. The gzip + header has 2 fields that, in addition to being optional, are + zero-terminated. This means that if any byte inside the field gets + zeroed, or if the terminating zero gets altered, gzip won't be able to + find neither the header CRC nor the compressed blocks. + +'Optional CRC for the header' + Using an optional CRC for the header is not only a bad idea, it is an + error; it circumvents the Hamming distance (HD) of the CRC and may + prevent the extraction of perfectly good data. For example, if the CRC + is used and the bit enabling it is reset by a bit flip, the header + will appear to be intact (in spite of being corrupt) while the + compressed blocks will appear to be totally unrecoverable (in spite of + being intact). Very misleading indeed. + +'Metadata' + The gzip format stores some metadata, like the modification time of the + original file or the operating system on which compression took place. + This complicates reproducible compression (obtaining identical + compressed output from identical input). + + +4.1.2 Lzip format improvements over gzip and bzip2 +-------------------------------------------------- + +'64-bit size field' + Probably the most frequently reported shortcoming of the gzip format + is that it only stores the least significant 32 bits of the + uncompressed size. The size of any file larger than 4 GiB gets + truncated. + + Bzip2 does not store the uncompressed size of the file. + + The lzip format provides a 64-bit field for the uncompressed size. + Additionally, lzip produces multimember output automatically when the + size is too large for a single member, allowing for an unlimited + uncompressed size. + +'Distributed index' + The lzip format provides a distributed index that, among other things, + helps plzip to decompress several times faster than pigz and helps + lziprecover do its job. Neither the gzip format nor the bzip2 format + do provide an index. + + A distributed index is safer and more scalable than a monolithic + index. The monolithic index introduces a single point of failure in + the compressed file and may limit the number of members or the total + uncompressed size. + + +4.2 Quality of implementation +============================= + +'Accurate and robust error detection' + The lzip format provides 3 factor integrity checking, and the + decompressors report mismatches in each factor separately. This method + detects most false positives for corruption. If just one byte in one + factor fails but the other two factors match the data, it probably + means that the data are intact and the corruption just affects the + mismatching factor (CRC, data size, or member size) in the member + trailer. + +'Multiple implementations' + Just like the lzip format provides 3 factor protection against + undetected data corruption, the development methodology of the lzip + family of compressors provides 3 factor protection against undetected + programming errors. + + Three related but independent compressor implementations, lzip, clzip, + and minilzip/lzlib, are developed concurrently. Every stable release + of any of them is tested to verify that it produces identical output + to the other two. This guarantees that all three implement the same + algorithm, and makes it unlikely that any of them may contain serious + undiscovered errors. In fact, no errors have been discovered in lzip + since 2009. + + Additionally, the three implementations have been extensively tested + with unzcrash, valgrind, and 'american fuzzy lop' without finding a + single vulnerability or false negative. *Note Unzcrash: + (lziprecover)Unzcrash. + +'Dictionary size' + Lzip automatically adapts the dictionary size to the size of each file. + In addition to reducing the amount of memory required for + decompression, this feature also minimizes the probability of being + affected by RAM errors during compression. + +'Exit status' + Returning a warning status of 2 is a design flaw of compress that + leaked into the design of gzip. Both bzip2 and lzip are free from this + flaw. + + + +File: lzip.info, Node: Algorithm, Next: File format, Prev: Quality assurance, Up: Top + +5 Algorithm +*********** + +In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a +concrete algorithm; it is more like "any algorithm using the LZMA coding +scheme". LZMA compression consists in describing the uncompressed data as a +succession of coding sequences from the set shown in Section 'What is +coded' (*note what-is-coded::), and then encoding them using a range +encoder. For example, the option '-0' of lzip uses the scheme in almost the +simplest way possible; issuing the longest match it can find, or a literal +byte if it can't find a match. Inversely, a much more elaborated way of +finding coding sequences of minimum size than the one currently used by +lzip could be developed, and the resulting sequence could also be coded +using the LZMA coding scheme. + + Lzip currently implements two variants of the LZMA algorithm: fast (used +by option '-0') and normal (used by all other compression levels). + + The high compression of LZMA comes from combining two basic, well-proven +compression ideas: sliding dictionaries (LZ77/78) and markov models (the +thing used by every compression algorithm that uses a range encoder or +similar order-0 entropy coder as its last stage) with segregation of +contexts according to what the bits are used for. + + Lzip is a two stage compressor. The first stage is a Lempel-Ziv coder, +which reduces redundancy by translating chunks of data to their +corresponding distance-length pairs. The second stage is a range encoder +that uses a different probability model for each type of data: distances, +lengths, literal bytes, etc. + + Here is how it works, step by step: + + 1) The member header is written to the output stream. + + 2) The first byte is coded literally, because there are no previous +bytes to which the match finder can refer to. + + 3) The main encoder advances to the next byte in the input data and +calls the match finder. + + 4) The match finder fills an array with the minimum distances before the +current byte where a match of a given length can be found. + + 5) Go back to step 3 until a sequence (formed of pairs, repeated +distances, and literal bytes) of minimum price has been formed. Where the +price represents the number of output bits produced. + + 6) The range encoder encodes the sequence produced by the main encoder +and sends the bytes produced to the output stream. + + 7) Go back to step 3 until the input data are finished or until the +member or volume size limits are reached. + + 8) The range encoder is flushed. + + 9) The member trailer is written to the output stream. + + 10) If there are more data to compress, go back to step 1. + + + During compression, lzip reads data in large blocks (one dictionary size +at a time). Therefore it may block for up to tens of seconds any process +feeding data to it through a pipe. This is normal. The blocking intervals +get longer with higher compression levels because dictionary size increases +(and compression speed decreases) with compression level. + +The ideas embodied in lzip are due to (at least) the following people: +Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for the +definition of Markov chains), G.N.N. Martin (for the definition of range +encoding), Igor Pavlov (for putting all the above together in LZMA), and +Julian Seward (for bzip2's CLI). + + +File: lzip.info, Node: File format, Next: Stream format, Prev: Algorithm, Up: Top + +6 File format +************* + +Perfection is reached, not when there is no longer anything to add, but +when there is no longer anything to take away. +-- Antoine de Saint-Exupery + + + In the diagram below, a box like this: + ++---+ +| | <-- the vertical bars might be missing ++---+ + + represents one byte; a box like this: + ++==============+ +| | ++==============+ + + represents a variable number of bytes. + + + A lzip file consists of a series of independent "members" (compressed +data sets). The members simply appear one after another in the file, with no +additional information before, between, or after them. Each member can +encode in compressed form up to 16 EiB - 1 byte of uncompressed data. The +size of a multimember file is unlimited. + + Each member has the following structure: + ++--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +| ID string | VN | DS | LZMA stream | CRC32 | Data size | Member size | ++--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ + + All multibyte values are stored in little endian order. + +'ID string (the "magic" bytes)' + A four byte string, identifying the lzip format, with the value "LZIP" + (0x4C, 0x5A, 0x49, 0x50). + +'VN (version number, 1 byte)' + Just in case something needs to be modified in the future. 1 for now. + +'DS (coded dictionary size, 1 byte)' + The dictionary size is calculated by taking a power of 2 (the base + size) and subtracting from it a fraction between 0/16 and 7/16 of the + base size. + Bits 4-0 contain the base 2 logarithm of the base size (12 to 29). + Bits 7-5 contain the numerator of the fraction (0 to 7) to subtract + from the base size to obtain the dictionary size. + Example: 0xD3 = 2^19 - 6 * 2^15 = 512 KiB - 6 * 32 KiB = 320 KiB + Valid values for dictionary size range from 4 KiB to 512 MiB. + +'LZMA stream' + The LZMA stream, finished by an "End Of Stream" marker. Uses default + values for encoder properties. *Note Stream format::, for a complete + description. + +'CRC32 (4 bytes)' + Cyclic Redundancy Check (CRC) of the original uncompressed data. + +'Data size (8 bytes)' + Size of the original uncompressed data. + +'Member size (8 bytes)' + Total size of the member, including header and trailer. This field acts + as a distributed index, allows the verification of stream integrity, + and facilitates the safe recovery of undamaged members from + multimember files. Member size should be limited to 2 PiB to prevent + the data size field from overflowing. + + + +File: lzip.info, Node: Stream format, Next: Trailing data, Prev: File format, Up: Top + +7 Format of the LZMA stream in lzip files +***************************************** + +The LZMA algorithm has three parameters, called "special LZMA properties", +to adjust it for some kinds of binary data. These parameters are: +'literal_context_bits' (with a default value of 3), +'literal_pos_state_bits' (with a default value of 0), and 'pos_state_bits' +(with a default value of 2). As a general purpose compressor, lzip only +uses the default values for these parameters. In particular +'literal_pos_state_bits' has been optimized away and does not even appear +in the code. + + Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker (the +distance-length pair 0xFFFFFFFFU, 2), which in conjunction with the 'member +size' field in the member trailer allows the verification of stream +integrity. The EOS marker is the only marker allowed in lzip files. The +LZMA stream in lzip files always has these two features (default properties +and EOS marker) and is referred to in this document as LZMA-302eos. This +simplified form of the LZMA stream format has been chosen to maximize +interoperability and safety. + + The second stage of LZMA is a range encoder that uses a different +probability model for each type of symbol: distances, lengths, literal +bytes, etc. Range encoding conceptually encodes all the symbols of the +message into one number. Unlike Huffman coding, which assigns to each +symbol a bit-pattern and concatenates all the bit-patterns together, range +encoding can compress one symbol to less than one bit. Therefore the +compressed data produced by a range encoder can't be split in pieces that +could be described individually. + + It seems that the only way of describing the LZMA-302eos stream is to +describe the algorithm that decodes it. And given the many details about +the range decoder that need to be described accurately, the source code of +a real decompressor seems the only appropriate reference to use. + + What follows is a description of the decoding algorithm for LZMA-302eos +streams using as reference the source code of "lzd", an educational +decompressor for lzip files which can be downloaded from the lzip download +directory. Lzd is written in C++11 and its source code is included in +appendix A. *Note Reference source code::. + + +7.1 What is coded +================= + +The LZMA stream includes literals, matches, and repeated matches (matches +reusing a recently used distance). There are 7 different coding sequences: + +Bit sequence Name Description +----------------------------------------------------------------------------- +0 + byte literal literal byte +1 + 0 + len + dis match distance-length pair +1 + 1 + 0 + 0 shortrep 1 byte match at latest used distance +1 + 1 + 0 + 1 + len rep0 len bytes match at latest used distance +1 + 1 + 1 + 0 + len rep1 len bytes match at second latest used + distance +1 + 1 + 1 + 1 + 0 + len rep2 len bytes match at third latest used + distance +1 + 1 + 1 + 1 + 1 + len rep3 len bytes match at fourth latest used + distance + + + In the following tables, multibit sequences are coded in normal order, +from most significant bit (MSB) to least significant bit (LSB), except +where noted otherwise. + + Lengths (the 'len' in the table above) are coded as follows: + +Bit sequence Description +---------------------------------------------------------------------------- +0 + 3 bits lengths from 2 to 9 +1 + 0 + 3 bits lengths from 10 to 17 +1 + 1 + 8 bits lengths from 18 to 273 + + + The coding of distances is a little more complicated, so I'll begin by +explaining a simpler version of the encoding. + + Imagine you need to encode a number from 0 to 2^32 - 1, and you want to +do it in a way that produces shorter codes for the smaller numbers. You may +first encode the position of the most significant bit that is set to 1, +which you may find by making a bit scan from the left (from the MSB). A +position of 0 means that the number is 0 (no bit is set), 1 means the LSB is +the first bit set (the number is 1), and 32 means the MSB is set (i.e., the +number is >= 0x80000000). Then, if the position is >= 2, you encode the +remaining position - 1 bits. Let's call these bits "direct bits" because +they are coded directly by value instead of indirectly by position. + + The inconvenient of this simple method is that it needs 6 bits to encode +the position, but it just uses 33 of the 64 possible values, wasting almost +half of the codes. + + The intelligent trick of LZMA is that it encodes in what it calls a +"slot" the position of the most significant bit set, along with the value +of the next bit, using the same 6 bits that would take to encode the +position alone. This seems to need 66 slots (twice the number of +positions), but for positions 0 and 1 there is no next bit, so the number +of slots needed is 64 (0 to 63). + + The 6 bits representing this "slot number" are then context-coded. If +the distance is >= 4, the remaining bits are encoded as follows. +'direct_bits' is the amount of remaining bits (from 1 to 30) needed to form +a complete distance, and is calculated as (slot >> 1) - 1. If a distance +needs 6 or more direct_bits, the last 4 bits are encoded separately. The +last piece (all the direct_bits for distances 4 to 127, or the last 4 bits +for distances >= 128) is context-coded in reverse order (from LSB to MSB). +For distances >= 128, the 'direct_bits - 4' part is encoded with fixed 0.5 +probability. + +Bit sequence Description +---------------------------------------------------------------------------- +slot distances from 0 to 3 +slot + direct_bits distances from 4 to 127 +slot + (direct_bits - 4) + 4 bits distances from 128 to 2^32 - 1 + + +7.2 The coding contexts +======================= + +These contexts ('Bit_model' in the source), are integers or arrays of +integers representing the probability of the corresponding bit being 0. + + The indices used in these arrays are: + +'state' + A state machine ('State' in the source) with 12 states (0 to 11), + coding the latest 2 to 4 types of sequences processed. The initial + state is 0. + +'pos_state' + Value of the 2 least significant bits of the current position in the + decoded data. + +'literal_state' + Value of the 3 most significant bits of the latest byte decoded. + +'len_state' + Coded value of the current match length (length - 2), with a maximum + of 3. The resulting value is in the range 0 to 3. + + + The types of previous sequences corresponding to each state are shown in +the following table. '!literal' is any sequence except a literal byte. +'rep' is any one of 'rep0', 'rep1', 'rep2', or 'rep3'. The last type in +each line is the most recent. + +State Types of previous sequences +------------------------------------------------------ +0 literal, literal, literal +1 match, literal, literal +2 rep or (!literal, shortrep), literal, literal +3 literal, shortrep, literal, literal +4 match, literal +5 rep or (!literal, shortrep), literal +6 literal, shortrep, literal +7 literal, match +8 literal, rep +9 literal, shortrep +10 !literal, match +11 !literal, (rep or shortrep) + + + The contexts for decoding the type of coding sequence are: + +Name Indices Used when +---------------------------------------------------------------------------- +bm_match state, pos_state sequence start +bm_rep state after sequence 1 +bm_rep0 state after sequence 11 +bm_rep1 state after sequence 111 +bm_rep2 state after sequence 1111 +bm_len state, pos_state after sequence 110 + + + The contexts for decoding distances are: + +Name Indices Used when +---------------------------------------------------------------------------- +bm_dis_slot len_state, bit tree distance start +bm_dis reverse bit tree after slots 4 to 13 +bm_align reverse bit tree for distances >= 128, after fixed + probability bits + + + There are two separate sets of contexts for lengths ('Len_model' in the +source). One for normal matches, the other for repeated matches. The +contexts in each Len_model are (see 'decode_len' in the source): + +Name Indices Used when +--------------------------------------------------------------------------- +choice1 none length start +choice2 none after sequence 1 +bm_low pos_state, bit tree after sequence 0 +bm_mid pos_state, bit tree after sequence 10 +bm_high bit tree after sequence 11 + + + The context array 'bm_literal' is special. In principle it acts as a +normal bit tree context, the one selected by 'literal_state'. But if the +previous decoded byte was not a literal, two other bit tree contexts are +used depending on the value of each bit in 'match_byte' (the byte at the +latest used distance), until a bit is decoded that is different from its +corresponding bit in 'match_byte'. After the first difference is found, the +rest of the byte is decoded using the normal bit tree context. (See +'decode_matched' in the source). + + +7.3 The range decoder +===================== + +The LZMA stream is consumed one byte at a time by the range decoder. (See +'normalize' in the source). Every byte consumed produces a variable number +of decoded bits, depending on how well these bits agree with their context. +(See 'decode_bit' in the source). + + The range decoder state consists of two unsigned 32-bit variables: +'range' (representing the most significant part of the range size not yet +decoded) and 'code' (representing the current point within 'range'). +'range' is initialized to 2^32 - 1, and 'code' is initialized to 0. + + The range encoder produces a first 0 byte that must be ignored by the +range decoder. This is done by shifting 5 bytes in the initialization of +'code' instead of 4. (See the 'Range_decoder' constructor in the source). + + +7.4 Decoding and verifying the LZMA stream +========================================== + +After decoding the member header and obtaining the dictionary size, the +range decoder is initialized and then the LZMA decoder enters a loop (see +'decode_member' in the source) where it invokes the range decoder with the +appropriate contexts to decode the different coding sequences (matches, +repeated matches, and literal bytes), until the "End Of Stream" marker is +decoded. + + Once the "End Of Stream" marker has been decoded, the decompressor reads +and decodes the member trailer, and verifies that the three integrity +factors stored there (CRC, data size, and member size) match those computed +from the data. + + +File: lzip.info, Node: Trailing data, Next: Examples, Prev: Stream format, Up: Top + +8 Extra data appended to the file +********************************* + +Sometimes extra data are found appended to a lzip file after the last +member. Such trailing data may be: + + * Padding added to make the file size a multiple of some block size, for + example when writing to a tape. It is safe to append any amount of + padding zero bytes to a lzip file. + + * Useful data added by the user; a cryptographically secure hash, a + description of file contents, etc. It is safe to append any amount of + text to a lzip file as long as none of the first four bytes of the text + match the corresponding byte in the string "LZIP", and the text does + not contain any zero bytes (null characters). Nonzero bytes and zero + bytes can't be safely mixed in trailing data. + + * Garbage added by some not totally successful copy operation. + + * Malicious data added to the file in order to make its total size and + hash value (for a chosen hash) coincide with those of another file. + + * In rare cases, trailing data could be the corrupt header of another + member. In multimember or concatenated files the probability of + corruption happening in the magic bytes is 5 times smaller than the + probability of getting a false positive caused by the corruption of the + integrity information itself. Therefore it can be considered to be + below the noise level. Additionally, the test used by lzip to + discriminate trailing data from a corrupt header has a Hamming + distance (HD) of 3, and the 3 bit flips must happen in different magic + bytes for the test to fail. In any case, the option '--trailing-error' + guarantees that any corrupt header will be detected. + + Trailing data are in no way part of the lzip file format, but tools +reading lzip files are expected to behave as correctly and usefully as +possible in the presence of trailing data. + + Trailing data can be safely ignored in most cases. In some cases, like +that of user-added data, they are expected to be ignored. In those cases +where a file containing trailing data must be rejected, the option +'--trailing-error' can be used. *Note --trailing-error::. + + +File: lzip.info, Node: Examples, Next: Problems, Prev: Trailing data, Up: Top + +9 A small tutorial with examples +******************************** + +WARNING! Even if lzip is bug-free, other causes may result in a corrupt +compressed file (bugs in the system libraries, memory errors, etc). +Therefore, if the data you are going to compress are important, give the +option '--keep' to lzip and don't remove the original file until you verify +the compressed file with a command like 'lzip -cd file.lz | cmp file -'. +Most RAM errors happening during compression can only be detected by +comparing the compressed file with the original because the corruption +happens before lzip compresses the RAM contents, resulting in a valid +compressed file containing wrong data. + + +Example 1: Extract all the files from archive 'foo.tar.lz'. + + tar -xf foo.tar.lz + or + lzip -cd foo.tar.lz | tar -xf - + + +Example 2: Replace a regular file with its compressed version 'file.lz' and +show the compression ratio. + + lzip -v file + + +Example 3: Like example 2 but the created 'file.lz' is multimember with a +member size of 1 MiB. The compression ratio is not shown. + + lzip -b 1MiB file + + +Example 4: Restore a regular file from its compressed version 'file.lz'. If +the operation is successful, 'file.lz' is removed. + + lzip -d file.lz + + +Example 5: Verify the integrity of the compressed file 'file.lz' and show +status. + + lzip -tv file.lz + + +Example 6: The right way of concatenating the decompressed output of two or +more compressed files. *Note Trailing data::. + + Don't do this + cat file1.lz file2.lz file3.lz | lzip -d - + Do this instead + lzip -cd file1.lz file2.lz file3.lz + + +Example 7: Decompress 'file.lz' partially until 10 KiB of decompressed data +are produced. + + lzip -cd file.lz | dd bs=1024 count=10 + + +Example 8: Decompress 'file.lz' partially from decompressed byte at offset +10000 to decompressed byte at offset 14999 (5000 bytes are produced). + + lzip -cd file.lz | dd bs=1000 skip=10 count=5 + + +Example 9: Compress a whole device in /dev/sdc and send the output to +'file.lz'. + + lzip -c /dev/sdc > file.lz + or + lzip /dev/sdc -o file.lz + + +Example 10: Create a multivolume compressed tar archive with a volume size +of 1440 KiB. + + tar -c some_directory | lzip -S 1440KiB -o volume_name - + + +Example 11: Extract a multivolume compressed tar archive. + + lzip -cd volume_name*.lz | tar -xf - + + +Example 12: Create a multivolume compressed backup of a large database file +with a volume size of 650 MB, where each volume is a multimember file with +a member size of 32 MiB. + + lzip -b 32MiB -S 650MB big_db + + +File: lzip.info, Node: Problems, Next: Reference source code, Prev: Examples, Up: Top + +10 Reporting bugs +***************** + +There are probably bugs in lzip. There are certainly errors and omissions +in this manual. If you report them, they will get fixed. If you don't, no +one will ever know about them and they will remain unfixed for all +eternity, if not longer. + + If you find a bug in lzip, please send electronic mail to +<lzip-bug@nongnu.org>. Include the version number, which you can find by +running 'lzip --version'. + + +File: lzip.info, Node: Reference source code, Next: Concept index, Prev: Problems, Up: Top + +Appendix A Reference source code +******************************** + +/* Lzd - Educational decompressor for the lzip format + Copyright (C) 2013-2022 Antonio Diaz Diaz. + + This program is free software. Redistribution and use in source and + binary forms, with or without modification, are permitted provided + that the following conditions are met: + + 1. Redistributions of source code must retain the above copyright + notice, this list of conditions, and the following disclaimer. + + 2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions, and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. +*/ +/* + Exit status: 0 for a normal exit, 1 for environmental problems + (file not found, invalid flags, I/O errors, etc), 2 to indicate a + corrupt or invalid input file. +*/ + +#include <algorithm> +#include <cerrno> +#include <cstdio> +#include <cstdlib> +#include <cstring> +#include <stdint.h> +#include <unistd.h> +#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__ +#include <fcntl.h> +#include <io.h> +#endif + + +class State + { + int st; + +public: + enum { states = 12 }; + State() : st( 0 ) {} + int operator()() const { return st; } + bool is_char() const { return st < 7; } + + void set_char() + { + const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 }; + st = next[st]; + } + void set_match() { st = ( st < 7 ) ? 7 : 10; } + void set_rep() { st = ( st < 7 ) ? 8 : 11; } + void set_short_rep() { st = ( st < 7 ) ? 9 : 11; } + }; + + +enum { + min_dictionary_size = 1 << 12, + max_dictionary_size = 1 << 29, + literal_context_bits = 3, + literal_pos_state_bits = 0, // not used + pos_state_bits = 2, + pos_states = 1 << pos_state_bits, + pos_state_mask = pos_states - 1, + + len_states = 4, + dis_slot_bits = 6, + start_dis_model = 4, + end_dis_model = 14, + modeled_distances = 1 << ( end_dis_model / 2 ), // 128 + dis_align_bits = 4, + dis_align_size = 1 << dis_align_bits, + + len_low_bits = 3, + len_mid_bits = 3, + len_high_bits = 8, + len_low_symbols = 1 << len_low_bits, + len_mid_symbols = 1 << len_mid_bits, + len_high_symbols = 1 << len_high_bits, + max_len_symbols = len_low_symbols + len_mid_symbols + len_high_symbols, + + min_match_len = 2, // must be 2 + + bit_model_move_bits = 5, + bit_model_total_bits = 11, + bit_model_total = 1 << bit_model_total_bits }; + +struct Bit_model + { + int probability; + Bit_model() : probability( bit_model_total / 2 ) {} + }; + +struct Len_model + { + Bit_model choice1; + Bit_model choice2; + Bit_model bm_low[pos_states][len_low_symbols]; + Bit_model bm_mid[pos_states][len_mid_symbols]; + Bit_model bm_high[len_high_symbols]; + }; + + +class CRC32 + { + uint32_t data[256]; // Table of CRCs of all 8-bit messages. + +public: + CRC32() + { + for( unsigned n = 0; n < 256; ++n ) + { + unsigned c = n; + for( int k = 0; k < 8; ++k ) + { if( c & 1 ) c = 0xEDB88320U ^ ( c >> 1 ); else c >>= 1; } + data[n] = c; + } + } + + void update_buf( uint32_t & crc, const uint8_t * const buffer, + const int size ) const + { + for( int i = 0; i < size; ++i ) + crc = data[(crc^buffer[i])&0xFF] ^ ( crc >> 8 ); + } + }; + +const CRC32 crc32; + + +typedef uint8_t Lzip_header[6]; // 0-3 magic bytes + // 4 version + // 5 coded dictionary size +typedef uint8_t Lzip_trailer[20]; + // 0-3 CRC32 of the uncompressed data + // 4-11 size of the uncompressed data + // 12-19 member size including header and trailer + +class Range_decoder + { + unsigned long long member_pos; + uint32_t code; + uint32_t range; + +public: + Range_decoder() : member_pos( 6 ), code( 0 ), range( 0xFFFFFFFFU ) + { + for( int i = 0; i < 5; ++i ) code = ( code << 8 ) | get_byte(); + } + + uint8_t get_byte() { ++member_pos; return std::getc( stdin ); } + unsigned long long member_position() const { return member_pos; } + + unsigned decode( const int num_bits ) + { + unsigned symbol = 0; + for( int i = num_bits; i > 0; --i ) + { + range >>= 1; + symbol <<= 1; + if( code >= range ) { code -= range; symbol |= 1; } + if( range <= 0x00FFFFFFU ) // normalize + { range <<= 8; code = ( code << 8 ) | get_byte(); } + } + return symbol; + } + + unsigned decode_bit( Bit_model & bm ) + { + unsigned symbol; + const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability; + if( code < bound ) + { + range = bound; + bm.probability += + ( bit_model_total - bm.probability ) >> bit_model_move_bits; + symbol = 0; + } + else + { + range -= bound; + code -= bound; + bm.probability -= bm.probability >> bit_model_move_bits; + symbol = 1; + } + if( range <= 0x00FFFFFFU ) // normalize + { range <<= 8; code = ( code << 8 ) | get_byte(); } + return symbol; + } + + unsigned decode_tree( Bit_model bm[], const int num_bits ) + { + unsigned symbol = 1; + for( int i = 0; i < num_bits; ++i ) + symbol = ( symbol << 1 ) | decode_bit( bm[symbol] ); + return symbol - ( 1 << num_bits ); + } + + unsigned decode_tree_reversed( Bit_model bm[], const int num_bits ) + { + unsigned symbol = decode_tree( bm, num_bits ); + unsigned reversed_symbol = 0; + for( int i = 0; i < num_bits; ++i ) + { + reversed_symbol = ( reversed_symbol << 1 ) | ( symbol & 1 ); + symbol >>= 1; + } + return reversed_symbol; + } + + unsigned decode_matched( Bit_model bm[], const unsigned match_byte ) + { + unsigned symbol = 1; + for( int i = 7; i >= 0; --i ) + { + const unsigned match_bit = ( match_byte >> i ) & 1; + const unsigned bit = decode_bit( bm[symbol+(match_bit<<8)+0x100] ); + symbol = ( symbol << 1 ) | bit; + if( match_bit != bit ) + { + while( symbol < 0x100 ) + symbol = ( symbol << 1 ) | decode_bit( bm[symbol] ); + break; + } + } + return symbol & 0xFF; + } + + unsigned decode_len( Len_model & lm, const int pos_state ) + { + if( decode_bit( lm.choice1 ) == 0 ) + return decode_tree( lm.bm_low[pos_state], len_low_bits ); + if( decode_bit( lm.choice2 ) == 0 ) + return len_low_symbols + + decode_tree( lm.bm_mid[pos_state], len_mid_bits ); + return len_low_symbols + len_mid_symbols + + decode_tree( lm.bm_high, len_high_bits ); + } + }; + + +class LZ_decoder + { + unsigned long long partial_data_pos; + Range_decoder rdec; + const unsigned dictionary_size; + uint8_t * const buffer; // output buffer + unsigned pos; // current pos in buffer + unsigned stream_pos; // first byte not yet written to stdout + uint32_t crc_; + bool pos_wrapped; + + void flush_data(); + + uint8_t peek( const unsigned distance ) const + { + if( pos > distance ) return buffer[pos - distance - 1]; + if( pos_wrapped ) return buffer[dictionary_size + pos - distance - 1]; + return 0; // prev_byte of first byte + } + + void put_byte( const uint8_t b ) + { + buffer[pos] = b; + if( ++pos >= dictionary_size ) flush_data(); + } + +public: + explicit LZ_decoder( const unsigned dict_size ) + : + partial_data_pos( 0 ), + dictionary_size( dict_size ), + buffer( new uint8_t[dictionary_size] ), + pos( 0 ), + stream_pos( 0 ), + crc_( 0xFFFFFFFFU ), + pos_wrapped( false ) + {} + + ~LZ_decoder() { delete[] buffer; } + + unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; } + unsigned long long data_position() const + { return partial_data_pos + pos; } + uint8_t get_byte() { return rdec.get_byte(); } + unsigned long long member_position() const + { return rdec.member_position(); } + + bool decode_member(); + }; + + +void LZ_decoder::flush_data() + { + if( pos > stream_pos ) + { + const unsigned size = pos - stream_pos; + crc32.update_buf( crc_, buffer + stream_pos, size ); + if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size ) + { std::fprintf( stderr, "Write error: %s\n", std::strerror( errno ) ); + std::exit( 1 ); } + if( pos >= dictionary_size ) + { partial_data_pos += pos; pos = 0; pos_wrapped = true; } + stream_pos = pos; + } + } + + +bool LZ_decoder::decode_member() // Returns false if error + { + Bit_model bm_literal[1<<literal_context_bits][0x300]; + Bit_model bm_match[State::states][pos_states]; + Bit_model bm_rep[State::states]; + Bit_model bm_rep0[State::states]; + Bit_model bm_rep1[State::states]; + Bit_model bm_rep2[State::states]; + Bit_model bm_len[State::states][pos_states]; + Bit_model bm_dis_slot[len_states][1<<dis_slot_bits]; + Bit_model bm_dis[modeled_distances-end_dis_model+1]; + Bit_model bm_align[dis_align_size]; + Len_model match_len_model; + Len_model rep_len_model; + unsigned rep0 = 0; // rep[0-3] latest four distances + unsigned rep1 = 0; // used for efficient coding of + unsigned rep2 = 0; // repeated distances + unsigned rep3 = 0; + State state; + + while( !std::feof( stdin ) && !std::ferror( stdin ) ) + { + const int pos_state = data_position() & pos_state_mask; + if( rdec.decode_bit( bm_match[state()][pos_state] ) == 0 ) // 1st bit + { + // literal byte + const uint8_t prev_byte = peek( 0 ); + const int literal_state = prev_byte >> ( 8 - literal_context_bits ); + Bit_model * const bm = bm_literal[literal_state]; + if( state.is_char() ) + put_byte( rdec.decode_tree( bm, 8 ) ); + else + put_byte( rdec.decode_matched( bm, peek( rep0 ) ) ); + state.set_char(); + continue; + } + // match or repeated match + int len; + if( rdec.decode_bit( bm_rep[state()] ) != 0 ) // 2nd bit + { + if( rdec.decode_bit( bm_rep0[state()] ) == 0 ) // 3rd bit + { + if( rdec.decode_bit( bm_len[state()][pos_state] ) == 0 ) // 4th bit + { state.set_short_rep(); put_byte( peek( rep0 ) ); continue; } + } + else + { + unsigned distance; + if( rdec.decode_bit( bm_rep1[state()] ) == 0 ) // 4th bit + distance = rep1; + else + { + if( rdec.decode_bit( bm_rep2[state()] ) == 0 ) // 5th bit + distance = rep2; + else + { distance = rep3; rep3 = rep2; } + rep2 = rep1; + } + rep1 = rep0; + rep0 = distance; + } + state.set_rep(); + len = min_match_len + rdec.decode_len( rep_len_model, pos_state ); + } + else // match + { + rep3 = rep2; rep2 = rep1; rep1 = rep0; + len = min_match_len + rdec.decode_len( match_len_model, pos_state ); + const int len_state = std::min( len - min_match_len, len_states - 1 ); + rep0 = rdec.decode_tree( bm_dis_slot[len_state], dis_slot_bits ); + if( rep0 >= start_dis_model ) + { + const unsigned dis_slot = rep0; + const int direct_bits = ( dis_slot >> 1 ) - 1; + rep0 = ( 2 | ( dis_slot & 1 ) ) << direct_bits; + if( dis_slot < end_dis_model ) + rep0 += rdec.decode_tree_reversed( bm_dis + ( rep0 - dis_slot ), + direct_bits ); + else + { + rep0 += + rdec.decode( direct_bits - dis_align_bits ) << dis_align_bits; + rep0 += rdec.decode_tree_reversed( bm_align, dis_align_bits ); + if( rep0 == 0xFFFFFFFFU ) // marker found + { + flush_data(); + return ( len == min_match_len ); // End Of Stream marker + } + } + } + state.set_match(); + if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) ) + { flush_data(); return false; } + } + for( int i = 0; i < len; ++i ) put_byte( peek( rep0 ) ); + } + flush_data(); + return false; + } + + +int main( const int argc, const char * const argv[] ) + { + if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) ) + { + std::printf( + "Lzd %s - Educational decompressor for the lzip format.\n" + "Study the source to learn how a lzip decompressor works.\n" + "See the lzip manual for an explanation of the code.\n" + "\nUsage: %s [-d] < file.lz > file\n" + "Lzd decompresses from standard input to standard output.\n" + "\nCopyright (C) 2022 Antonio Diaz Diaz.\n" + "License 2-clause BSD.\n" + "This is free software: you are free to change and redistribute it.\n" + "There is NO WARRANTY, to the extent permitted by law.\n" + "Report bugs to lzip-bug@nongnu.org\n" + "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n", + PROGVERSION, argv[0] ); + return 0; + } + +#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__ + setmode( STDIN_FILENO, O_BINARY ); + setmode( STDOUT_FILENO, O_BINARY ); +#endif + + for( bool first_member = true; ; first_member = false ) + { + Lzip_header header; // verify header + for( int i = 0; i < 6; ++i ) header[i] = std::getc( stdin ); + if( std::feof( stdin ) || std::memcmp( header, "LZIP\x01", 5 ) != 0 ) + { + if( first_member ) + { std::fputs( "Bad magic number (file not in lzip format).\n", + stderr ); return 2; } + break; // ignore trailing data + } + unsigned dict_size = 1 << ( header[5] & 0x1F ); + dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 ); + if( dict_size < min_dictionary_size || dict_size > max_dictionary_size ) + { std::fputs( "Invalid dictionary size in member header.\n", stderr ); + return 2; } + + LZ_decoder decoder( dict_size ); // decode LZMA stream + if( !decoder.decode_member() ) + { std::fputs( "Data error\n", stderr ); return 2; } + + Lzip_trailer trailer; // verify trailer + for( int i = 0; i < 20; ++i ) trailer[i] = decoder.get_byte(); + int retval = 0; + unsigned crc = 0; + for( int i = 3; i >= 0; --i ) crc = ( crc << 8 ) + trailer[i]; + if( crc != decoder.crc() ) + { std::fputs( "CRC mismatch\n", stderr ); retval = 2; } + + unsigned long long data_size = 0; + for( int i = 11; i >= 4; --i ) + data_size = ( data_size << 8 ) + trailer[i]; + if( data_size != decoder.data_position() ) + { std::fputs( "Data size mismatch\n", stderr ); retval = 2; } + + unsigned long long member_size = 0; + for( int i = 19; i >= 12; --i ) + member_size = ( member_size << 8 ) + trailer[i]; + if( member_size != decoder.member_position() ) + { std::fputs( "Member size mismatch\n", stderr ); retval = 2; } + if( retval ) return retval; + } + + if( std::fclose( stdout ) != 0 ) + { std::fprintf( stderr, "Error closing stdout: %s\n", + std::strerror( errno ) ); return 1; } + return 0; + } + + +File: lzip.info, Node: Concept index, Prev: Reference source code, Up: Top + +Concept index +************* + + +* Menu: + +* algorithm: Algorithm. (line 6) +* bugs: Problems. (line 6) +* examples: Examples. (line 6) +* file format: File format. (line 6) +* format of the LZMA stream: Stream format. (line 6) +* getting help: Problems. (line 6) +* introduction: Introduction. (line 6) +* invoking: Invoking lzip. (line 6) +* options: Invoking lzip. (line 6) +* output: Output. (line 6) +* quality assurance: Quality assurance. (line 6) +* reference source code: Reference source code. (line 6) +* trailing data: Trailing data. (line 6) +* usage: Invoking lzip. (line 6) +* version: Invoking lzip. (line 6) + + + +Tag Table: +Node: Top203 +Node: Introduction1198 +Node: Output6972 +Node: Invoking lzip8567 +Ref: --trailing-error9356 +Node: Quality assurance18682 +Node: Algorithm27705 +Node: File format31109 +Ref: coded-dict-size32538 +Node: Stream format33773 +Ref: what-is-coded36169 +Node: Trailing data45097 +Node: Examples47358 +Ref: concat-example48800 +Node: Problems50021 +Node: Reference source code50553 +Node: Concept index65411 + +End Tag Table + + +Local Variables: +coding: iso-8859-15 +End: diff --git a/doc/lzip.texi b/doc/lzip.texi new file mode 100644 index 0000000..144b525 --- /dev/null +++ b/doc/lzip.texi @@ -0,0 +1,1783 @@ +\input texinfo @c -*-texinfo-*- +@c %**start of header +@setfilename lzip.info +@documentencoding ISO-8859-15 +@settitle Lzip Manual +@finalout +@c %**end of header + +@set UPDATED 24 January 2022 +@set VERSION 1.23 + +@dircategory Compression +@direntry +* Lzip: (lzip). LZMA lossless data compressor +@end direntry + + +@ifnothtml +@titlepage +@title Lzip +@subtitle LZMA lossless data compressor +@subtitle for Lzip version @value{VERSION}, @value{UPDATED} +@author by Antonio Diaz Diaz + +@page +@vskip 0pt plus 1filll +@end titlepage + +@contents +@end ifnothtml + +@ifnottex +@node Top +@top + +This manual is for Lzip (version @value{VERSION}, @value{UPDATED}). + +@menu +* Introduction:: Purpose and features of lzip +* Output:: Meaning of lzip's output +* Invoking lzip:: Command line interface +* Quality assurance:: Design, development, and testing of lzip +* Algorithm:: How lzip compresses the data +* File format:: Detailed format of the compressed file +* Stream format:: Format of the LZMA stream in lzip files +* Trailing data:: Extra data appended to the file +* Examples:: A small tutorial with examples +* Problems:: Reporting bugs +* Reference source code:: Source code illustrating stream format +* Concept index:: Index of concepts +@end menu + +@sp 1 +Copyright @copyright{} 2008-2022 Antonio Diaz Diaz. + +This manual is free documentation: you have unlimited permission to copy, +distribute, and modify it. +@end ifnottex + + +@node Introduction +@chapter Introduction +@cindex introduction + +@uref{http://www.nongnu.org/lzip/lzip.html,,Lzip} +is a lossless data compressor with a user interface similar to the one +of gzip or bzip2. Lzip uses a simplified form of the 'Lempel-Ziv-Markov +chain-Algorithm' (LZMA) stream format and provides a 3 factor integrity +checking to maximize interoperability and optimize safety. Lzip can compress +about as fast as gzip @w{(lzip -0)} or compress most files more than bzip2 +@w{(lzip -9)}. Decompression speed is intermediate between gzip and bzip2. +Lzip is better than gzip and bzip2 from a data recovery perspective. Lzip +has been designed, written, and tested with great care to replace gzip and +bzip2 as the standard general-purpose compressed format for unix-like +systems. + +For compressing/decompressing large files on multiprocessor machines +@uref{http://www.nongnu.org/lzip/manual/plzip_manual.html,,plzip} can be +much faster than lzip at the cost of a slightly reduced compression ratio. +@ifnothtml +@xref{Top,plzip manual,,plzip}. +@end ifnothtml + +For creation and manipulation of compressed tar archives +@uref{http://www.nongnu.org/lzip/manual/tarlz_manual.html,,tarlz} can be more +efficient than using tar and plzip because tarlz is able to keep the +alignment between tar members and lzip members. +@ifnothtml +@xref{Top,tarlz manual,,tarlz}. +@end ifnothtml + +The lzip file format is designed for data sharing and long-term archiving, +taking into account both data integrity and decoder availability: + +@itemize @bullet +@item +The lzip format provides very safe integrity checking and some data +recovery means. The program +@uref{http://www.nongnu.org/lzip/manual/lziprecover_manual.html#Data-safety,,lziprecover} +can repair bit flip errors (one of the most common forms of data corruption) +in lzip files, and provides data recovery capabilities, including +error-checked merging of damaged copies of a file. +@ifnothtml +@xref{Data safety,,,lziprecover}. +@end ifnothtml + +@item +The lzip format is as simple as possible (but not simpler). The lzip +manual provides the source code of a simple decompressor along with a +detailed explanation of how it works, so that with the only help of the +lzip manual it would be possible for a digital archaeologist to extract +the data from a lzip file long after quantum computers eventually +render LZMA obsolete. + +@item +Additionally the lzip reference implementation is copylefted, which +guarantees that it will remain free forever. +@end itemize + +A nice feature of the lzip format is that a corrupt byte is easier to repair +the nearer it is from the beginning of the file. Therefore, with the help of +lziprecover, losing an entire archive just because of a corrupt byte near +the beginning is a thing of the past. + +The member trailer stores the 32-bit CRC of the original data, the size +of the original data, and the size of the member. These values, together +with the "End Of Stream" marker, provide a 3 factor integrity checking +which guarantees that the decompressed version of the data is identical +to the original. This guards against corruption of the compressed data, +and against undetected bugs in lzip (hopefully very unlikely). The +chances of data corruption going undetected are microscopic. Be aware, +though, that the check occurs upon decompression, so it can only tell +you that something is wrong. It can't help you recover the original +uncompressed data. + +Lzip uses the same well-defined exit status values used by bzip2, which +makes it safer than compressors returning ambiguous warning values (like +gzip) when it is used as a back end for other programs like tar or zutils. + +Lzip will automatically use for each file the largest dictionary size that +does not exceed neither the file size nor the limit given. Keep in mind that +the decompression memory requirement is affected at compression time by the +choice of dictionary size limit. + +The amount of memory required for compression is about 1 or 2 times the +dictionary size limit (1 if input file size is less than dictionary size +limit, else 2) plus 9 times the dictionary size really used. The option +@samp{-0} is special and only requires about @w{1.5 MiB} at most. The +amount of memory required for decompression is about @w{46 kB} larger +than the dictionary size really used. + +When compressing, lzip replaces every file given in the command line +with a compressed version of itself, with the name "original_name.lz". +When decompressing, lzip attempts to guess the name for the decompressed +file from that of the compressed file as follows: + +@multitable {anyothername} {becomes} {anyothername.out} +@item filename.lz @tab becomes @tab filename +@item filename.tlz @tab becomes @tab filename.tar +@item anyothername @tab becomes @tab anyothername.out +@end multitable + +(De)compressing a file is much like copying or moving it. Therefore lzip +preserves the access and modification dates, permissions, and, when +possible, ownership of the file just as @w{@samp{cp -p}} does. (If the user ID or +the group ID can't be duplicated, the file permission bits S_ISUID and +S_ISGID are cleared). + +Lzip is able to read from some types of non-regular files if either the +option @samp{-c} or the option @samp{-o} is specified. + +Lzip will refuse to read compressed data from a terminal or write compressed +data to a terminal, as this would be entirely incomprehensible and might +leave the terminal in an abnormal state. + +Lzip will correctly decompress a file which is the concatenation of two or +more compressed files. The result is the concatenation of the corresponding +decompressed files. Integrity testing of concatenated compressed files is +also supported. + +Lzip can produce multimember files, and lziprecover can safely recover the +undamaged members in case of file damage. Lzip can also split the compressed +output in volumes of a given size, even when reading from standard input. +This allows the direct creation of multivolume compressed tar archives. + +Lzip is able to compress and decompress streams of unlimited size by +automatically creating multimember output. The members so created are large, +about @w{2 PiB} each. + + +@node Output +@chapter Meaning of lzip's output +@cindex output + +The output of lzip looks like this: + +@example +lzip -v foo + foo: 6.676:1, 14.98% ratio, 85.02% saved, 450560 in, 67493 out. + +lzip -tvvv foo.lz + foo.lz: 6.676:1, 14.98% ratio, 85.02% saved. 450560 out, 67493 in. ok +@end example + +The meaning of each field is as follows: + +@table @code +@item N:1 +The compression ratio @w{(uncompressed_size / compressed_size)}, shown as +@w{N to 1}. + +@item ratio +The inverse compression ratio @w{(compressed_size / uncompressed_size)}, +shown as a percentage. A decimal ratio is easily obtained by moving the +decimal point two places to the left; @w{14.98% = 0.1498}. + +@item saved +The space saved by compression @w{(1 - ratio)}, shown as a percentage. + +@item in +Size of the input data. This is the uncompressed size when compressing, or +the compressed size when decompressing or testing. Note that lzip always +prints the uncompressed size before the compressed size when compressing, +decompressing, testing, or listing. + +@item out +Size of the output data. This is the compressed size when compressing, or +the decompressed size when decompressing or testing. + +@end table + +When decompressing or testing at verbosity level 4 (-vvvv), the dictionary +size used to compress the file and the CRC32 of the uncompressed data are +also shown. + +LANGUAGE NOTE: Uncompressed = not compressed = plain data; it may never have +been compressed. Decompressed is used to refer to data which have undergone +the process of decompression. + + +@node Invoking lzip +@chapter Invoking lzip +@cindex invoking +@cindex options +@cindex usage +@cindex version + +The format for running lzip is: + +@example +lzip [@var{options}] [@var{files}] +@end example + +@noindent +If no file names are specified, lzip compresses (or decompresses) from +standard input to standard output. A hyphen @samp{-} used as a @var{file} +argument means standard input. It can be mixed with other @var{files} and is +read just once, the first time it appears in the command line. + +lzip supports the following +@uref{http://www.nongnu.org/arg-parser/manual/arg_parser_manual.html#Argument-syntax,,options}: +@ifnothtml +@xref{Argument syntax,,,arg_parser}. +@end ifnothtml + +@table @code +@item -h +@itemx --help +Print an informative help message describing the options and exit. + +@item -V +@itemx --version +Print the version number of lzip on the standard output and exit. +This version number should be included in all bug reports. + +@anchor{--trailing-error} +@item -a +@itemx --trailing-error +Exit with error status 2 if any remaining input is detected after +decompressing the last member. Such remaining input is usually trailing +garbage that can be safely ignored. @xref{concat-example}. + +@item -b @var{bytes} +@itemx --member-size=@var{bytes} +When compressing, set the member size limit to @var{bytes}. It is advisable +to keep members smaller than RAM size so that they can be repaired with +lziprecover in case of corruption. A small member size may degrade +compression ratio, so use it only when needed. Valid values range from +@w{100 kB} to @w{2 PiB}. Defaults to @w{2 PiB}. + +@item -c +@itemx --stdout +Compress or decompress to standard output; keep input files unchanged. If +compressing several files, each file is compressed independently. (The +output consists of a sequence of independently compressed members). This +option (or @samp{-o}) is needed when reading from a named pipe (fifo) or +from a device. Use it also to recover as much of the decompressed data as +possible when decompressing a corrupt file. @samp{-c} overrides @samp{-o} +and @samp{-S}. @samp{-c} has no effect when testing or listing. + +@item -d +@itemx --decompress +Decompress the files specified. If a file does not exist, can't be opened, +or the destination file already exists and @samp{--force} has not been +specified, lzip continues decompressing the rest of the files and exits with +error status 1. If a file fails to decompress, or is a terminal, lzip exits +immediately with error status 2 without decompressing the rest of the files. +A terminal is considered an uncompressed file, and therefore invalid. + +@item -f +@itemx --force +Force overwrite of output files. + +@item -F +@itemx --recompress +When compressing, force re-compression of files whose name already has +the @samp{.lz} or @samp{.tlz} suffix. + +@item -k +@itemx --keep +Keep (don't delete) input files during compression or decompression. + +@item -l +@itemx --list +Print the uncompressed size, compressed size, and percentage saved of the +files specified. Trailing data are ignored. The values produced are correct +even for multimember files. If more than one file is given, a final line +containing the cumulative sizes is printed. With @samp{-v}, the dictionary +size, the number of members in the file, and the amount of trailing data (if +any) are also printed. With @samp{-vv}, the positions and sizes of each +member in multimember files are also printed. + +If any file is damaged, does not exist, can't be opened, or is not regular, +the final exit status will be @w{> 0}. @samp{-lq} can be used to verify +quickly (without decompressing) the structural integrity of the files +specified. (Use @samp{--test} to verify the data integrity). @samp{-alq} +additionally verifies that none of the files specified contain trailing data. + +@item -m @var{bytes} +@itemx --match-length=@var{bytes} +When compressing, set the match length limit in bytes. After a match +this long is found, the search is finished. Valid values range from 5 to +273. Larger values usually give better compression ratios but longer +compression times. + +@item -o @var{file} +@itemx --output=@var{file} +If @samp{-c} has not been also specified, write the (de)compressed output to +@var{file}; keep input files unchanged. If compressing several files, each +file is compressed independently. (The output consists of a sequence of +independently compressed members). This option (or @samp{-c}) is needed when +reading from a named pipe (fifo) or from a device. @w{@samp{-o -}} is +equivalent to @samp{-c}. @samp{-o} has no effect when testing or listing. + +In order to keep backward compatibility with lzip versions prior to 1.22, +when compressing from standard input and no other file names are given, the +extension @samp{.lz} is appended to @var{file} unless it already ends in +@samp{.lz} or @samp{.tlz}. This feature will be removed in a future version +of lzip. Meanwhile, redirection may be used instead of @samp{-o} to write +the compressed output to a file without the extension @samp{.lz} in its +name: @w{@samp{lzip < file > foo}}. + +When compressing and splitting the output in volumes, @var{file} is used as +a prefix, and several files named @samp{@var{file}00001.lz}, +@samp{@var{file}00002.lz}, etc, are created. In this case, only one input +file is allowed. + +@item -q +@itemx --quiet +Quiet operation. Suppress all messages. + +@item -s @var{bytes} +@itemx --dictionary-size=@var{bytes} +When compressing, set the dictionary size limit in bytes. Lzip will use +for each file the largest dictionary size that does not exceed neither +the file size nor this limit. Valid values range from @w{4 KiB} to +@w{512 MiB}. Values 12 to 29 are interpreted as powers of two, meaning +2^12 to 2^29 bytes. Dictionary sizes are quantized so that they can be +coded in just one byte (@pxref{coded-dict-size}). If the size specified +does not match one of the valid sizes, it will be rounded upwards by +adding up to @w{(@var{bytes} / 8)} to it. + +For maximum compression you should use a dictionary size limit as large +as possible, but keep in mind that the decompression memory requirement +is affected at compression time by the choice of dictionary size limit. + +@item -S @var{bytes} +@itemx --volume-size=@var{bytes} +When compressing, and @samp{-c} has not been also specified, split the +compressed output into several volume files with names +@samp{original_name00001.lz}, @samp{original_name00002.lz}, etc, and set the +volume size limit to @var{bytes}. Input files are kept unchanged. Each +volume is a complete, maybe multimember, lzip file. A small volume size may +degrade compression ratio, so use it only when needed. Valid values range +from @w{100 kB} to @w{4 EiB}. + +@item -t +@itemx --test +Check integrity of the files specified, but don't decompress them. This +really performs a trial decompression and throws away the result. Use it +together with @samp{-v} to see information about the files. If a file +fails the test, does not exist, can't be opened, or is a terminal, lzip +continues checking the rest of the files. A final diagnostic is shown at +verbosity level 1 or higher if any file fails the test when testing +multiple files. + +@item -v +@itemx --verbose +Verbose mode.@* +When compressing, show the compression ratio and size for each file +processed.@* +When decompressing or testing, further -v's (up to 4) increase the +verbosity level, showing status, compression ratio, dictionary size, +trailer contents (CRC, data size, member size), and up to 6 bytes of +trailing data (if any) both in hexadecimal and as a string of printable +ASCII characters.@* +Two or more @samp{-v} options show the progress of (de)compression. + +@item -0 .. -9 +Compression level. Set the compression parameters (dictionary size and +match length limit) as shown in the table below. The default compression +level is @samp{-6}, equivalent to @w{@samp{-s8MiB -m36}}. Note that +@samp{-9} can be much slower than @samp{-0}. These options have no +effect when decompressing, testing, or listing. + +The bidimensional parameter space of LZMA can't be mapped to a linear +scale optimal for all files. If your files are large, very repetitive, +etc, you may need to use the options @samp{--dictionary-size} and +@samp{--match-length} directly to achieve optimal performance. + +If several compression levels or @samp{-s} or @samp{-m} options are +given, the last setting is used. For example @w{@samp{-9 -s64MiB}} is +equivalent to @w{@samp{-s64MiB -m273}} + +@multitable {Level} {Dictionary size (-s)} {Match length limit (-m)} +@item Level @tab Dictionary size (-s) @tab Match length limit (-m) +@item -0 @tab 64 KiB @tab 16 bytes +@item -1 @tab 1 MiB @tab 5 bytes +@item -2 @tab 1.5 MiB @tab 6 bytes +@item -3 @tab 2 MiB @tab 8 bytes +@item -4 @tab 3 MiB @tab 12 bytes +@item -5 @tab 4 MiB @tab 20 bytes +@item -6 @tab 8 MiB @tab 36 bytes +@item -7 @tab 16 MiB @tab 68 bytes +@item -8 @tab 24 MiB @tab 132 bytes +@item -9 @tab 32 MiB @tab 273 bytes +@end multitable + +@item --fast +@itemx --best +Aliases for GNU gzip compatibility. + +@item --loose-trailing +When decompressing, testing, or listing, allow trailing data whose first +bytes are so similar to the magic bytes of a lzip header that they can +be confused with a corrupt header. Use this option if a file triggers a +"corrupt header" error and the cause is not indeed a corrupt header. + +@end table + +Numbers given as arguments to options may be followed by a multiplier +and an optional @samp{B} for "byte". + +Table of SI and binary prefixes (unit multipliers): + +@multitable {Prefix} {kilobyte (10^3 = 1000)} {|} {Prefix} {kibibyte (2^10 = 1024)} +@item Prefix @tab Value @tab | @tab Prefix @tab Value +@item k @tab kilobyte (10^3 = 1000) @tab | @tab Ki @tab kibibyte (2^10 = 1024) +@item M @tab megabyte (10^6) @tab | @tab Mi @tab mebibyte (2^20) +@item G @tab gigabyte (10^9) @tab | @tab Gi @tab gibibyte (2^30) +@item T @tab terabyte (10^12) @tab | @tab Ti @tab tebibyte (2^40) +@item P @tab petabyte (10^15) @tab | @tab Pi @tab pebibyte (2^50) +@item E @tab exabyte (10^18) @tab | @tab Ei @tab exbibyte (2^60) +@item Z @tab zettabyte (10^21) @tab | @tab Zi @tab zebibyte (2^70) +@item Y @tab yottabyte (10^24) @tab | @tab Yi @tab yobibyte (2^80) +@end multitable + +@sp 1 +Exit status: 0 for a normal exit, 1 for environmental problems (file not +found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or invalid +input file, 3 for an internal consistency error (e.g., bug) which caused +lzip to panic. + + +@node Quality assurance +@chapter Design, development, and testing of lzip +@cindex quality assurance + +There are two ways of constructing a software design: One way is to make it +so simple that there are obviously no deficiencies and the other way is to +make it so complicated that there are no obvious deficiencies. The first +method is far more difficult.@* +--- C.A.R. Hoare + +Lzip is developed by volunteers who lack the resources required for +extensive testing in all circumstances. It is up to you to test lzip before +using it in mission-critical applications. However, a compressor like lzip +is not a toy, and maintaining it is not a hobby. Many people's data depend +on it. Therefore the lzip file format has been reviewed carefully and is +believed to be free from negligent design errors. + +Lzip has been designed, written, and tested with great care to replace gzip +and bzip2 as the standard general-purpose compressed format for unix-like +systems. This chapter describes the lessons learned from these previous +formats, and their application to the design of lzip. + +@sp 1 +@section Format design + +When gzip was designed in 1992, computers and operating systems were much +less capable than they are today. The designers of gzip tried to work around +some of those limitations, like 8.3 file names, with additional fields in +the file format. + +Today those limitations have mostly disappeared, and the format of gzip has +proved to be unnecessarily complicated. It includes fields that were never +used, others that have lost their usefulness, and finally others that have +become too limited. + +Bzip2 was designed 5 years later, and its format is simpler than the one of +gzip. + +Probably the worst defect of the gzip format from the point of view of data +safety is the variable size of its header. If the byte at offset 3 (flags) +of a gzip member gets corrupted, it may become difficult to recover the +data, even if the compressed blocks are intact, because it can't be known +with certainty where the compressed blocks begin. + +By contrast, the header of a lzip member has a fixed length of 6. The LZMA +stream in a lzip member always starts at offset 6, making it trivial to +recover the data even if the whole header becomes corrupt. + +Bzip2 also provides a header of fixed length and marks the begin and end of +each compressed block with six magic bytes, making it possible to find the +compressed blocks even in case of file damage. But bzip2 does not store the +size of each compressed block, as lzip does. + +Lziprecover is able to provide unique data recovery capabilities because the +lzip format is extraordinarily safe. The simple and safe design of the file +format complements the embedded error detection provided by the LZMA data +stream. Any distance larger than the dictionary size acts as a forbidden +symbol, allowing the decompressor to detect the approximate position of +errors, and leaving very little work for the check sequence (CRC and data +sizes) in the detection of errors. Lzip is usually able to detect all +possible bit flips in the compressed data without resorting to the check +sequence. It would be difficult to write an automatic recovery tool like +lziprecover for the gzip format. And, as far as I know, it has never been +written. + +Lzip, like gzip and bzip2, uses a CRC32 to check the integrity of the +decompressed data because it provides optimal accuracy in the detection of +errors up to a compressed size of about @w{16 GiB}, a size larger than that +of most files. In the case of lzip, the additional detection capability of +the decompressor reduces the probability of undetected errors several +million times more, resulting in a combined integrity checking optimally +accurate for any member size produced by lzip. Preliminary results suggest +that the lzip format is safe enough to be used in critical safety avionics +systems. + +The lzip format is designed for long-term archiving. Therefore it excludes +any unneeded features that may interfere with the future extraction of the +decompressed data. + +@sp 1 +@subsection Gzip format (mis)features not present in lzip + +@table @samp +@item Multiple algorithms + +Gzip provides a CM (Compression Method) field that has never been used +because it is a bad idea to begin with. New compression methods may require +additional fields, making it impossible to implement new methods and, at the +same time, keep the same format. This field does not solve the problem of +format proliferation; it just makes the problem less obvious. + +@item Optional fields in header + +Unless special precautions are taken, optional fields are generally a bad +idea because they produce a header of variable size. The gzip header has 2 +fields that, in addition to being optional, are zero-terminated. This means +that if any byte inside the field gets zeroed, or if the terminating zero +gets altered, gzip won't be able to find neither the header CRC nor the +compressed blocks. + +@item Optional CRC for the header + +Using an optional CRC for the header is not only a bad idea, it is an error; +it circumvents the Hamming distance (HD) of the CRC and may prevent the +extraction of perfectly good data. For example, if the CRC is used and the +bit enabling it is reset by a bit flip, the header will appear to be intact +(in spite of being corrupt) while the compressed blocks will appear to be +totally unrecoverable (in spite of being intact). Very misleading indeed. + +@item Metadata + +The gzip format stores some metadata, like the modification time of the +original file or the operating system on which compression took place. This +complicates reproducible compression (obtaining identical compressed output +from identical input). + +@end table + +@subsection Lzip format improvements over gzip and bzip2 + +@table @samp +@item 64-bit size field + +Probably the most frequently reported shortcoming of the gzip format is that +it only stores the least significant 32 bits of the uncompressed size. The +size of any file larger than @w{4 GiB} gets truncated. + +Bzip2 does not store the uncompressed size of the file. + +The lzip format provides a 64-bit field for the uncompressed size. +Additionally, lzip produces multimember output automatically when the size +is too large for a single member, allowing for an unlimited uncompressed +size. + +@item Distributed index + +The lzip format provides a distributed index that, among other things, helps +plzip to decompress several times faster than pigz and helps lziprecover do +its job. Neither the gzip format nor the bzip2 format do provide an index. + +A distributed index is safer and more scalable than a monolithic index. The +monolithic index introduces a single point of failure in the compressed file +and may limit the number of members or the total uncompressed size. + +@end table + +@section Quality of implementation + +@table @samp +@item Accurate and robust error detection + +The lzip format provides 3 factor integrity checking, and the decompressors +report mismatches in each factor separately. This method detects most false +positives for corruption. If just one byte in one factor fails but the other +two factors match the data, it probably means that the data are intact and +the corruption just affects the mismatching factor (CRC, data size, or +member size) in the member trailer. + +@item Multiple implementations + +Just like the lzip format provides 3 factor protection against undetected +data corruption, the development methodology of the lzip family of +compressors provides 3 factor protection against undetected programming +errors. + +Three related but independent compressor implementations, lzip, clzip, and +minilzip/lzlib, are developed concurrently. Every stable release of any of +them is tested to verify that it produces identical output to the other two. +This guarantees that all three implement the same algorithm, and makes it +unlikely that any of them may contain serious undiscovered errors. In fact, +no errors have been discovered in lzip since 2009. + +Additionally, the three implementations have been extensively tested with +@uref{http://www.nongnu.org/lzip/manual/lziprecover_manual.html#Unzcrash,,unzcrash}, +valgrind, and @samp{american fuzzy lop} without finding a single +vulnerability or false negative. +@ifnothtml +@xref{Unzcrash,,,lziprecover}. +@end ifnothtml + +@item Dictionary size + +Lzip automatically adapts the dictionary size to the size of each file. +In addition to reducing the amount of memory required for decompression, +this feature also minimizes the probability of being affected by RAM errors +during compression. @c key4_mask + +@item Exit status + +Returning a warning status of 2 is a design flaw of compress that leaked +into the design of gzip. Both bzip2 and lzip are free from this flaw. + +@end table + + +@node Algorithm +@chapter Algorithm +@cindex algorithm + +In spite of its name (Lempel-Ziv-Markov chain-Algorithm), LZMA is not a +concrete algorithm; it is more like "any algorithm using the LZMA coding +scheme". LZMA compression consists in describing the uncompressed data as a +succession of coding sequences from the set shown in Section @samp{What is +coded} (@pxref{what-is-coded}), and then encoding them using a range +encoder. For example, the option @samp{-0} of lzip uses the scheme in almost +the simplest way possible; issuing the longest match it can find, or a +literal byte if it can't find a match. Inversely, a much more elaborated way +of finding coding sequences of minimum size than the one currently used by +lzip could be developed, and the resulting sequence could also be coded +using the LZMA coding scheme. + +Lzip currently implements two variants of the LZMA algorithm: fast +(used by option @samp{-0}) and normal (used by all other compression levels). + +The high compression of LZMA comes from combining two basic, well-proven +compression ideas: sliding dictionaries (LZ77/78) and markov models (the +thing used by every compression algorithm that uses a range encoder or +similar order-0 entropy coder as its last stage) with segregation of +contexts according to what the bits are used for. + +Lzip is a two stage compressor. The first stage is a Lempel-Ziv coder, +which reduces redundancy by translating chunks of data to their +corresponding distance-length pairs. The second stage is a range encoder +that uses a different probability model for each type of data: +distances, lengths, literal bytes, etc. + +Here is how it works, step by step: + +1) The member header is written to the output stream. + +2) The first byte is coded literally, because there are no previous +bytes to which the match finder can refer to. + +3) The main encoder advances to the next byte in the input data and +calls the match finder. + +4) The match finder fills an array with the minimum distances before the +current byte where a match of a given length can be found. + +5) Go back to step 3 until a sequence (formed of pairs, repeated +distances, and literal bytes) of minimum price has been formed. Where the +price represents the number of output bits produced. + +6) The range encoder encodes the sequence produced by the main encoder +and sends the bytes produced to the output stream. + +7) Go back to step 3 until the input data are finished or until the +member or volume size limits are reached. + +8) The range encoder is flushed. + +9) The member trailer is written to the output stream. + +10) If there are more data to compress, go back to step 1. + +@sp 1 +During compression, lzip reads data in large blocks (one dictionary size at +a time). Therefore it may block for up to tens of seconds any process +feeding data to it through a pipe. This is normal. The blocking intervals +get longer with higher compression levels because dictionary size increases +(and compression speed decreases) with compression level. + +@noindent +The ideas embodied in lzip are due to (at least) the following people: +Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for the +definition of Markov chains), G.N.N. Martin (for the definition of range +encoding), Igor Pavlov (for putting all the above together in LZMA), and +Julian Seward (for bzip2's CLI). + + +@node File format +@chapter File format +@cindex file format + +Perfection is reached, not when there is no longer anything to add, but +when there is no longer anything to take away.@* +--- Antoine de Saint-Exupery + +@sp 1 +In the diagram below, a box like this: + +@verbatim ++---+ +| | <-- the vertical bars might be missing ++---+ +@end verbatim + +represents one byte; a box like this: + +@verbatim ++==============+ +| | ++==============+ +@end verbatim + +represents a variable number of bytes. + +@sp 1 +A lzip file consists of a series of independent "members" (compressed data +sets). The members simply appear one after another in the file, with no +additional information before, between, or after them. Each member can +encode in compressed form up to @w{16 EiB - 1 byte} of uncompressed data. +The size of a multimember file is unlimited. + +Each member has the following structure: + +@verbatim ++--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +| ID string | VN | DS | LZMA stream | CRC32 | Data size | Member size | ++--+--+--+--+----+----+=============+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ +@end verbatim + +All multibyte values are stored in little endian order. + +@table @samp +@item ID string (the "magic" bytes) +A four byte string, identifying the lzip format, with the value "LZIP" +(0x4C, 0x5A, 0x49, 0x50). + +@item VN (version number, 1 byte) +Just in case something needs to be modified in the future. 1 for now. + +@anchor{coded-dict-size} +@item DS (coded dictionary size, 1 byte) +The dictionary size is calculated by taking a power of 2 (the base size) +and subtracting from it a fraction between 0/16 and 7/16 of the base size.@* +Bits 4-0 contain the base 2 logarithm of the base size (12 to 29).@* +Bits 7-5 contain the numerator of the fraction (0 to 7) to subtract +from the base size to obtain the dictionary size.@* +Example: 0xD3 = 2^19 - 6 * 2^15 = 512 KiB - 6 * 32 KiB = 320 KiB@* +Valid values for dictionary size range from 4 KiB to 512 MiB. + +@item LZMA stream +The LZMA stream, finished by an "End Of Stream" marker. Uses default values +for encoder properties. @xref{Stream format}, for a complete description. + +@item CRC32 (4 bytes) +Cyclic Redundancy Check (CRC) of the original uncompressed data. + +@item Data size (8 bytes) +Size of the original uncompressed data. + +@item Member size (8 bytes) +Total size of the member, including header and trailer. This field acts +as a distributed index, allows the verification of stream integrity, and +facilitates the safe recovery of undamaged members from multimember files. +Member size should be limited to @w{2 PiB} to prevent the data size field +from overflowing. + +@end table + + +@node Stream format +@chapter Format of the LZMA stream in lzip files +@cindex format of the LZMA stream + +The LZMA algorithm has three parameters, called "special LZMA +properties", to adjust it for some kinds of binary data. These +parameters are: @samp{literal_context_bits} (with a default value of 3), +@samp{literal_pos_state_bits} (with a default value of 0), and +@samp{pos_state_bits} (with a default value of 2). As a general purpose +compressor, lzip only uses the default values for these parameters. In +particular @samp{literal_pos_state_bits} has been optimized away and +does not even appear in the code. + +Lzip finishes the LZMA stream with an "End Of Stream" (EOS) marker (the +distance-length pair @w{0xFFFFFFFFU, 2}), which in conjunction with the +@samp{member size} field in the member trailer allows the verification of +stream integrity. The EOS marker is the only marker allowed in lzip files. +The LZMA stream in lzip files always has these two features (default +properties and EOS marker) and is referred to in this document as +LZMA-302eos. This simplified form of the LZMA stream format has been chosen +to maximize interoperability and safety. + +The second stage of LZMA is a range encoder that uses a different +probability model for each type of symbol: distances, lengths, literal +bytes, etc. Range encoding conceptually encodes all the symbols of the +message into one number. Unlike Huffman coding, which assigns to each +symbol a bit-pattern and concatenates all the bit-patterns together, +range encoding can compress one symbol to less than one bit. Therefore +the compressed data produced by a range encoder can't be split in pieces +that could be described individually. + +It seems that the only way of describing the LZMA-302eos stream is to +describe the algorithm that decodes it. And given the many details +about the range decoder that need to be described accurately, the source +code of a real decompressor seems the only appropriate reference to use. + +What follows is a description of the decoding algorithm for LZMA-302eos +streams using as reference the source code of "lzd", an educational +decompressor for lzip files which can be downloaded from the lzip download +directory. Lzd is written in C++11 and its source code is included in +appendix A. @xref{Reference source code}. + +@sp 1 +@section What is coded + +@anchor{what-is-coded} +The LZMA stream includes literals, matches, and repeated matches (matches +reusing a recently used distance). There are 7 different coding sequences: + +@multitable @columnfractions .35 .14 .51 +@headitem Bit sequence @tab Name @tab Description +@item 0 + byte @tab literal @tab literal byte +@item 1 + 0 + len + dis @tab match @tab distance-length pair +@item 1 + 1 + 0 + 0 @tab shortrep @tab 1 byte match at latest used distance +@item 1 + 1 + 0 + 1 + len @tab rep0 @tab len bytes match at latest used distance +@item 1 + 1 + 1 + 0 + len @tab rep1 @tab len bytes match at second +latest used distance +@item 1 + 1 + 1 + 1 + 0 + len @tab rep2 @tab len bytes match at third +latest used distance +@item 1 + 1 + 1 + 1 + 1 + len @tab rep3 @tab len bytes match at fourth +latest used distance +@end multitable + +@sp 1 +In the following tables, multibit sequences are coded in normal order, +from most significant bit (MSB) to least significant bit (LSB), except +where noted otherwise. + +Lengths (the @samp{len} in the table above) are coded as follows: + +@multitable @columnfractions .5 .5 +@headitem Bit sequence @tab Description +@item 0 + 3 bits @tab lengths from 2 to 9 +@item 1 + 0 + 3 bits @tab lengths from 10 to 17 +@item 1 + 1 + 8 bits @tab lengths from 18 to 273 +@end multitable + +@sp 1 +The coding of distances is a little more complicated, so I'll begin by +explaining a simpler version of the encoding. + +Imagine you need to encode a number from 0 to @w{2^32 - 1}, and you want to +do it in a way that produces shorter codes for the smaller numbers. You may +first encode the position of the most significant bit that is set to 1, +which you may find by making a bit scan from the left (from the MSB). A +position of 0 means that the number is 0 (no bit is set), 1 means the LSB is +the first bit set (the number is 1), and 32 means the MSB is set (i.e., the +number is @w{>= 0x80000000}). Then, if the position is @w{>= 2}, you encode +the remaining @w{position - 1} bits. Let's call these bits "direct bits" +because they are coded directly by value instead of indirectly by position. + +The inconvenient of this simple method is that it needs 6 bits to encode the +position, but it just uses 33 of the 64 possible values, wasting almost half +of the codes. + +The intelligent trick of LZMA is that it encodes in what it calls a "slot" +the position of the most significant bit set, along with the value of the +next bit, using the same 6 bits that would take to encode the position +alone. This seems to need 66 slots (twice the number of positions), but for +positions 0 and 1 there is no next bit, so the number of slots needed is 64 +(0 to 63). + +The 6 bits representing this "slot number" are then context-coded. If +the distance is @w{>= 4}, the remaining bits are encoded as follows. +@samp{direct_bits} is the amount of remaining bits (from 1 to 30) needed +to form a complete distance, and is calculated as @w{(slot >> 1) - 1}. +If a distance needs 6 or more direct_bits, the last 4 bits are encoded +separately. The last piece (all the direct_bits for distances 4 to 127, +or the last 4 bits for distances @w{>= 128}) is context-coded in reverse +order (from LSB to MSB). For distances @w{>= 128}, the +@w{@samp{direct_bits - 4}} part is encoded with fixed 0.5 probability. + +@multitable @columnfractions .5 .5 +@headitem Bit sequence @tab Description +@item slot @tab distances from 0 to 3 +@item slot + direct_bits @tab distances from 4 to 127 +@item slot + (direct_bits - 4) + 4 bits @tab distances from 128 to +2^32 - 1 +@end multitable + +@sp 1 +@section The coding contexts + +These contexts (@samp{Bit_model} in the source), are integers or arrays +of integers representing the probability of the corresponding bit being 0. + +The indices used in these arrays are: + +@table @samp +@item state +A state machine (@samp{State} in the source) with 12 states (0 to 11), +coding the latest 2 to 4 types of sequences processed. The initial state +is 0. + +@item pos_state +Value of the 2 least significant bits of the current position in the +decoded data. + +@item literal_state +Value of the 3 most significant bits of the latest byte decoded. + +@item len_state +Coded value of the current match length @w{(length - 2)}, with a maximum +of 3. The resulting value is in the range 0 to 3. + +@end table + + +The types of previous sequences corresponding to each state are shown in the +following table. @samp{!literal} is any sequence except a literal byte. +@samp{rep} is any one of @samp{rep0}, @samp{rep1}, @samp{rep2}, or +@samp{rep3}. The last type in each line is the most recent. + +@multitable {State} {rep or (!literal, shortrep), literal, literal} +@headitem State @tab Types of previous sequences +@item 0 @tab literal, literal, literal +@item 1 @tab match, literal, literal +@item 2 @tab rep or (!literal, shortrep), literal, literal +@item 3 @tab literal, shortrep, literal, literal +@item 4 @tab match, literal +@item 5 @tab rep or (!literal, shortrep), literal +@item 6 @tab literal, shortrep, literal +@item 7 @tab literal, match +@item 8 @tab literal, rep +@item 9 @tab literal, shortrep +@item 10 @tab !literal, match +@item 11 @tab !literal, (rep or shortrep) +@end multitable + +@sp 1 +The contexts for decoding the type of coding sequence are: + +@multitable @columnfractions .2 .35 .45 +@headitem Name @tab Indices @tab Used when +@item bm_match @tab state, pos_state @tab sequence start +@item bm_rep @tab state @tab after sequence 1 +@item bm_rep0 @tab state @tab after sequence 11 +@item bm_rep1 @tab state @tab after sequence 111 +@item bm_rep2 @tab state @tab after sequence 1111 +@item bm_len @tab state, pos_state @tab after sequence 110 +@end multitable + +@sp 1 +The contexts for decoding distances are: + +@multitable @columnfractions .2 .3 .5 +@headitem Name @tab Indices @tab Used when +@item bm_dis_slot @tab len_state, bit tree @tab distance start +@item bm_dis @tab reverse bit tree @tab after slots 4 to 13 +@item bm_align @tab reverse bit tree @tab for distances >= 128, after +fixed probability bits +@end multitable + +@sp 1 +There are two separate sets of contexts for lengths (@samp{Len_model} in +the source). One for normal matches, the other for repeated matches. The +contexts in each Len_model are (see @samp{decode_len} in the source): + +@multitable @columnfractions .2 .4 .4 +@headitem Name @tab Indices @tab Used when +@item choice1 @tab none @tab length start +@item choice2 @tab none @tab after sequence 1 +@item bm_low @tab pos_state, bit tree @tab after sequence 0 +@item bm_mid @tab pos_state, bit tree @tab after sequence 10 +@item bm_high @tab bit tree @tab after sequence 11 +@end multitable + +@sp 1 +The context array @samp{bm_literal} is special. In principle it acts as +a normal bit tree context, the one selected by @samp{literal_state}. But +if the previous decoded byte was not a literal, two other bit tree +contexts are used depending on the value of each bit in +@samp{match_byte} (the byte at the latest used distance), until a bit is +decoded that is different from its corresponding bit in +@samp{match_byte}. After the first difference is found, the rest of the +byte is decoded using the normal bit tree context. (See +@samp{decode_matched} in the source). + +@sp 1 +@section The range decoder + +The LZMA stream is consumed one byte at a time by the range decoder. +(See @samp{normalize} in the source). Every byte consumed produces a +variable number of decoded bits, depending on how well these bits agree +with their context. (See @samp{decode_bit} in the source). + +The range decoder state consists of two unsigned 32-bit variables: +@samp{range} (representing the most significant part of the range size +not yet decoded) and @samp{code} (representing the current point within +@samp{range}). @samp{range} is initialized to @w{2^32 - 1}, and +@samp{code} is initialized to 0. + +The range encoder produces a first 0 byte that must be ignored by the +range decoder. This is done by shifting 5 bytes in the initialization of +@samp{code} instead of 4. (See the @samp{Range_decoder} constructor in +the source). + +@sp 1 +@section Decoding and verifying the LZMA stream + +After decoding the member header and obtaining the dictionary size, the +range decoder is initialized and then the LZMA decoder enters a loop +(see @samp{decode_member} in the source) where it invokes the range +decoder with the appropriate contexts to decode the different coding +sequences (matches, repeated matches, and literal bytes), until the "End +Of Stream" marker is decoded. + +Once the "End Of Stream" marker has been decoded, the decompressor reads and +decodes the member trailer, and verifies that the three integrity factors +stored there (CRC, data size, and member size) match those computed from the +data. + + +@node Trailing data +@chapter Extra data appended to the file +@cindex trailing data + +Sometimes extra data are found appended to a lzip file after the last +member. Such trailing data may be: + +@itemize @bullet +@item +Padding added to make the file size a multiple of some block size, for +example when writing to a tape. It is safe to append any amount of +padding zero bytes to a lzip file. + +@item +Useful data added by the user; a cryptographically secure hash, a +description of file contents, etc. It is safe to append any amount of +text to a lzip file as long as none of the first four bytes of the text +match the corresponding byte in the string "LZIP", and the text does not +contain any zero bytes (null characters). Nonzero bytes and zero bytes +can't be safely mixed in trailing data. + +@item +Garbage added by some not totally successful copy operation. + +@item +Malicious data added to the file in order to make its total size and +hash value (for a chosen hash) coincide with those of another file. + +@item +In rare cases, trailing data could be the corrupt header of another +member. In multimember or concatenated files the probability of +corruption happening in the magic bytes is 5 times smaller than the +probability of getting a false positive caused by the corruption of the +integrity information itself. Therefore it can be considered to be below +the noise level. Additionally, the test used by lzip to discriminate +trailing data from a corrupt header has a Hamming distance (HD) of 3, +and the 3 bit flips must happen in different magic bytes for the test to +fail. In any case, the option @samp{--trailing-error} guarantees that +any corrupt header will be detected. +@end itemize + +Trailing data are in no way part of the lzip file format, but tools +reading lzip files are expected to behave as correctly and usefully as +possible in the presence of trailing data. + +Trailing data can be safely ignored in most cases. In some cases, like +that of user-added data, they are expected to be ignored. In those cases +where a file containing trailing data must be rejected, the option +@samp{--trailing-error} can be used. @xref{--trailing-error}. + + +@node Examples +@chapter A small tutorial with examples +@cindex examples + +WARNING! Even if lzip is bug-free, other causes may result in a corrupt +compressed file (bugs in the system libraries, memory errors, etc). +Therefore, if the data you are going to compress are important, give the +option @samp{--keep} to lzip and don't remove the original file until you +verify the compressed file with a command like +@w{@samp{lzip -cd file.lz | cmp file -}}. Most RAM errors happening during +compression can only be detected by comparing the compressed file with the +original because the corruption happens before lzip compresses the RAM +contents, resulting in a valid compressed file containing wrong data. + +@sp 1 +@noindent +Example 1: Extract all the files from archive @samp{foo.tar.lz}. + +@example + tar -xf foo.tar.lz +or + lzip -cd foo.tar.lz | tar -xf - +@end example + +@sp 1 +@noindent +Example 2: Replace a regular file with its compressed version @samp{file.lz} +and show the compression ratio. + +@example +lzip -v file +@end example + +@sp 1 +@noindent +Example 3: Like example 2 but the created @samp{file.lz} is multimember with +a member size of @w{1 MiB}. The compression ratio is not shown. + +@example +lzip -b 1MiB file +@end example + +@sp 1 +@noindent +Example 4: Restore a regular file from its compressed version +@samp{file.lz}. If the operation is successful, @samp{file.lz} is removed. + +@example +lzip -d file.lz +@end example + +@sp 1 +@noindent +Example 5: Verify the integrity of the compressed file @samp{file.lz} and +show status. + +@example +lzip -tv file.lz +@end example + +@sp 1 +@anchor{concat-example} +@noindent +Example 6: The right way of concatenating the decompressed output of two or +more compressed files. @xref{Trailing data}. + +@example +Don't do this + cat file1.lz file2.lz file3.lz | lzip -d - +Do this instead + lzip -cd file1.lz file2.lz file3.lz +@end example + +@sp 1 +@noindent +Example 7: Decompress @samp{file.lz} partially until @w{10 KiB} of +decompressed data are produced. + +@example +lzip -cd file.lz | dd bs=1024 count=10 +@end example + +@sp 1 +@noindent +Example 8: Decompress @samp{file.lz} partially from decompressed byte at +offset 10000 to decompressed byte at offset 14999 (5000 bytes are produced). + +@example +lzip -cd file.lz | dd bs=1000 skip=10 count=5 +@end example + +@sp 1 +@noindent +Example 9: Compress a whole device in /dev/sdc and send the output to +@samp{file.lz}. + +@example + lzip -c /dev/sdc > file.lz +or + lzip /dev/sdc -o file.lz +@end example + +@sp 1 +@noindent +Example 10: Create a multivolume compressed tar archive with a volume size +of @w{1440 KiB}. + +@example +tar -c some_directory | lzip -S 1440KiB -o volume_name - +@end example + +@sp 1 +@noindent +Example 11: Extract a multivolume compressed tar archive. + +@example +lzip -cd volume_name*.lz | tar -xf - +@end example + +@sp 1 +@noindent +Example 12: Create a multivolume compressed backup of a large database file +with a volume size of @w{650 MB}, where each volume is a multimember file +with a member size of @w{32 MiB}. + +@example +lzip -b 32MiB -S 650MB big_db +@end example + + +@node Problems +@chapter Reporting bugs +@cindex bugs +@cindex getting help + +There are probably bugs in lzip. There are certainly errors and +omissions in this manual. If you report them, they will get fixed. If +you don't, no one will ever know about them and they will remain unfixed +for all eternity, if not longer. + +If you find a bug in lzip, please send electronic mail to +@email{lzip-bug@@nongnu.org}. Include the version number, which you can +find by running @w{@samp{lzip --version}}. + + +@node Reference source code +@appendix Reference source code +@cindex reference source code + +@verbatim +/* Lzd - Educational decompressor for the lzip format + Copyright (C) 2013-2022 Antonio Diaz Diaz. + + This program is free software. Redistribution and use in source and + binary forms, with or without modification, are permitted provided + that the following conditions are met: + + 1. Redistributions of source code must retain the above copyright + notice, this list of conditions, and the following disclaimer. + + 2. Redistributions in binary form must reproduce the above copyright + notice, this list of conditions, and the following disclaimer in the + documentation and/or other materials provided with the distribution. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. +*/ +/* + Exit status: 0 for a normal exit, 1 for environmental problems + (file not found, invalid flags, I/O errors, etc), 2 to indicate a + corrupt or invalid input file. +*/ + +#include <algorithm> +#include <cerrno> +#include <cstdio> +#include <cstdlib> +#include <cstring> +#include <stdint.h> +#include <unistd.h> +#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__ +#include <fcntl.h> +#include <io.h> +#endif + + +class State + { + int st; + +public: + enum { states = 12 }; + State() : st( 0 ) {} + int operator()() const { return st; } + bool is_char() const { return st < 7; } + + void set_char() + { + const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 }; + st = next[st]; + } + void set_match() { st = ( st < 7 ) ? 7 : 10; } + void set_rep() { st = ( st < 7 ) ? 8 : 11; } + void set_short_rep() { st = ( st < 7 ) ? 9 : 11; } + }; + + +enum { + min_dictionary_size = 1 << 12, + max_dictionary_size = 1 << 29, + literal_context_bits = 3, + literal_pos_state_bits = 0, // not used + pos_state_bits = 2, + pos_states = 1 << pos_state_bits, + pos_state_mask = pos_states - 1, + + len_states = 4, + dis_slot_bits = 6, + start_dis_model = 4, + end_dis_model = 14, + modeled_distances = 1 << ( end_dis_model / 2 ), // 128 + dis_align_bits = 4, + dis_align_size = 1 << dis_align_bits, + + len_low_bits = 3, + len_mid_bits = 3, + len_high_bits = 8, + len_low_symbols = 1 << len_low_bits, + len_mid_symbols = 1 << len_mid_bits, + len_high_symbols = 1 << len_high_bits, + max_len_symbols = len_low_symbols + len_mid_symbols + len_high_symbols, + + min_match_len = 2, // must be 2 + + bit_model_move_bits = 5, + bit_model_total_bits = 11, + bit_model_total = 1 << bit_model_total_bits }; + +struct Bit_model + { + int probability; + Bit_model() : probability( bit_model_total / 2 ) {} + }; + +struct Len_model + { + Bit_model choice1; + Bit_model choice2; + Bit_model bm_low[pos_states][len_low_symbols]; + Bit_model bm_mid[pos_states][len_mid_symbols]; + Bit_model bm_high[len_high_symbols]; + }; + + +class CRC32 + { + uint32_t data[256]; // Table of CRCs of all 8-bit messages. + +public: + CRC32() + { + for( unsigned n = 0; n < 256; ++n ) + { + unsigned c = n; + for( int k = 0; k < 8; ++k ) + { if( c & 1 ) c = 0xEDB88320U ^ ( c >> 1 ); else c >>= 1; } + data[n] = c; + } + } + + void update_buf( uint32_t & crc, const uint8_t * const buffer, + const int size ) const + { + for( int i = 0; i < size; ++i ) + crc = data[(crc^buffer[i])&0xFF] ^ ( crc >> 8 ); + } + }; + +const CRC32 crc32; + + +typedef uint8_t Lzip_header[6]; // 0-3 magic bytes + // 4 version + // 5 coded dictionary size +typedef uint8_t Lzip_trailer[20]; + // 0-3 CRC32 of the uncompressed data + // 4-11 size of the uncompressed data + // 12-19 member size including header and trailer + +class Range_decoder + { + unsigned long long member_pos; + uint32_t code; + uint32_t range; + +public: + Range_decoder() : member_pos( 6 ), code( 0 ), range( 0xFFFFFFFFU ) + { + for( int i = 0; i < 5; ++i ) code = ( code << 8 ) | get_byte(); + } + + uint8_t get_byte() { ++member_pos; return std::getc( stdin ); } + unsigned long long member_position() const { return member_pos; } + + unsigned decode( const int num_bits ) + { + unsigned symbol = 0; + for( int i = num_bits; i > 0; --i ) + { + range >>= 1; + symbol <<= 1; + if( code >= range ) { code -= range; symbol |= 1; } + if( range <= 0x00FFFFFFU ) // normalize + { range <<= 8; code = ( code << 8 ) | get_byte(); } + } + return symbol; + } + + unsigned decode_bit( Bit_model & bm ) + { + unsigned symbol; + const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability; + if( code < bound ) + { + range = bound; + bm.probability += + ( bit_model_total - bm.probability ) >> bit_model_move_bits; + symbol = 0; + } + else + { + range -= bound; + code -= bound; + bm.probability -= bm.probability >> bit_model_move_bits; + symbol = 1; + } + if( range <= 0x00FFFFFFU ) // normalize + { range <<= 8; code = ( code << 8 ) | get_byte(); } + return symbol; + } + + unsigned decode_tree( Bit_model bm[], const int num_bits ) + { + unsigned symbol = 1; + for( int i = 0; i < num_bits; ++i ) + symbol = ( symbol << 1 ) | decode_bit( bm[symbol] ); + return symbol - ( 1 << num_bits ); + } + + unsigned decode_tree_reversed( Bit_model bm[], const int num_bits ) + { + unsigned symbol = decode_tree( bm, num_bits ); + unsigned reversed_symbol = 0; + for( int i = 0; i < num_bits; ++i ) + { + reversed_symbol = ( reversed_symbol << 1 ) | ( symbol & 1 ); + symbol >>= 1; + } + return reversed_symbol; + } + + unsigned decode_matched( Bit_model bm[], const unsigned match_byte ) + { + unsigned symbol = 1; + for( int i = 7; i >= 0; --i ) + { + const unsigned match_bit = ( match_byte >> i ) & 1; + const unsigned bit = decode_bit( bm[symbol+(match_bit<<8)+0x100] ); + symbol = ( symbol << 1 ) | bit; + if( match_bit != bit ) + { + while( symbol < 0x100 ) + symbol = ( symbol << 1 ) | decode_bit( bm[symbol] ); + break; + } + } + return symbol & 0xFF; + } + + unsigned decode_len( Len_model & lm, const int pos_state ) + { + if( decode_bit( lm.choice1 ) == 0 ) + return decode_tree( lm.bm_low[pos_state], len_low_bits ); + if( decode_bit( lm.choice2 ) == 0 ) + return len_low_symbols + + decode_tree( lm.bm_mid[pos_state], len_mid_bits ); + return len_low_symbols + len_mid_symbols + + decode_tree( lm.bm_high, len_high_bits ); + } + }; + + +class LZ_decoder + { + unsigned long long partial_data_pos; + Range_decoder rdec; + const unsigned dictionary_size; + uint8_t * const buffer; // output buffer + unsigned pos; // current pos in buffer + unsigned stream_pos; // first byte not yet written to stdout + uint32_t crc_; + bool pos_wrapped; + + void flush_data(); + + uint8_t peek( const unsigned distance ) const + { + if( pos > distance ) return buffer[pos - distance - 1]; + if( pos_wrapped ) return buffer[dictionary_size + pos - distance - 1]; + return 0; // prev_byte of first byte + } + + void put_byte( const uint8_t b ) + { + buffer[pos] = b; + if( ++pos >= dictionary_size ) flush_data(); + } + +public: + explicit LZ_decoder( const unsigned dict_size ) + : + partial_data_pos( 0 ), + dictionary_size( dict_size ), + buffer( new uint8_t[dictionary_size] ), + pos( 0 ), + stream_pos( 0 ), + crc_( 0xFFFFFFFFU ), + pos_wrapped( false ) + {} + + ~LZ_decoder() { delete[] buffer; } + + unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; } + unsigned long long data_position() const + { return partial_data_pos + pos; } + uint8_t get_byte() { return rdec.get_byte(); } + unsigned long long member_position() const + { return rdec.member_position(); } + + bool decode_member(); + }; + + +void LZ_decoder::flush_data() + { + if( pos > stream_pos ) + { + const unsigned size = pos - stream_pos; + crc32.update_buf( crc_, buffer + stream_pos, size ); + if( std::fwrite( buffer + stream_pos, 1, size, stdout ) != size ) + { std::fprintf( stderr, "Write error: %s\n", std::strerror( errno ) ); + std::exit( 1 ); } + if( pos >= dictionary_size ) + { partial_data_pos += pos; pos = 0; pos_wrapped = true; } + stream_pos = pos; + } + } + + +bool LZ_decoder::decode_member() // Returns false if error + { + Bit_model bm_literal[1<<literal_context_bits][0x300]; + Bit_model bm_match[State::states][pos_states]; + Bit_model bm_rep[State::states]; + Bit_model bm_rep0[State::states]; + Bit_model bm_rep1[State::states]; + Bit_model bm_rep2[State::states]; + Bit_model bm_len[State::states][pos_states]; + Bit_model bm_dis_slot[len_states][1<<dis_slot_bits]; + Bit_model bm_dis[modeled_distances-end_dis_model+1]; + Bit_model bm_align[dis_align_size]; + Len_model match_len_model; + Len_model rep_len_model; + unsigned rep0 = 0; // rep[0-3] latest four distances + unsigned rep1 = 0; // used for efficient coding of + unsigned rep2 = 0; // repeated distances + unsigned rep3 = 0; + State state; + + while( !std::feof( stdin ) && !std::ferror( stdin ) ) + { + const int pos_state = data_position() & pos_state_mask; + if( rdec.decode_bit( bm_match[state()][pos_state] ) == 0 ) // 1st bit + { + // literal byte + const uint8_t prev_byte = peek( 0 ); + const int literal_state = prev_byte >> ( 8 - literal_context_bits ); + Bit_model * const bm = bm_literal[literal_state]; + if( state.is_char() ) + put_byte( rdec.decode_tree( bm, 8 ) ); + else + put_byte( rdec.decode_matched( bm, peek( rep0 ) ) ); + state.set_char(); + continue; + } + // match or repeated match + int len; + if( rdec.decode_bit( bm_rep[state()] ) != 0 ) // 2nd bit + { + if( rdec.decode_bit( bm_rep0[state()] ) == 0 ) // 3rd bit + { + if( rdec.decode_bit( bm_len[state()][pos_state] ) == 0 ) // 4th bit + { state.set_short_rep(); put_byte( peek( rep0 ) ); continue; } + } + else + { + unsigned distance; + if( rdec.decode_bit( bm_rep1[state()] ) == 0 ) // 4th bit + distance = rep1; + else + { + if( rdec.decode_bit( bm_rep2[state()] ) == 0 ) // 5th bit + distance = rep2; + else + { distance = rep3; rep3 = rep2; } + rep2 = rep1; + } + rep1 = rep0; + rep0 = distance; + } + state.set_rep(); + len = min_match_len + rdec.decode_len( rep_len_model, pos_state ); + } + else // match + { + rep3 = rep2; rep2 = rep1; rep1 = rep0; + len = min_match_len + rdec.decode_len( match_len_model, pos_state ); + const int len_state = std::min( len - min_match_len, len_states - 1 ); + rep0 = rdec.decode_tree( bm_dis_slot[len_state], dis_slot_bits ); + if( rep0 >= start_dis_model ) + { + const unsigned dis_slot = rep0; + const int direct_bits = ( dis_slot >> 1 ) - 1; + rep0 = ( 2 | ( dis_slot & 1 ) ) << direct_bits; + if( dis_slot < end_dis_model ) + rep0 += rdec.decode_tree_reversed( bm_dis + ( rep0 - dis_slot ), + direct_bits ); + else + { + rep0 += + rdec.decode( direct_bits - dis_align_bits ) << dis_align_bits; + rep0 += rdec.decode_tree_reversed( bm_align, dis_align_bits ); + if( rep0 == 0xFFFFFFFFU ) // marker found + { + flush_data(); + return ( len == min_match_len ); // End Of Stream marker + } + } + } + state.set_match(); + if( rep0 >= dictionary_size || ( rep0 >= pos && !pos_wrapped ) ) + { flush_data(); return false; } + } + for( int i = 0; i < len; ++i ) put_byte( peek( rep0 ) ); + } + flush_data(); + return false; + } + + +int main( const int argc, const char * const argv[] ) + { + if( argc > 2 || ( argc == 2 && std::strcmp( argv[1], "-d" ) != 0 ) ) + { + std::printf( + "Lzd %s - Educational decompressor for the lzip format.\n" + "Study the source to learn how a lzip decompressor works.\n" + "See the lzip manual for an explanation of the code.\n" + "\nUsage: %s [-d] < file.lz > file\n" + "Lzd decompresses from standard input to standard output.\n" + "\nCopyright (C) 2022 Antonio Diaz Diaz.\n" + "License 2-clause BSD.\n" + "This is free software: you are free to change and redistribute it.\n" + "There is NO WARRANTY, to the extent permitted by law.\n" + "Report bugs to lzip-bug@nongnu.org\n" + "Lzd home page: http://www.nongnu.org/lzip/lzd.html\n", + PROGVERSION, argv[0] ); + return 0; + } + +#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__ + setmode( STDIN_FILENO, O_BINARY ); + setmode( STDOUT_FILENO, O_BINARY ); +#endif + + for( bool first_member = true; ; first_member = false ) + { + Lzip_header header; // verify header + for( int i = 0; i < 6; ++i ) header[i] = std::getc( stdin ); + if( std::feof( stdin ) || std::memcmp( header, "LZIP\x01", 5 ) != 0 ) + { + if( first_member ) + { std::fputs( "Bad magic number (file not in lzip format).\n", + stderr ); return 2; } + break; // ignore trailing data + } + unsigned dict_size = 1 << ( header[5] & 0x1F ); + dict_size -= ( dict_size / 16 ) * ( ( header[5] >> 5 ) & 7 ); + if( dict_size < min_dictionary_size || dict_size > max_dictionary_size ) + { std::fputs( "Invalid dictionary size in member header.\n", stderr ); + return 2; } + + LZ_decoder decoder( dict_size ); // decode LZMA stream + if( !decoder.decode_member() ) + { std::fputs( "Data error\n", stderr ); return 2; } + + Lzip_trailer trailer; // verify trailer + for( int i = 0; i < 20; ++i ) trailer[i] = decoder.get_byte(); + int retval = 0; + unsigned crc = 0; + for( int i = 3; i >= 0; --i ) crc = ( crc << 8 ) + trailer[i]; + if( crc != decoder.crc() ) + { std::fputs( "CRC mismatch\n", stderr ); retval = 2; } + + unsigned long long data_size = 0; + for( int i = 11; i >= 4; --i ) + data_size = ( data_size << 8 ) + trailer[i]; + if( data_size != decoder.data_position() ) + { std::fputs( "Data size mismatch\n", stderr ); retval = 2; } + + unsigned long long member_size = 0; + for( int i = 19; i >= 12; --i ) + member_size = ( member_size << 8 ) + trailer[i]; + if( member_size != decoder.member_position() ) + { std::fputs( "Member size mismatch\n", stderr ); retval = 2; } + if( retval ) return retval; + } + + if( std::fclose( stdout ) != 0 ) + { std::fprintf( stderr, "Error closing stdout: %s\n", + std::strerror( errno ) ); return 1; } + return 0; + } +@end verbatim + + +@node Concept index +@unnumbered Concept index + +@printindex cp + +@bye diff --git a/encoder.cc b/encoder.cc new file mode 100644 index 0000000..afb38d2 --- /dev/null +++ b/encoder.cc @@ -0,0 +1,594 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +#define _FILE_OFFSET_BITS 64 + +#include <algorithm> +#include <cerrno> +#include <cstdlib> +#include <cstring> +#include <string> +#include <vector> +#include <stdint.h> + +#include "lzip.h" +#include "encoder_base.h" +#include "encoder.h" + + +const CRC32 crc32; + + +int LZ_encoder::get_match_pairs( Pair * pairs ) + { + int len_limit = match_len_limit; + if( len_limit > available_bytes() ) + { + len_limit = available_bytes(); + if( len_limit < 4 ) return 0; + } + + int maxlen = 3; // only used if pairs != 0 + int num_pairs = 0; + const int min_pos = ( pos > dictionary_size ) ? pos - dictionary_size : 0; + const uint8_t * const data = ptr_to_current_pos(); + + unsigned tmp = crc32[data[0]] ^ data[1]; + const int key2 = tmp & ( num_prev_positions2 - 1 ); + tmp ^= (unsigned)data[2] << 8; + const int key3 = num_prev_positions2 + ( tmp & ( num_prev_positions3 - 1 ) ); + const int key4 = num_prev_positions23 + + ( ( tmp ^ ( crc32[data[3]] << 5 ) ) & key4_mask ); + + if( pairs ) + { + const int np2 = prev_positions[key2]; + const int np3 = prev_positions[key3]; + if( np2 > min_pos && buffer[np2-1] == data[0] ) + { + pairs[0].dis = pos - np2; + pairs[0].len = maxlen = 2 + ( np2 == np3 ); + num_pairs = 1; + } + if( np2 != np3 && np3 > min_pos && buffer[np3-1] == data[0] ) + { + maxlen = 3; + pairs[num_pairs++].dis = pos - np3; + } + if( num_pairs > 0 ) + { + const int delta = pairs[num_pairs-1].dis + 1; + while( maxlen < len_limit && data[maxlen-delta] == data[maxlen] ) + ++maxlen; + pairs[num_pairs-1].len = maxlen; + if( maxlen < 3 ) maxlen = 3; + if( maxlen >= len_limit ) pairs = 0; // done. now just skip + } + } + + const int pos1 = pos + 1; + prev_positions[key2] = pos1; + prev_positions[key3] = pos1; + int newpos1 = prev_positions[key4]; + prev_positions[key4] = pos1; + + int32_t * ptr0 = pos_array + ( cyclic_pos << 1 ); + int32_t * ptr1 = ptr0 + 1; + int len = 0, len0 = 0, len1 = 0; + + for( int count = cycles; ; ) + { + if( newpos1 <= min_pos || --count < 0 ) { *ptr0 = *ptr1 = 0; break; } + + const int delta = pos1 - newpos1; + int32_t * const newptr = pos_array + + ( ( cyclic_pos - delta + + ( ( cyclic_pos >= delta ) ? 0 : dictionary_size + 1 ) ) << 1 ); + if( data[len-delta] == data[len] ) + { + while( ++len < len_limit && data[len-delta] == data[len] ) {} + if( pairs && maxlen < len ) + { + pairs[num_pairs].dis = delta - 1; + pairs[num_pairs].len = maxlen = len; + ++num_pairs; + } + if( len >= len_limit ) + { + *ptr0 = newptr[0]; + *ptr1 = newptr[1]; + break; + } + } + if( data[len-delta] < data[len] ) + { + *ptr0 = newpos1; + ptr0 = newptr + 1; + newpos1 = *ptr0; + len0 = len; if( len1 < len ) len = len1; + } + else + { + *ptr1 = newpos1; + ptr1 = newptr; + newpos1 = *ptr1; + len1 = len; if( len0 < len ) len = len0; + } + } + return num_pairs; + } + + +void LZ_encoder::update_distance_prices() + { + for( int dis = start_dis_model; dis < modeled_distances; ++dis ) + { + const int dis_slot = dis_slots[dis]; + const int direct_bits = ( dis_slot >> 1 ) - 1; + const int base = ( 2 | ( dis_slot & 1 ) ) << direct_bits; + const int price = price_symbol_reversed( bm_dis + ( base - dis_slot ), + dis - base, direct_bits ); + for( int len_state = 0; len_state < len_states; ++len_state ) + dis_prices[len_state][dis] = price; + } + + for( int len_state = 0; len_state < len_states; ++len_state ) + { + int * const dsp = dis_slot_prices[len_state]; + const Bit_model * const bmds = bm_dis_slot[len_state]; + int slot = 0; + for( ; slot < end_dis_model; ++slot ) + dsp[slot] = price_symbol6( bmds, slot ); + for( ; slot < num_dis_slots; ++slot ) + dsp[slot] = price_symbol6( bmds, slot ) + + (((( slot >> 1 ) - 1 ) - dis_align_bits ) << price_shift_bits ); + + int * const dp = dis_prices[len_state]; + int dis = 0; + for( ; dis < start_dis_model; ++dis ) + dp[dis] = dsp[dis]; + for( ; dis < modeled_distances; ++dis ) + dp[dis] += dsp[dis_slots[dis]]; + } + } + + +/* Return the number of bytes advanced (ahead). + trials[0]..trials[ahead-1] contain the steps to encode. + ( trials[0].dis4 == -1 ) means literal. + A match/rep longer or equal than match_len_limit finishes the sequence. +*/ +int LZ_encoder::sequence_optimizer( const int reps[num_rep_distances], + const State state ) + { + int num_pairs, num_trials; + + if( pending_num_pairs > 0 ) // from previous call + { + num_pairs = pending_num_pairs; + pending_num_pairs = 0; + } + else + num_pairs = read_match_distances(); + const int main_len = ( num_pairs > 0 ) ? pairs[num_pairs-1].len : 0; + + int replens[num_rep_distances]; + int rep_index = 0; + for( int i = 0; i < num_rep_distances; ++i ) + { + replens[i] = true_match_len( 0, reps[i] + 1 ); + if( replens[i] > replens[rep_index] ) rep_index = i; + } + if( replens[rep_index] >= match_len_limit ) + { + trials[0].price = replens[rep_index]; + trials[0].dis4 = rep_index; + move_and_update( replens[rep_index] ); + return replens[rep_index]; + } + + if( main_len >= match_len_limit ) + { + trials[0].price = main_len; + trials[0].dis4 = pairs[num_pairs-1].dis + num_rep_distances; + move_and_update( main_len ); + return main_len; + } + + const int pos_state = data_position() & pos_state_mask; + const uint8_t prev_byte = peek( 1 ); + const uint8_t cur_byte = peek( 0 ); + const uint8_t match_byte = peek( reps[0] + 1 ); + + trials[1].price = price0( bm_match[state()][pos_state] ); + if( state.is_char() ) + trials[1].price += price_literal( prev_byte, cur_byte ); + else + trials[1].price += price_matched( prev_byte, cur_byte, match_byte ); + trials[1].dis4 = -1; // literal + + const int match_price = price1( bm_match[state()][pos_state] ); + const int rep_match_price = match_price + price1( bm_rep[state()] ); + + if( match_byte == cur_byte ) + trials[1].update( rep_match_price + price_shortrep( state, pos_state ), 0, 0 ); + + num_trials = std::max( main_len, replens[rep_index] ); + + if( num_trials < min_match_len ) + { + trials[0].price = 1; + trials[0].dis4 = trials[1].dis4; + move_pos(); + return 1; + } + + trials[0].state = state; + for( int i = 0; i < num_rep_distances; ++i ) + trials[0].reps[i] = reps[i]; + + for( int len = min_match_len; len <= num_trials; ++len ) + trials[len].price = infinite_price; + + for( int rep = 0; rep < num_rep_distances; ++rep ) + { + if( replens[rep] < min_match_len ) continue; + const int price = rep_match_price + price_rep( rep, state, pos_state ); + for( int len = min_match_len; len <= replens[rep]; ++len ) + trials[len].update( price + rep_len_prices.price( len, pos_state ), + rep, 0 ); + } + + if( main_len > replens[0] ) + { + const int normal_match_price = match_price + price0( bm_rep[state()] ); + int i = 0, len = std::max( replens[0] + 1, (int)min_match_len ); + while( len > pairs[i].len ) ++i; + while( true ) + { + const int dis = pairs[i].dis; + trials[len].update( normal_match_price + price_pair( dis, len, pos_state ), + dis + num_rep_distances, 0 ); + if( ++len > pairs[i].len && ++i >= num_pairs ) break; + } + } + + int cur = 0; + while( true ) // price optimization loop + { + move_pos(); + if( ++cur >= num_trials ) // no more initialized trials + { + backward( cur ); + return cur; + } + + const int num_pairs = read_match_distances(); + const int newlen = ( num_pairs > 0 ) ? pairs[num_pairs-1].len : 0; + if( newlen >= match_len_limit ) + { + pending_num_pairs = num_pairs; + backward( cur ); + return cur; + } + + // give final values to current trial + Trial & cur_trial = trials[cur]; + State cur_state; + { + const int dis4 = cur_trial.dis4; + int prev_index = cur_trial.prev_index; + const int prev_index2 = cur_trial.prev_index2; + + if( prev_index2 == single_step_trial ) + { + cur_state = trials[prev_index].state; + if( prev_index + 1 == cur ) // len == 1 + { + if( dis4 == 0 ) cur_state.set_short_rep(); + else cur_state.set_char(); // literal + } + else if( dis4 < num_rep_distances ) cur_state.set_rep(); + else cur_state.set_match(); + } + else + { + if( prev_index2 == dual_step_trial ) // dis4 == 0 (rep0) + --prev_index; + else // prev_index2 >= 0 + prev_index = prev_index2; + cur_state.set_char_rep(); + } + cur_trial.state = cur_state; + for( int i = 0; i < num_rep_distances; ++i ) + cur_trial.reps[i] = trials[prev_index].reps[i]; + mtf_reps( dis4, cur_trial.reps ); // literal is ignored + } + + const int pos_state = data_position() & pos_state_mask; + const uint8_t prev_byte = peek( 1 ); + const uint8_t cur_byte = peek( 0 ); + const uint8_t match_byte = peek( cur_trial.reps[0] + 1 ); + + int next_price = cur_trial.price + + price0( bm_match[cur_state()][pos_state] ); + if( cur_state.is_char() ) + next_price += price_literal( prev_byte, cur_byte ); + else + next_price += price_matched( prev_byte, cur_byte, match_byte ); + + // try last updates to next trial + Trial & next_trial = trials[cur+1]; + + next_trial.update( next_price, -1, cur ); // literal + + const int match_price = cur_trial.price + price1( bm_match[cur_state()][pos_state] ); + const int rep_match_price = match_price + price1( bm_rep[cur_state()] ); + + if( match_byte == cur_byte && next_trial.dis4 != 0 && + next_trial.prev_index2 == single_step_trial ) + { + const int price = rep_match_price + price_shortrep( cur_state, pos_state ); + if( price <= next_trial.price ) + { + next_trial.price = price; + next_trial.dis4 = 0; // rep0 + next_trial.prev_index = cur; + } + } + + const int triable_bytes = + std::min( available_bytes(), max_num_trials - 1 - cur ); + if( triable_bytes < min_match_len ) continue; + + const int len_limit = std::min( match_len_limit, triable_bytes ); + + // try literal + rep0 + if( match_byte != cur_byte && next_trial.prev_index != cur ) + { + const uint8_t * const data = ptr_to_current_pos(); + const int dis = cur_trial.reps[0] + 1; + const int limit = std::min( match_len_limit + 1, triable_bytes ); + int len = 1; + while( len < limit && data[len-dis] == data[len] ) ++len; + if( --len >= min_match_len ) + { + const int pos_state2 = ( pos_state + 1 ) & pos_state_mask; + State state2 = cur_state; state2.set_char(); + const int price = next_price + + price1( bm_match[state2()][pos_state2] ) + + price1( bm_rep[state2()] ) + + price_rep0_len( len, state2, pos_state2 ); + while( num_trials < cur + 1 + len ) + trials[++num_trials].price = infinite_price; + trials[cur+1+len].update2( price, cur + 1 ); + } + } + + int start_len = min_match_len; + + // try rep distances + for( int rep = 0; rep < num_rep_distances; ++rep ) + { + const uint8_t * const data = ptr_to_current_pos(); + const int dis = cur_trial.reps[rep] + 1; + int len; + + if( data[0-dis] != data[0] || data[1-dis] != data[1] ) continue; + for( len = min_match_len; len < len_limit; ++len ) + if( data[len-dis] != data[len] ) break; + while( num_trials < cur + len ) + trials[++num_trials].price = infinite_price; + int price = rep_match_price + price_rep( rep, cur_state, pos_state ); + for( int i = min_match_len; i <= len; ++i ) + trials[cur+i].update( price + rep_len_prices.price( i, pos_state ), + rep, cur ); + + if( rep == 0 ) start_len = len + 1; // discard shorter matches + + // try rep + literal + rep0 + int len2 = len + 1; + const int limit = std::min( match_len_limit + len2, triable_bytes ); + while( len2 < limit && data[len2-dis] == data[len2] ) ++len2; + len2 -= len + 1; + if( len2 < min_match_len ) continue; + + int pos_state2 = ( pos_state + len ) & pos_state_mask; + State state2 = cur_state; state2.set_rep(); + price += rep_len_prices.price( len, pos_state ) + + price0( bm_match[state2()][pos_state2] ) + + price_matched( data[len-1], data[len], data[len-dis] ); + pos_state2 = ( pos_state2 + 1 ) & pos_state_mask; + state2.set_char(); + price += price1( bm_match[state2()][pos_state2] ) + + price1( bm_rep[state2()] ) + + price_rep0_len( len2, state2, pos_state2 ); + while( num_trials < cur + len + 1 + len2 ) + trials[++num_trials].price = infinite_price; + trials[cur+len+1+len2].update3( price, rep, cur + len + 1, cur ); + } + + // try matches + if( newlen >= start_len && newlen <= len_limit ) + { + const int normal_match_price = match_price + + price0( bm_rep[cur_state()] ); + + while( num_trials < cur + newlen ) + trials[++num_trials].price = infinite_price; + + int i = 0; + while( pairs[i].len < start_len ) ++i; + int dis = pairs[i].dis; + for( int len = start_len; ; ++len ) + { + int price = normal_match_price + price_pair( dis, len, pos_state ); + trials[cur+len].update( price, dis + num_rep_distances, cur ); + + // try match + literal + rep0 + if( len == pairs[i].len ) + { + const uint8_t * const data = ptr_to_current_pos(); + const int dis2 = dis + 1; + int len2 = len + 1; + const int limit = std::min( match_len_limit + len2, triable_bytes ); + while( len2 < limit && data[len2-dis2] == data[len2] ) ++len2; + len2 -= len + 1; + if( len2 >= min_match_len ) + { + int pos_state2 = ( pos_state + len ) & pos_state_mask; + State state2 = cur_state; state2.set_match(); + price += price0( bm_match[state2()][pos_state2] ) + + price_matched( data[len-1], data[len], data[len-dis2] ); + pos_state2 = ( pos_state2 + 1 ) & pos_state_mask; + state2.set_char(); + price += price1( bm_match[state2()][pos_state2] ) + + price1( bm_rep[state2()] ) + + price_rep0_len( len2, state2, pos_state2 ); + + while( num_trials < cur + len + 1 + len2 ) + trials[++num_trials].price = infinite_price; + trials[cur+len+1+len2].update3( price, dis + num_rep_distances, + cur + len + 1, cur ); + } + if( ++i >= num_pairs ) break; + dis = pairs[i].dis; + } + } + } + } + } + + +bool LZ_encoder::encode_member( const unsigned long long member_size ) + { + const unsigned long long member_size_limit = + member_size - Lzip_trailer::size - max_marker_size; + const bool best = ( match_len_limit > 12 ); + const int dis_price_count = best ? 1 : 512; + const int align_price_count = best ? 1 : dis_align_size; + const int price_count = ( match_len_limit > 36 ) ? 1013 : 4093; + int price_counter = 0; // counters may decrement below 0 + int dis_price_counter = 0; + int align_price_counter = 0; + int reps[num_rep_distances]; + State state; + for( int i = 0; i < num_rep_distances; ++i ) reps[i] = 0; + + if( data_position() != 0 || renc.member_position() != Lzip_header::size ) + return false; // can be called only once + + if( !data_finished() ) // encode first byte + { + const uint8_t prev_byte = 0; + const uint8_t cur_byte = peek( 0 ); + renc.encode_bit( bm_match[state()][0], 0 ); + encode_literal( prev_byte, cur_byte ); + crc32.update_byte( crc_, cur_byte ); + get_match_pairs(); + move_pos(); + } + + while( !data_finished() ) + { + if( price_counter <= 0 && pending_num_pairs == 0 ) + { + price_counter = price_count; // recalculate prices every these bytes + if( dis_price_counter <= 0 ) + { dis_price_counter = dis_price_count; update_distance_prices(); } + if( align_price_counter <= 0 ) + { + align_price_counter = align_price_count; + for( int i = 0; i < dis_align_size; ++i ) + align_prices[i] = price_symbol_reversed( bm_align, i, dis_align_bits ); + } + match_len_prices.update_prices(); + rep_len_prices.update_prices(); + } + + int ahead = sequence_optimizer( reps, state ); + price_counter -= ahead; + + for( int i = 0; ahead > 0; ) + { + const int pos_state = ( data_position() - ahead ) & pos_state_mask; + const int len = trials[i].price; + int dis = trials[i].dis4; + + bool bit = ( dis < 0 ); + renc.encode_bit( bm_match[state()][pos_state], !bit ); + if( bit ) // literal byte + { + const uint8_t prev_byte = peek( ahead + 1 ); + const uint8_t cur_byte = peek( ahead ); + crc32.update_byte( crc_, cur_byte ); + if( state.is_char_set_char() ) + encode_literal( prev_byte, cur_byte ); + else + { + const uint8_t match_byte = peek( ahead + reps[0] + 1 ); + encode_matched( prev_byte, cur_byte, match_byte ); + } + } + else // match or repeated match + { + crc32.update_buf( crc_, ptr_to_current_pos() - ahead, len ); + mtf_reps( dis, reps ); + bit = ( dis < num_rep_distances ); + renc.encode_bit( bm_rep[state()], bit ); + if( bit ) // repeated match + { + bit = ( dis == 0 ); + renc.encode_bit( bm_rep0[state()], !bit ); + if( bit ) + renc.encode_bit( bm_len[state()][pos_state], len > 1 ); + else + { + renc.encode_bit( bm_rep1[state()], dis > 1 ); + if( dis > 1 ) + renc.encode_bit( bm_rep2[state()], dis > 2 ); + } + if( len == 1 ) state.set_short_rep(); + else + { + renc.encode_len( rep_len_model, len, pos_state ); + rep_len_prices.decrement_counter( pos_state ); + state.set_rep(); + } + } + else // match + { + dis -= num_rep_distances; + encode_pair( dis, len, pos_state ); + if( dis >= modeled_distances ) --align_price_counter; + --dis_price_counter; + match_len_prices.decrement_counter( pos_state ); + state.set_match(); + } + } + ahead -= len; i += len; + if( renc.member_position() >= member_size_limit ) + { + if( !dec_pos( ahead ) ) return false; + full_flush( state ); + return true; + } + } + } + full_flush( state ); + return true; + } diff --git a/encoder.h b/encoder.h new file mode 100644 index 0000000..862b524 --- /dev/null +++ b/encoder.h @@ -0,0 +1,290 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +class Len_prices + { + const Len_model & lm; + const int len_symbols; + const int count; + int prices[pos_states][max_len_symbols]; + int counters[pos_states]; // may decrement below 0 + + void update_low_mid_prices( const int pos_state ) + { + int * const pps = prices[pos_state]; + int tmp = price0( lm.choice1 ); + int len = 0; + for( ; len < len_low_symbols && len < len_symbols; ++len ) + pps[len] = tmp + price_symbol3( lm.bm_low[pos_state], len ); + if( len >= len_symbols ) return; + tmp = price1( lm.choice1 ) + price0( lm.choice2 ); + for( ; len < len_low_symbols + len_mid_symbols && len < len_symbols; ++len ) + pps[len] = tmp + + price_symbol3( lm.bm_mid[pos_state], len - len_low_symbols ); + } + + void update_high_prices() + { + const int tmp = price1( lm.choice1 ) + price1( lm.choice2 ); + for( int len = len_low_symbols + len_mid_symbols; len < len_symbols; ++len ) + // using 4 slots per value makes "price" faster + prices[3][len] = prices[2][len] = prices[1][len] = prices[0][len] = tmp + + price_symbol8( lm.bm_high, len - len_low_symbols - len_mid_symbols ); + } + +public: + void reset() { for( int i = 0; i < pos_states; ++i ) counters[i] = 0; } + + Len_prices( const Len_model & m, const int match_len_limit ) + : + lm( m ), + len_symbols( match_len_limit + 1 - min_match_len ), + count( ( match_len_limit > 12 ) ? 1 : len_symbols ) + { reset(); } + + void decrement_counter( const int pos_state ) { --counters[pos_state]; } + + void update_prices() + { + bool high_pending = false; + for( int pos_state = 0; pos_state < pos_states; ++pos_state ) + if( counters[pos_state] <= 0 ) + { counters[pos_state] = count; + update_low_mid_prices( pos_state ); high_pending = true; } + if( high_pending && len_symbols > len_low_symbols + len_mid_symbols ) + update_high_prices(); + } + + int price( const int len, const int pos_state ) const + { return prices[pos_state][len - min_match_len]; } + }; + + +class LZ_encoder : public LZ_encoder_base + { + struct Pair // distance-length pair + { + int dis; + int len; + }; + + enum { infinite_price = 0x0FFFFFFF, + max_num_trials = 1 << 13, + single_step_trial = -2, + dual_step_trial = -1 }; + + struct Trial + { + State state; + int price; // dual use var; cumulative price, match length + int dis4; // -1 for literal, or rep, or match distance + 4 + int prev_index; // index of prev trial in trials[] + int prev_index2; // -2 trial is single step + // -1 literal + rep0 + // >= 0 ( rep or match ) + literal + rep0 + int reps[num_rep_distances]; + + void update( const int pr, const int distance4, const int p_i ) + { + if( pr < price ) + { price = pr; dis4 = distance4; prev_index = p_i; + prev_index2 = single_step_trial; } + } + + void update2( const int pr, const int p_i ) + { + if( pr < price ) + { price = pr; dis4 = 0; prev_index = p_i; + prev_index2 = dual_step_trial; } + } + + void update3( const int pr, const int distance4, const int p_i, + const int p_i2 ) + { + if( pr < price ) + { price = pr; dis4 = distance4; prev_index = p_i; + prev_index2 = p_i2; } + } + }; + + const int cycles; + const int match_len_limit; + Len_prices match_len_prices; + Len_prices rep_len_prices; + int pending_num_pairs; + Pair pairs[max_match_len+1]; + Trial trials[max_num_trials]; + + int dis_slot_prices[len_states][2*max_dictionary_bits]; + int dis_prices[len_states][modeled_distances]; + int align_prices[dis_align_size]; + const int num_dis_slots; + + bool dec_pos( const int ahead ) + { + if( ahead < 0 || pos < ahead ) return false; + pos -= ahead; + if( cyclic_pos < ahead ) cyclic_pos += dictionary_size + 1; + cyclic_pos -= ahead; + return true; + } + + int get_match_pairs( Pair * pairs = 0 ); + void update_distance_prices(); + + // move-to-front dis in/into reps; do nothing if( dis4 <= 0 ) + static void mtf_reps( const int dis4, int reps[num_rep_distances] ) + { + if( dis4 >= num_rep_distances ) // match + { + reps[3] = reps[2]; reps[2] = reps[1]; reps[1] = reps[0]; + reps[0] = dis4 - num_rep_distances; + } + else if( dis4 > 0 ) // repeated match + { + const int distance = reps[dis4]; + for( int i = dis4; i > 0; --i ) reps[i] = reps[i-1]; + reps[0] = distance; + } + } + + int price_shortrep( const State state, const int pos_state ) const + { + return price0( bm_rep0[state()] ) + price0( bm_len[state()][pos_state] ); + } + + int price_rep( const int rep, const State state, const int pos_state ) const + { + if( rep == 0 ) return price0( bm_rep0[state()] ) + + price1( bm_len[state()][pos_state] ); + int price = price1( bm_rep0[state()] ); + if( rep == 1 ) + price += price0( bm_rep1[state()] ); + else + { + price += price1( bm_rep1[state()] ); + price += price_bit( bm_rep2[state()], rep - 2 ); + } + return price; + } + + int price_rep0_len( const int len, const State state, const int pos_state ) const + { + return price_rep( 0, state, pos_state ) + + rep_len_prices.price( len, pos_state ); + } + + int price_pair( const int dis, const int len, const int pos_state ) const + { + const int price = match_len_prices.price( len, pos_state ); + const int len_state = get_len_state( len ); + if( dis < modeled_distances ) + return price + dis_prices[len_state][dis]; + else + return price + dis_slot_prices[len_state][get_slot( dis )] + + align_prices[dis & (dis_align_size - 1)]; + } + + int read_match_distances() + { + const int num_pairs = get_match_pairs( pairs ); + if( num_pairs > 0 ) + { + const int len = pairs[num_pairs-1].len; + if( len == match_len_limit && len < max_match_len ) + pairs[num_pairs-1].len = + true_match_len( len, pairs[num_pairs-1].dis + 1 ); + } + return num_pairs; + } + + void move_and_update( int n ) + { + while( true ) + { + move_pos(); + if( --n <= 0 ) break; + get_match_pairs(); + } + } + + void backward( int cur ) + { + int dis4 = trials[cur].dis4; + while( cur > 0 ) + { + const int prev_index = trials[cur].prev_index; + Trial & prev_trial = trials[prev_index]; + + if( trials[cur].prev_index2 != single_step_trial ) + { + prev_trial.dis4 = -1; // literal + prev_trial.prev_index = prev_index - 1; + prev_trial.prev_index2 = single_step_trial; + if( trials[cur].prev_index2 >= 0 ) + { + Trial & prev_trial2 = trials[prev_index-1]; + prev_trial2.dis4 = dis4; dis4 = 0; // rep0 + prev_trial2.prev_index = trials[cur].prev_index2; + prev_trial2.prev_index2 = single_step_trial; + } + } + prev_trial.price = cur - prev_index; // len + cur = dis4; dis4 = prev_trial.dis4; prev_trial.dis4 = cur; + cur = prev_index; + } + } + + int sequence_optimizer( const int reps[num_rep_distances], + const State state ); + + enum { before_size = max_num_trials, + // bytes to keep in buffer after pos + after_size = ( 2 * max_match_len ) + 1, + dict_factor = 2, + num_prev_positions3 = 1 << 16, + num_prev_positions2 = 1 << 10, + num_prev_positions23 = num_prev_positions2 + num_prev_positions3, + pos_array_factor = 2 }; + +public: + LZ_encoder( const int dict_size, const int len_limit, + const int ifd, const int outfd ) + : + LZ_encoder_base( before_size, dict_size, after_size, dict_factor, + num_prev_positions23, pos_array_factor, ifd, outfd ), + cycles( ( len_limit < max_match_len ) ? 16 + ( len_limit / 2 ) : 256 ), + match_len_limit( len_limit ), + match_len_prices( match_len_model, match_len_limit ), + rep_len_prices( rep_len_model, match_len_limit ), + pending_num_pairs( 0 ), + num_dis_slots( 2 * real_bits( dictionary_size - 1 ) ) + { + trials[1].prev_index = 0; + trials[1].prev_index2 = single_step_trial; + } + + void reset() + { + LZ_encoder_base::reset(); + match_len_prices.reset(); + rep_len_prices.reset(); + pending_num_pairs = 0; + } + + bool encode_member( const unsigned long long member_size ); + }; diff --git a/encoder_base.cc b/encoder_base.cc new file mode 100644 index 0000000..a239d1f --- /dev/null +++ b/encoder_base.cc @@ -0,0 +1,193 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +#define _FILE_OFFSET_BITS 64 + +#include <algorithm> +#include <cerrno> +#include <cstdlib> +#include <cstring> +#include <new> +#include <string> +#include <vector> +#include <stdint.h> + +#include "lzip.h" +#include "encoder_base.h" + + +Dis_slots dis_slots; +Prob_prices prob_prices; + + +bool Matchfinder_base::read_block() + { + if( !at_stream_end && stream_pos < buffer_size ) + { + const int size = buffer_size - stream_pos; + const int rd = readblock( infd, buffer + stream_pos, size ); + stream_pos += rd; + if( rd != size && errno ) throw Error( "Read error" ); + if( rd < size ) { at_stream_end = true; pos_limit = buffer_size; } + } + return pos < stream_pos; + } + + +void Matchfinder_base::normalize_pos() + { + if( pos > stream_pos ) + internal_error( "pos > stream_pos in normalize_pos." ); + if( !at_stream_end ) + { + // offset is int32_t for the std::min below + const int32_t offset = pos - before_size - dictionary_size; + const int size = stream_pos - offset; + std::memmove( buffer, buffer + offset, size ); + partial_data_pos += offset; + pos -= offset; // pos = before_size + dictionary_size + stream_pos -= offset; + for( int i = 0; i < num_prev_positions; ++i ) + prev_positions[i] -= std::min( prev_positions[i], offset ); + for( int i = 0; i < pos_array_size; ++i ) + pos_array[i] -= std::min( pos_array[i], offset ); + read_block(); + } + } + + +Matchfinder_base::Matchfinder_base( const int before_size_, + const int dict_size, const int after_size, + const int dict_factor, const int num_prev_positions23_, + const int pos_array_factor, const int ifd ) + : + partial_data_pos( 0 ), + before_size( before_size_ ), + pos( 0 ), + cyclic_pos( 0 ), + stream_pos( 0 ), + num_prev_positions23( num_prev_positions23_ ), + infd( ifd ), + at_stream_end( false ) + { + const int buffer_size_limit = + ( dict_factor * dict_size ) + before_size + after_size; + buffer_size = std::max( 65536, dict_size ); + buffer = (uint8_t *)std::malloc( buffer_size ); + if( !buffer ) throw std::bad_alloc(); + if( read_block() && !at_stream_end && buffer_size < buffer_size_limit ) + { + uint8_t * const tmp = (uint8_t *)std::realloc( buffer, buffer_size_limit ); + if( !tmp ) { std::free( buffer ); throw std::bad_alloc(); } + buffer = tmp; + buffer_size = buffer_size_limit; + read_block(); + } + if( at_stream_end && stream_pos < dict_size ) + dictionary_size = std::max( (int)min_dictionary_size, stream_pos ); + else + dictionary_size = dict_size; + pos_limit = buffer_size; + if( !at_stream_end ) pos_limit -= after_size; + unsigned size = 1 << std::max( 16, real_bits( dictionary_size - 1 ) - 2 ); + if( dictionary_size > 1 << 26 ) size >>= 1; // 64 MiB + key4_mask = size - 1; // increases with dictionary size + size += num_prev_positions23; + num_prev_positions = size; + + pos_array_size = pos_array_factor * ( dictionary_size + 1 ); + size += pos_array_size; + if( size * sizeof prev_positions[0] <= size ) prev_positions = 0; + else prev_positions = new( std::nothrow ) int32_t[size]; + if( !prev_positions ) { std::free( buffer ); throw std::bad_alloc(); } + pos_array = prev_positions + num_prev_positions; + for( int i = 0; i < num_prev_positions; ++i ) prev_positions[i] = 0; + } + + +void Matchfinder_base::reset() + { + if( stream_pos > pos ) + std::memmove( buffer, buffer + pos, stream_pos - pos ); + partial_data_pos = 0; + stream_pos -= pos; + pos = 0; + cyclic_pos = 0; + read_block(); + if( at_stream_end && stream_pos < dictionary_size ) + { + dictionary_size = std::max( (int)min_dictionary_size, stream_pos ); + int size = 1 << std::max( 16, real_bits( dictionary_size - 1 ) - 2 ); + if( dictionary_size > 1 << 26 ) size >>= 1; // 64 MiB + key4_mask = size - 1; + size += num_prev_positions23; + num_prev_positions = size; + pos_array = prev_positions + num_prev_positions; + } + for( int i = 0; i < num_prev_positions; ++i ) prev_positions[i] = 0; + } + + +void Range_encoder::flush_data() + { + if( pos > 0 ) + { + if( outfd >= 0 && writeblock( outfd, buffer, pos ) != pos ) + throw Error( "Write error" ); + partial_member_pos += pos; + pos = 0; + show_cprogress(); + } + } + + +// End Of Stream marker => (dis == 0xFFFFFFFFU, len == min_match_len) +void LZ_encoder_base::full_flush( const State state ) + { + const int pos_state = data_position() & pos_state_mask; + renc.encode_bit( bm_match[state()][pos_state], 1 ); + renc.encode_bit( bm_rep[state()], 0 ); + encode_pair( 0xFFFFFFFFU, min_match_len, pos_state ); + renc.flush(); + Lzip_trailer trailer; + trailer.data_crc( crc() ); + trailer.data_size( data_position() ); + trailer.member_size( renc.member_position() + Lzip_trailer::size ); + for( int i = 0; i < Lzip_trailer::size; ++i ) + renc.put_byte( trailer.data[i] ); + renc.flush_data(); + } + + +void LZ_encoder_base::reset() + { + Matchfinder_base::reset(); + crc_ = 0xFFFFFFFFU; + bm_literal[0][0].reset( (1 << literal_context_bits) * 0x300 ); + bm_match[0][0].reset( State::states * pos_states ); + bm_rep[0].reset( State::states ); + bm_rep0[0].reset( State::states ); + bm_rep1[0].reset( State::states ); + bm_rep2[0].reset( State::states ); + bm_len[0][0].reset( State::states * pos_states ); + bm_dis_slot[0][0].reset( len_states * (1 << dis_slot_bits) ); + bm_dis[0].reset( modeled_distances - end_dis_model + 1 ); + bm_align[0].reset( dis_align_size ); + match_len_model.reset(); + rep_len_model.reset(); + renc.reset( dictionary_size ); + } diff --git a/encoder_base.h b/encoder_base.h new file mode 100644 index 0000000..b3dd9e6 --- /dev/null +++ b/encoder_base.h @@ -0,0 +1,483 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +enum { price_shift_bits = 6, + price_step_bits = 2, + price_step = 1 << price_step_bits }; + +class Dis_slots + { + uint8_t data[1<<10]; + +public: + void init() + { + for( int slot = 0; slot < 4; ++slot ) data[slot] = slot; + for( int i = 4, size = 2, slot = 4; slot < 20; slot += 2 ) + { + std::memset( &data[i], slot, size ); + std::memset( &data[i+size], slot + 1, size ); + size <<= 1; + i += size; + } + } + + uint8_t operator[]( const int dis ) const { return data[dis]; } + }; + +extern Dis_slots dis_slots; + +inline uint8_t get_slot( const unsigned dis ) + { + if( dis < (1 << 10) ) return dis_slots[dis]; + if( dis < (1 << 19) ) return dis_slots[dis>> 9] + 18; + if( dis < (1 << 28) ) return dis_slots[dis>>18] + 36; + return dis_slots[dis>>27] + 54; + } + + +class Prob_prices + { + short data[bit_model_total >> price_step_bits]; + +public: + void init() + { + for( int i = 0; i < bit_model_total >> price_step_bits; ++i ) + { + unsigned val = ( i * price_step ) + ( price_step / 2 ); + int bits = 0; // base 2 logarithm of val + for( int j = 0; j < price_shift_bits; ++j ) + { + val = val * val; + bits <<= 1; + while( val >= 1 << 16 ) { val >>= 1; ++bits; } + } + bits += 15; // remaining bits in val + data[i] = ( bit_model_total_bits << price_shift_bits ) - bits; + } + } + + int operator[]( const int probability ) const + { return data[probability >> price_step_bits]; } + }; + +extern Prob_prices prob_prices; + + +inline int price0( const Bit_model bm ) + { return prob_prices[bm.probability]; } + +inline int price1( const Bit_model bm ) + { return prob_prices[bit_model_total - bm.probability]; } + +inline int price_bit( const Bit_model bm, const bool bit ) + { return ( bit ? price1( bm ) : price0( bm ) ); } + + +inline int price_symbol3( const Bit_model bm[], int symbol ) + { + bool bit = symbol & 1; + symbol |= 8; symbol >>= 1; + int price = price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + return price + price_bit( bm[1], symbol & 1 ); + } + + +inline int price_symbol6( const Bit_model bm[], unsigned symbol ) + { + bool bit = symbol & 1; + symbol |= 64; symbol >>= 1; + int price = price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + return price + price_bit( bm[1], symbol & 1 ); + } + + +inline int price_symbol8( const Bit_model bm[], int symbol ) + { + bool bit = symbol & 1; + symbol |= 0x100; symbol >>= 1; + int price = price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + bit = symbol & 1; symbol >>= 1; price += price_bit( bm[symbol], bit ); + return price + price_bit( bm[1], symbol & 1 ); + } + + +inline int price_symbol_reversed( const Bit_model bm[], int symbol, + const int num_bits ) + { + int price = 0; + int model = 1; + for( int i = num_bits; i > 0; --i ) + { + const bool bit = symbol & 1; + symbol >>= 1; + price += price_bit( bm[model], bit ); + model <<= 1; model |= bit; + } + return price; + } + + +inline int price_matched( const Bit_model bm[], unsigned symbol, + unsigned match_byte ) + { + int price = 0; + unsigned mask = 0x100; + symbol |= mask; + while( true ) + { + const unsigned match_bit = ( match_byte <<= 1 ) & mask; + const bool bit = ( symbol <<= 1 ) & 0x100; + price += price_bit( bm[(symbol>>9)+match_bit+mask], bit ); + if( symbol >= 0x10000 ) return price; + mask &= ~(match_bit ^ symbol); // if( match_bit != bit ) mask = 0; + } + } + + +class Matchfinder_base + { + bool read_block(); + void normalize_pos(); + + Matchfinder_base( const Matchfinder_base & ); // declared as private + void operator=( const Matchfinder_base & ); // declared as private + +protected: + unsigned long long partial_data_pos; + uint8_t * buffer; // input buffer + int32_t * prev_positions; // 1 + last seen position of key. else 0 + int32_t * pos_array; // may be tree or chain + const int before_size; // bytes to keep in buffer before dictionary + int buffer_size; + int dictionary_size; // bytes to keep in buffer before pos + int pos; // current pos in buffer + int cyclic_pos; // cycles through [0, dictionary_size] + int stream_pos; // first byte not yet read from file + int pos_limit; // when reached, a new block must be read + int key4_mask; + const int num_prev_positions23; + int num_prev_positions; // size of prev_positions + int pos_array_size; + const int infd; // input file descriptor + bool at_stream_end; // stream_pos shows real end of file + + Matchfinder_base( const int before_size_, + const int dict_size, const int after_size, + const int dict_factor, const int num_prev_positions23_, + const int pos_array_factor, const int ifd ); + + ~Matchfinder_base() + { delete[] prev_positions; std::free( buffer ); } + +public: + uint8_t peek( const int distance ) const { return buffer[pos-distance]; } + int available_bytes() const { return stream_pos - pos; } + unsigned long long data_position() const { return partial_data_pos + pos; } + bool data_finished() const { return at_stream_end && pos >= stream_pos; } + const uint8_t * ptr_to_current_pos() const { return buffer + pos; } + + int true_match_len( const int index, const int distance ) const + { + const uint8_t * const data = buffer + pos; + int i = index; + const int len_limit = std::min( available_bytes(), (int)max_match_len ); + while( i < len_limit && data[i-distance] == data[i] ) ++i; + return i; + } + + void move_pos() + { + if( ++cyclic_pos > dictionary_size ) cyclic_pos = 0; + if( ++pos >= pos_limit ) normalize_pos(); + } + + void reset(); + }; + + +class Range_encoder + { + enum { buffer_size = 65536 }; + uint64_t low; + unsigned long long partial_member_pos; + uint8_t * const buffer; // output buffer + int pos; // current pos in buffer + uint32_t range; + unsigned ff_count; + const int outfd; // output file descriptor + uint8_t cache; + Lzip_header header; + + void shift_low() + { + if( low >> 24 != 0xFF ) + { + const bool carry = ( low > 0xFFFFFFFFU ); + put_byte( cache + carry ); + for( ; ff_count > 0; --ff_count ) put_byte( 0xFF + carry ); + cache = low >> 24; + } + else ++ff_count; + low = ( low & 0x00FFFFFFU ) << 8; + } + + Range_encoder( const Range_encoder & ); // declared as private + void operator=( const Range_encoder & ); // declared as private + +public: + void reset( const unsigned dictionary_size ) + { + low = 0; + partial_member_pos = 0; + pos = 0; + range = 0xFFFFFFFFU; + ff_count = 0; + cache = 0; + header.dictionary_size( dictionary_size ); + for( int i = 0; i < Lzip_header::size; ++i ) + put_byte( header.data[i] ); + } + + Range_encoder( const unsigned dictionary_size, const int ofd ) + : + buffer( new uint8_t[buffer_size] ), outfd( ofd ) + { + header.set_magic(); + reset( dictionary_size ); + } + + ~Range_encoder() { delete[] buffer; } + + unsigned long long member_position() const + { return partial_member_pos + pos + ff_count; } + + void flush() { for( int i = 0; i < 5; ++i ) shift_low(); } + void flush_data(); + + void put_byte( const uint8_t b ) + { + buffer[pos] = b; + if( ++pos >= buffer_size ) flush_data(); + } + + void encode( const int symbol, const int num_bits ) + { + for( unsigned mask = 1 << ( num_bits - 1 ); mask > 0; mask >>= 1 ) + { + range >>= 1; + if( symbol & mask ) low += range; + if( range <= 0x00FFFFFFU ) { range <<= 8; shift_low(); } + } + } + + void encode_bit( Bit_model & bm, const bool bit ) + { + const uint32_t bound = ( range >> bit_model_total_bits ) * bm.probability; + if( !bit ) + { + range = bound; + bm.probability += + ( bit_model_total - bm.probability ) >> bit_model_move_bits; + } + else + { + low += bound; + range -= bound; + bm.probability -= bm.probability >> bit_model_move_bits; + } + if( range <= 0x00FFFFFFU ) { range <<= 8; shift_low(); } + } + + void encode_tree3( Bit_model bm[], const int symbol ) + { + bool bit = ( symbol >> 2 ) & 1; + encode_bit( bm[1], bit ); + int model = 2 | bit; + bit = ( symbol >> 1 ) & 1; + encode_bit( bm[model], bit ); model <<= 1; model |= bit; + encode_bit( bm[model], symbol & 1 ); + } + + void encode_tree6( Bit_model bm[], const unsigned symbol ) + { + bool bit = ( symbol >> 5 ) & 1; + encode_bit( bm[1], bit ); + int model = 2 | bit; + bit = ( symbol >> 4 ) & 1; + encode_bit( bm[model], bit ); model <<= 1; model |= bit; + bit = ( symbol >> 3 ) & 1; + encode_bit( bm[model], bit ); model <<= 1; model |= bit; + bit = ( symbol >> 2 ) & 1; + encode_bit( bm[model], bit ); model <<= 1; model |= bit; + bit = ( symbol >> 1 ) & 1; + encode_bit( bm[model], bit ); model <<= 1; model |= bit; + encode_bit( bm[model], symbol & 1 ); + } + + void encode_tree8( Bit_model bm[], const int symbol ) + { + int model = 1; + for( int i = 7; i >= 0; --i ) + { + const bool bit = ( symbol >> i ) & 1; + encode_bit( bm[model], bit ); + model <<= 1; model |= bit; + } + } + + void encode_tree_reversed( Bit_model bm[], int symbol, const int num_bits ) + { + int model = 1; + for( int i = num_bits; i > 0; --i ) + { + const bool bit = symbol & 1; + symbol >>= 1; + encode_bit( bm[model], bit ); + model <<= 1; model |= bit; + } + } + + void encode_matched( Bit_model bm[], unsigned symbol, unsigned match_byte ) + { + unsigned mask = 0x100; + symbol |= mask; + while( true ) + { + const unsigned match_bit = ( match_byte <<= 1 ) & mask; + const bool bit = ( symbol <<= 1 ) & 0x100; + encode_bit( bm[(symbol>>9)+match_bit+mask], bit ); + if( symbol >= 0x10000 ) break; + mask &= ~(match_bit ^ symbol); // if( match_bit != bit ) mask = 0; + } + } + + void encode_len( Len_model & lm, int symbol, const int pos_state ) + { + bool bit = ( ( symbol -= min_match_len ) >= len_low_symbols ); + encode_bit( lm.choice1, bit ); + if( !bit ) + encode_tree3( lm.bm_low[pos_state], symbol ); + else + { + bit = ( ( symbol -= len_low_symbols ) >= len_mid_symbols ); + encode_bit( lm.choice2, bit ); + if( !bit ) + encode_tree3( lm.bm_mid[pos_state], symbol ); + else + encode_tree8( lm.bm_high, symbol - len_mid_symbols ); + } + } + }; + + +class LZ_encoder_base : public Matchfinder_base + { +protected: + enum { max_marker_size = 16, + num_rep_distances = 4 }; // must be 4 + + uint32_t crc_; + + Bit_model bm_literal[1<<literal_context_bits][0x300]; + Bit_model bm_match[State::states][pos_states]; + Bit_model bm_rep[State::states]; + Bit_model bm_rep0[State::states]; + Bit_model bm_rep1[State::states]; + Bit_model bm_rep2[State::states]; + Bit_model bm_len[State::states][pos_states]; + Bit_model bm_dis_slot[len_states][1<<dis_slot_bits]; + Bit_model bm_dis[modeled_distances-end_dis_model+1]; + Bit_model bm_align[dis_align_size]; + Len_model match_len_model; + Len_model rep_len_model; + Range_encoder renc; + + LZ_encoder_base( const int before_size, const int dict_size, + const int after_size, const int dict_factor, + const int num_prev_positions23, + const int pos_array_factor, + const int ifd, const int outfd ) + : + Matchfinder_base( before_size, dict_size, after_size, dict_factor, + num_prev_positions23, pos_array_factor, ifd ), + crc_( 0xFFFFFFFFU ), + renc( dictionary_size, outfd ) + {} + + unsigned crc() const { return crc_ ^ 0xFFFFFFFFU; } + + int price_literal( const uint8_t prev_byte, const uint8_t symbol ) const + { return price_symbol8( bm_literal[get_lit_state(prev_byte)], symbol ); } + + int price_matched( const uint8_t prev_byte, const uint8_t symbol, + const uint8_t match_byte ) const + { return ::price_matched( bm_literal[get_lit_state(prev_byte)], symbol, + match_byte ); } + + void encode_literal( const uint8_t prev_byte, const uint8_t symbol ) + { renc.encode_tree8( bm_literal[get_lit_state(prev_byte)], symbol ); } + + void encode_matched( const uint8_t prev_byte, const uint8_t symbol, + const uint8_t match_byte ) + { renc.encode_matched( bm_literal[get_lit_state(prev_byte)], symbol, + match_byte ); } + + void encode_pair( const unsigned dis, const int len, const int pos_state ) + { + renc.encode_len( match_len_model, len, pos_state ); + const unsigned dis_slot = get_slot( dis ); + renc.encode_tree6( bm_dis_slot[get_len_state(len)], dis_slot ); + + if( dis_slot >= start_dis_model ) + { + const int direct_bits = ( dis_slot >> 1 ) - 1; + const unsigned base = ( 2 | ( dis_slot & 1 ) ) << direct_bits; + const unsigned direct_dis = dis - base; + + if( dis_slot < end_dis_model ) + renc.encode_tree_reversed( bm_dis + ( base - dis_slot ), + direct_dis, direct_bits ); + else + { + renc.encode( direct_dis >> dis_align_bits, direct_bits - dis_align_bits ); + renc.encode_tree_reversed( bm_align, direct_dis, dis_align_bits ); + } + } + } + + void full_flush( const State state ); + +public: + virtual ~LZ_encoder_base() {} + + unsigned long long member_position() const { return renc.member_position(); } + virtual void reset(); + + virtual bool encode_member( const unsigned long long member_size ) = 0; + }; diff --git a/fast_encoder.cc b/fast_encoder.cc new file mode 100644 index 0000000..06d7a05 --- /dev/null +++ b/fast_encoder.cc @@ -0,0 +1,184 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +#define _FILE_OFFSET_BITS 64 + +#include <algorithm> +#include <cerrno> +#include <cstdlib> +#include <cstring> +#include <string> +#include <vector> +#include <stdint.h> + +#include "lzip.h" +#include "encoder_base.h" +#include "fast_encoder.h" + + +int FLZ_encoder::longest_match_len( int * const distance ) + { + enum { len_limit = 16 }; + const int available = std::min( available_bytes(), (int)max_match_len ); + if( available < len_limit ) return 0; + + const uint8_t * const data = ptr_to_current_pos(); + key4 = ( ( key4 << 4 ) ^ data[3] ) & key4_mask; + const int pos1 = pos + 1; + int newpos1 = prev_positions[key4]; + prev_positions[key4] = pos1; + int32_t * ptr0 = pos_array + cyclic_pos; + int maxlen = 0; + + for( int count = 4; ; ) + { + int delta; + if( newpos1 <= 0 || --count < 0 || + ( delta = pos1 - newpos1 ) > dictionary_size ) { *ptr0 = 0; break; } + int32_t * const newptr = pos_array + + ( cyclic_pos - delta + + ( ( cyclic_pos >= delta ) ? 0 : dictionary_size + 1 ) ); + + if( data[maxlen-delta] == data[maxlen] ) + { + int len = 0; + while( len < available && data[len-delta] == data[len] ) ++len; + if( maxlen < len ) + { maxlen = len; *distance = delta - 1; + if( maxlen >= len_limit ) { *ptr0 = *newptr; break; } } + } + + *ptr0 = newpos1; + ptr0 = newptr; + newpos1 = *ptr0; + } + return maxlen; + } + + +bool FLZ_encoder::encode_member( const unsigned long long member_size ) + { + const unsigned long long member_size_limit = + member_size - Lzip_trailer::size - max_marker_size; + int rep = 0; + int reps[num_rep_distances]; + State state; + for( int i = 0; i < num_rep_distances; ++i ) reps[i] = 0; + + if( data_position() != 0 || renc.member_position() != Lzip_header::size ) + return false; // can be called only once + + if( !data_finished() ) // encode first byte + { + const uint8_t prev_byte = 0; + const uint8_t cur_byte = peek( 0 ); + renc.encode_bit( bm_match[state()][0], 0 ); + encode_literal( prev_byte, cur_byte ); + crc32.update_byte( crc_, cur_byte ); + reset_key4(); + update_and_move( 1 ); + } + + while( !data_finished() && renc.member_position() < member_size_limit ) + { + int match_distance; + const int main_len = longest_match_len( &match_distance ); + const int pos_state = data_position() & pos_state_mask; + int len = 0; + + for( int i = 0; i < num_rep_distances; ++i ) + { + const int tlen = true_match_len( 0, reps[i] + 1 ); + if( tlen > len ) { len = tlen; rep = i; } + } + if( len > min_match_len && len + 3 > main_len ) + { + crc32.update_buf( crc_, ptr_to_current_pos(), len ); + renc.encode_bit( bm_match[state()][pos_state], 1 ); + renc.encode_bit( bm_rep[state()], 1 ); + renc.encode_bit( bm_rep0[state()], rep != 0 ); + if( rep == 0 ) + renc.encode_bit( bm_len[state()][pos_state], 1 ); + else + { + renc.encode_bit( bm_rep1[state()], rep > 1 ); + if( rep > 1 ) + renc.encode_bit( bm_rep2[state()], rep > 2 ); + const int distance = reps[rep]; + for( int i = rep; i > 0; --i ) reps[i] = reps[i-1]; + reps[0] = distance; + } + state.set_rep(); + renc.encode_len( rep_len_model, len, pos_state ); + move_pos(); + update_and_move( len - 1 ); + continue; + } + + if( main_len > min_match_len ) + { + crc32.update_buf( crc_, ptr_to_current_pos(), main_len ); + renc.encode_bit( bm_match[state()][pos_state], 1 ); + renc.encode_bit( bm_rep[state()], 0 ); + state.set_match(); + for( int i = num_rep_distances - 1; i > 0; --i ) reps[i] = reps[i-1]; + reps[0] = match_distance; + encode_pair( match_distance, main_len, pos_state ); + move_pos(); + update_and_move( main_len - 1 ); + continue; + } + + const uint8_t prev_byte = peek( 1 ); + const uint8_t cur_byte = peek( 0 ); + const uint8_t match_byte = peek( reps[0] + 1 ); + move_pos(); + crc32.update_byte( crc_, cur_byte ); + + if( match_byte == cur_byte ) + { + const int short_rep_price = price1( bm_match[state()][pos_state] ) + + price1( bm_rep[state()] ) + + price0( bm_rep0[state()] ) + + price0( bm_len[state()][pos_state] ); + int price = price0( bm_match[state()][pos_state] ); + if( state.is_char() ) + price += price_literal( prev_byte, cur_byte ); + else + price += price_matched( prev_byte, cur_byte, match_byte ); + if( short_rep_price < price ) + { + renc.encode_bit( bm_match[state()][pos_state], 1 ); + renc.encode_bit( bm_rep[state()], 1 ); + renc.encode_bit( bm_rep0[state()], 0 ); + renc.encode_bit( bm_len[state()][pos_state], 0 ); + state.set_short_rep(); + continue; + } + } + + // literal byte + renc.encode_bit( bm_match[state()][pos_state], 0 ); + if( state.is_char_set_char() ) + encode_literal( prev_byte, cur_byte ); + else + encode_matched( prev_byte, cur_byte, match_byte ); + } + + full_flush( state ); + return true; + } diff --git a/fast_encoder.h b/fast_encoder.h new file mode 100644 index 0000000..c41f9e4 --- /dev/null +++ b/fast_encoder.h @@ -0,0 +1,61 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +class FLZ_encoder : public LZ_encoder_base + { + unsigned key4; // key made from latest 4 bytes + + void reset_key4() + { + key4 = 0; + for( int i = 0; i < 3 && i < available_bytes(); ++i ) + key4 = ( key4 << 4 ) ^ buffer[i]; + } + + int longest_match_len( int * const distance ); + + void update_and_move( int n ) + { + while( --n >= 0 ) + { + if( available_bytes() >= 4 ) + { + key4 = ( ( key4 << 4 ) ^ buffer[pos+3] ) & key4_mask; + pos_array[cyclic_pos] = prev_positions[key4]; + prev_positions[key4] = pos + 1; + } + move_pos(); + } + } + + enum { before_size = 0, + dict_size = 65536, + // bytes to keep in buffer after pos + after_size = max_match_len, + dict_factor = 16, + num_prev_positions23 = 0, + pos_array_factor = 1 }; + +public: + FLZ_encoder( const int ifd, const int outfd ) + : + LZ_encoder_base( before_size, dict_size, after_size, dict_factor, + num_prev_positions23, pos_array_factor, ifd, outfd ) + {} + + bool encode_member( const unsigned long long member_size ); + }; @@ -0,0 +1,113 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +#define _FILE_OFFSET_BITS 64 + +#include <cstdio> +#include <cstring> +#include <string> +#include <vector> +#include <stdint.h> +#include <unistd.h> +#include <sys/stat.h> + +#include "lzip.h" +#include "lzip_index.h" + + +namespace { + +void list_line( const unsigned long long uncomp_size, + const unsigned long long comp_size, + const char * const input_filename ) + { + if( uncomp_size > 0 ) + std::printf( "%14llu %14llu %6.2f%% %s\n", uncomp_size, comp_size, + 100.0 - ( ( 100.0 * comp_size ) / uncomp_size ), + input_filename ); + else + std::printf( "%14llu %14llu -INF%% %s\n", uncomp_size, comp_size, + input_filename ); + } + +} // end namespace + + +int list_files( const std::vector< std::string > & filenames, + const bool ignore_trailing, const bool loose_trailing ) + { + unsigned long long total_comp = 0, total_uncomp = 0; + int files = 0, retval = 0; + bool first_post = true; + bool stdin_used = false; + for( unsigned i = 0; i < filenames.size(); ++i ) + { + const bool from_stdin = ( filenames[i] == "-" ); + if( from_stdin ) { if( stdin_used ) continue; else stdin_used = true; } + const char * const input_filename = + from_stdin ? "(stdin)" : filenames[i].c_str(); + struct stat in_stats; // not used + const int infd = from_stdin ? STDIN_FILENO : + open_instream( input_filename, &in_stats, false, true ); + if( infd < 0 ) { set_retval( retval, 1 ); continue; } + + const Lzip_index lzip_index( infd, ignore_trailing, loose_trailing ); + close( infd ); + if( lzip_index.retval() != 0 ) + { + show_file_error( input_filename, lzip_index.error().c_str() ); + set_retval( retval, lzip_index.retval() ); + continue; + } + if( verbosity < 0 ) continue; + const unsigned long long udata_size = lzip_index.udata_size(); + const unsigned long long cdata_size = lzip_index.cdata_size(); + total_comp += cdata_size; total_uncomp += udata_size; ++files; + const long members = lzip_index.members(); + if( first_post ) + { + first_post = false; + if( verbosity >= 1 ) std::fputs( " dict memb trail ", stdout ); + std::fputs( " uncompressed compressed saved name\n", stdout ); + } + if( verbosity >= 1 ) + std::printf( "%s %5ld %6lld ", format_ds( lzip_index.dictionary_size() ), + members, lzip_index.file_size() - cdata_size ); + list_line( udata_size, cdata_size, input_filename ); + + if( verbosity >= 2 && members > 1 ) + { + std::fputs( " member data_pos data_size member_pos member_size\n", stdout ); + for( long i = 0; i < members; ++i ) + { + const Block & db = lzip_index.dblock( i ); + const Block & mb = lzip_index.mblock( i ); + std::printf( "%6ld %14llu %14llu %14llu %14llu\n", + i + 1, db.pos(), db.size(), mb.pos(), mb.size() ); + } + first_post = true; // reprint heading after list of members + } + std::fflush( stdout ); + } + if( verbosity >= 0 && files > 1 ) + { + if( verbosity >= 1 ) std::fputs( " ", stdout ); + list_line( total_uncomp, total_comp, "(totals)" ); + std::fflush( stdout ); + } + return retval; + } @@ -0,0 +1,358 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +class State + { + int st; + +public: + enum { states = 12 }; + State() : st( 0 ) {} + int operator()() const { return st; } + bool is_char() const { return st < 7; } + + void set_char() + { + static const int next[states] = { 0, 0, 0, 0, 1, 2, 3, 4, 5, 6, 4, 5 }; + st = next[st]; + } + bool is_char_set_char() { set_char(); return st < 4; } + void set_char_rep() { st = 8; } + void set_match() { st = ( st < 7 ) ? 7 : 10; } + void set_rep() { st = ( st < 7 ) ? 8 : 11; } + void set_short_rep() { st = ( st < 7 ) ? 9 : 11; } + }; + + +enum { + min_dictionary_bits = 12, + min_dictionary_size = 1 << min_dictionary_bits, // >= modeled_distances + max_dictionary_bits = 29, + max_dictionary_size = 1 << max_dictionary_bits, + min_member_size = 36, + literal_context_bits = 3, + literal_pos_state_bits = 0, // not used + pos_state_bits = 2, + pos_states = 1 << pos_state_bits, + pos_state_mask = pos_states - 1, + + len_states = 4, + dis_slot_bits = 6, + start_dis_model = 4, + end_dis_model = 14, + modeled_distances = 1 << (end_dis_model / 2), // 128 + dis_align_bits = 4, + dis_align_size = 1 << dis_align_bits, + + len_low_bits = 3, + len_mid_bits = 3, + len_high_bits = 8, + len_low_symbols = 1 << len_low_bits, + len_mid_symbols = 1 << len_mid_bits, + len_high_symbols = 1 << len_high_bits, + max_len_symbols = len_low_symbols + len_mid_symbols + len_high_symbols, + + min_match_len = 2, // must be 2 + max_match_len = min_match_len + max_len_symbols - 1, // 273 + min_match_len_limit = 5 }; + +inline int get_len_state( const int len ) + { return std::min( len - min_match_len, len_states - 1 ); } + +inline int get_lit_state( const uint8_t prev_byte ) + { return prev_byte >> ( 8 - literal_context_bits ); } + + +enum { bit_model_move_bits = 5, + bit_model_total_bits = 11, + bit_model_total = 1 << bit_model_total_bits }; + +struct Bit_model + { + int probability; + void reset() { probability = bit_model_total / 2; } + void reset( const int size ) + { for( int i = 0; i < size; ++i ) this[i].reset(); } + Bit_model() { reset(); } + }; + +struct Len_model + { + Bit_model choice1; + Bit_model choice2; + Bit_model bm_low[pos_states][len_low_symbols]; + Bit_model bm_mid[pos_states][len_mid_symbols]; + Bit_model bm_high[len_high_symbols]; + + void reset() + { + choice1.reset(); + choice2.reset(); + bm_low[0][0].reset( pos_states * len_low_symbols ); + bm_mid[0][0].reset( pos_states * len_mid_symbols ); + bm_high[0].reset( len_high_symbols ); + } + }; + + +// defined in main.cc +extern int verbosity; + +class Pretty_print // requires global var 'int verbosity' + { + std::string name_; + std::string padded_name; + const char * const stdin_name; + unsigned longest_name; + mutable bool first_post; + +public: + Pretty_print( const std::vector< std::string > & filenames ) + : stdin_name( "(stdin)" ), longest_name( 0 ), first_post( false ) + { + if( verbosity <= 0 ) return; + const unsigned stdin_name_len = std::strlen( stdin_name ); + for( unsigned i = 0; i < filenames.size(); ++i ) + { + const std::string & s = filenames[i]; + const unsigned len = ( s == "-" ) ? stdin_name_len : s.size(); + if( longest_name < len ) longest_name = len; + } + if( longest_name == 0 ) longest_name = stdin_name_len; + } + + void set_name( const std::string & filename ) + { + if( filename.size() && filename != "-" ) name_ = filename; + else name_ = stdin_name; + padded_name = " "; padded_name += name_; padded_name += ": "; + if( longest_name > name_.size() ) + padded_name.append( longest_name - name_.size(), ' ' ); + first_post = true; + } + + void reset() const { if( name_.size() ) first_post = true; } + const char * name() const { return name_.c_str(); } + void operator()( const char * const msg = 0 ) const; + }; + + +class CRC32 + { + uint32_t data[256]; // Table of CRCs of all 8-bit messages. + +public: + CRC32() + { + for( unsigned n = 0; n < 256; ++n ) + { + unsigned c = n; + for( int k = 0; k < 8; ++k ) + { if( c & 1 ) c = 0xEDB88320U ^ ( c >> 1 ); else c >>= 1; } + data[n] = c; + } + } + + uint32_t operator[]( const uint8_t byte ) const { return data[byte]; } + + void update_byte( uint32_t & crc, const uint8_t byte ) const + { crc = data[(crc^byte)&0xFF] ^ ( crc >> 8 ); } + + // about as fast as it is possible without messing with endianness + void update_buf( uint32_t & crc, const uint8_t * const buffer, + const int size ) const + { + uint32_t c = crc; + for( int i = 0; i < size; ++i ) + c = data[(c^buffer[i])&0xFF] ^ ( c >> 8 ); + crc = c; + } + }; + +extern const CRC32 crc32; + + +inline bool isvalid_ds( const unsigned dictionary_size ) + { return ( dictionary_size >= min_dictionary_size && + dictionary_size <= max_dictionary_size ); } + + +inline int real_bits( unsigned value ) + { + int bits = 0; + while( value > 0 ) { value >>= 1; ++bits; } + return bits; + } + + +const uint8_t lzip_magic[4] = { 0x4C, 0x5A, 0x49, 0x50 }; // "LZIP" + +struct Lzip_header + { + uint8_t data[6]; // 0-3 magic bytes + // 4 version + // 5 coded dictionary size + enum { size = 6 }; + + void set_magic() { std::memcpy( data, lzip_magic, 4 ); data[4] = 1; } + bool verify_magic() const + { return ( std::memcmp( data, lzip_magic, 4 ) == 0 ); } + + bool verify_prefix( const int sz ) const // detect (truncated) header + { + for( int i = 0; i < sz && i < 4; ++i ) + if( data[i] != lzip_magic[i] ) return false; + return ( sz > 0 ); + } + + bool verify_corrupt() const // detect corrupt header + { + int matches = 0; + for( int i = 0; i < 4; ++i ) + if( data[i] == lzip_magic[i] ) ++matches; + return ( matches > 1 && matches < 4 ); + } + + uint8_t version() const { return data[4]; } + bool verify_version() const { return ( data[4] == 1 ); } + + unsigned dictionary_size() const + { + unsigned sz = ( 1 << ( data[5] & 0x1F ) ); + if( sz > min_dictionary_size ) + sz -= ( sz / 16 ) * ( ( data[5] >> 5 ) & 7 ); + return sz; + } + + bool dictionary_size( const unsigned sz ) + { + if( !isvalid_ds( sz ) ) return false; + data[5] = real_bits( sz - 1 ); + if( sz > min_dictionary_size ) + { + const unsigned base_size = 1 << data[5]; + const unsigned fraction = base_size / 16; + for( unsigned i = 7; i >= 1; --i ) + if( base_size - ( i * fraction ) >= sz ) + { data[5] |= ( i << 5 ); break; } + } + return true; + } + + bool verify() const + { return verify_magic() && verify_version() && + isvalid_ds( dictionary_size() ); } + }; + + +struct Lzip_trailer + { + uint8_t data[20]; // 0-3 CRC32 of the uncompressed data + // 4-11 size of the uncompressed data + // 12-19 member size including header and trailer + enum { size = 20 }; + + unsigned data_crc() const + { + unsigned tmp = 0; + for( int i = 3; i >= 0; --i ) { tmp <<= 8; tmp += data[i]; } + return tmp; + } + + void data_crc( unsigned crc ) + { for( int i = 0; i <= 3; ++i ) { data[i] = (uint8_t)crc; crc >>= 8; } } + + unsigned long long data_size() const + { + unsigned long long tmp = 0; + for( int i = 11; i >= 4; --i ) { tmp <<= 8; tmp += data[i]; } + return tmp; + } + + void data_size( unsigned long long sz ) + { for( int i = 4; i <= 11; ++i ) { data[i] = (uint8_t)sz; sz >>= 8; } } + + unsigned long long member_size() const + { + unsigned long long tmp = 0; + for( int i = 19; i >= 12; --i ) { tmp <<= 8; tmp += data[i]; } + return tmp; + } + + void member_size( unsigned long long sz ) + { for( int i = 12; i <= 19; ++i ) { data[i] = (uint8_t)sz; sz >>= 8; } } + + bool verify_consistency() const // check internal consistency + { + const unsigned crc = data_crc(); + const unsigned long long dsize = data_size(); + if( ( crc == 0 ) != ( dsize == 0 ) ) return false; + const unsigned long long msize = member_size(); + if( msize < min_member_size ) return false; + const unsigned long long mlimit = ( 9 * dsize + 7 ) / 8 + min_member_size; + if( mlimit > dsize && msize > mlimit ) return false; + const unsigned long long dlimit = 7090 * ( msize - 26 ) - 1; + if( dlimit > msize && dsize > dlimit ) return false; + return true; + } + }; + + +struct Error + { + const char * const msg; + explicit Error( const char * const s ) : msg( s ) {} + }; + +inline void set_retval( int & retval, const int new_val ) + { if( retval < new_val ) retval = new_val; } + +const char * const bad_magic_msg = "Bad magic number (file not in lzip format)."; +const char * const bad_dict_msg = "Invalid dictionary size in member header."; +const char * const corrupt_mm_msg = "Corrupt header in multimember file."; +const char * const trailing_msg = "Trailing data not allowed."; + +// defined in decoder.cc +int readblock( const int fd, uint8_t * const buf, const int size ); +int writeblock( const int fd, const uint8_t * const buf, const int size ); + +// defined in list.cc +int list_files( const std::vector< std::string > & filenames, + const bool ignore_trailing, const bool loose_trailing ); + +// defined in main.cc +struct stat; +const char * bad_version( const unsigned version ); +const char * format_ds( const unsigned dictionary_size ); +void show_header( const unsigned dictionary_size ); +int open_instream( const char * const name, struct stat * const in_statsp, + const bool one_to_one, const bool reg_only = false ); +void show_error( const char * const msg, const int errcode = 0, + const bool help = false ); +void show_file_error( const char * const filename, const char * const msg, + const int errcode = 0 ); +void internal_error( const char * const msg ); +class Matchfinder_base; +void show_cprogress( const unsigned long long cfile_size = 0, + const unsigned long long partial_size = 0, + const Matchfinder_base * const m = 0, + const Pretty_print * const p = 0 ); +class Range_decoder; +void show_dprogress( const unsigned long long cfile_size = 0, + const unsigned long long partial_size = 0, + const Range_decoder * const d = 0, + const Pretty_print * const p = 0 ); diff --git a/lzip_index.cc b/lzip_index.cc new file mode 100644 index 0000000..b64fb36 --- /dev/null +++ b/lzip_index.cc @@ -0,0 +1,214 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +#define _FILE_OFFSET_BITS 64 + +#include <algorithm> +#include <cerrno> +#include <cstdio> +#include <cstring> +#include <string> +#include <vector> +#include <stdint.h> +#include <unistd.h> + +#include "lzip.h" +#include "lzip_index.h" + + +namespace { + +int seek_read( const int fd, uint8_t * const buf, const int size, + const long long pos ) + { + if( lseek( fd, pos, SEEK_SET ) == pos ) + return readblock( fd, buf, size ); + return 0; + } + +} // end namespace + + +bool Lzip_index::check_header_error( const Lzip_header & header ) + { + if( !header.verify_magic() ) + { error_ = bad_magic_msg; retval_ = 2; return true; } + if( !header.verify_version() ) + { error_ = bad_version( header.version() ); retval_ = 2; return true; } + if( !isvalid_ds( header.dictionary_size() ) ) + { error_ = bad_dict_msg; retval_ = 2; return true; } + return false; + } + +void Lzip_index::set_errno_error( const char * const msg ) + { + error_ = msg; error_ += std::strerror( errno ); + retval_ = 1; + } + +void Lzip_index::set_num_error( const char * const msg, unsigned long long num ) + { + char buf[80]; + snprintf( buf, sizeof buf, "%s%llu", msg, num ); + error_ = buf; + retval_ = 2; + } + + +bool Lzip_index::read_header( const int fd, Lzip_header & header, + const long long pos ) + { + if( seek_read( fd, header.data, Lzip_header::size, pos ) != Lzip_header::size ) + { set_errno_error( "Error reading member header: " ); return false; } + return true; + } + + +// If successful, push last member and set pos to member header. +bool Lzip_index::skip_trailing_data( const int fd, unsigned long long & pos, + const bool ignore_trailing, + const bool loose_trailing ) + { + if( pos < min_member_size ) return false; + enum { block_size = 16384, + buffer_size = block_size + Lzip_trailer::size - 1 + Lzip_header::size }; + uint8_t buffer[buffer_size]; + int bsize = pos % block_size; // total bytes in buffer + if( bsize <= buffer_size - block_size ) bsize += block_size; + int search_size = bsize; // bytes to search for trailer + int rd_size = bsize; // bytes to read from file + unsigned long long ipos = pos - rd_size; // aligned to block_size + + while( true ) + { + if( seek_read( fd, buffer, rd_size, ipos ) != rd_size ) + { set_errno_error( "Error seeking member trailer: " ); return false; } + const uint8_t max_msb = ( ipos + search_size ) >> 56; + for( int i = search_size; i >= Lzip_trailer::size; --i ) + if( buffer[i-1] <= max_msb ) // most significant byte of member_size + { + const Lzip_trailer & trailer = + *(const Lzip_trailer *)( buffer + i - Lzip_trailer::size ); + const unsigned long long member_size = trailer.member_size(); + if( member_size == 0 ) // skip trailing zeros + { while( i > Lzip_trailer::size && buffer[i-9] == 0 ) --i; continue; } + if( member_size > ipos + i || !trailer.verify_consistency() ) + continue; + Lzip_header header; + if( !read_header( fd, header, ipos + i - member_size ) ) return false; + if( !header.verify() ) continue; + const Lzip_header & header2 = *(const Lzip_header *)( buffer + i ); + const bool full_h2 = bsize - i >= Lzip_header::size; + if( header2.verify_prefix( bsize - i ) ) // last member + { + if( !full_h2 ) error_ = "Last member in input file is truncated."; + else if( !check_header_error( header2 ) ) + error_ = "Last member in input file is truncated or corrupt."; + retval_ = 2; return false; + } + if( !loose_trailing && full_h2 && header2.verify_corrupt() ) + { error_ = corrupt_mm_msg; retval_ = 2; return false; } + if( !ignore_trailing ) + { error_ = trailing_msg; retval_ = 2; return false; } + pos = ipos + i - member_size; + const unsigned dictionary_size = header.dictionary_size(); + member_vector.push_back( Member( 0, trailer.data_size(), pos, + member_size, dictionary_size ) ); + if( dictionary_size_ < dictionary_size ) + dictionary_size_ = dictionary_size; + return true; + } + if( ipos == 0 ) + { set_num_error( "Bad trailer at pos ", pos - Lzip_trailer::size ); + return false; } + bsize = buffer_size; + search_size = bsize - Lzip_header::size; + rd_size = block_size; + ipos -= rd_size; + std::memcpy( buffer + rd_size, buffer, buffer_size - rd_size ); + } + } + + +Lzip_index::Lzip_index( const int infd, const bool ignore_trailing, + const bool loose_trailing ) + : insize( lseek( infd, 0, SEEK_END ) ), retval_( 0 ), dictionary_size_( 0 ) + { + if( insize < 0 ) + { set_errno_error( "Input file is not seekable: " ); return; } + if( insize < min_member_size ) + { error_ = "Input file is too short."; retval_ = 2; return; } + if( insize > INT64_MAX ) + { error_ = "Input file is too long (2^63 bytes or more)."; + retval_ = 2; return; } + + Lzip_header header; + if( !read_header( infd, header, 0 ) ) return; + if( check_header_error( header ) ) return; + + unsigned long long pos = insize; // always points to a header or to EOF + while( pos >= min_member_size ) + { + Lzip_trailer trailer; + if( seek_read( infd, trailer.data, Lzip_trailer::size, + pos - Lzip_trailer::size ) != Lzip_trailer::size ) + { set_errno_error( "Error reading member trailer: " ); break; } + const unsigned long long member_size = trailer.member_size(); + if( member_size > pos || !trailer.verify_consistency() ) // bad trailer + { + if( member_vector.empty() ) + { if( skip_trailing_data( infd, pos, ignore_trailing, loose_trailing ) ) + continue; else return; } + set_num_error( "Bad trailer at pos ", pos - Lzip_trailer::size ); + break; + } + if( !read_header( infd, header, pos - member_size ) ) break; + if( !header.verify() ) // bad header + { + if( member_vector.empty() ) + { if( skip_trailing_data( infd, pos, ignore_trailing, loose_trailing ) ) + continue; else return; } + set_num_error( "Bad header at pos ", pos - member_size ); + break; + } + pos -= member_size; + const unsigned dictionary_size = header.dictionary_size(); + member_vector.push_back( Member( 0, trailer.data_size(), pos, + member_size, dictionary_size ) ); + if( dictionary_size_ < dictionary_size ) + dictionary_size_ = dictionary_size; + } + if( pos != 0 || member_vector.empty() ) + { + member_vector.clear(); + if( retval_ == 0 ) { error_ = "Can't create file index."; retval_ = 2; } + return; + } + std::reverse( member_vector.begin(), member_vector.end() ); + for( unsigned long i = 0; ; ++i ) + { + const long long end = member_vector[i].dblock.end(); + if( end < 0 || end > INT64_MAX ) + { + member_vector.clear(); + error_ = "Data in input file is too long (2^63 bytes or more)."; + retval_ = 2; return; + } + if( i + 1 >= member_vector.size() ) break; + member_vector[i+1].dblock.pos( end ); + } + } diff --git a/lzip_index.h b/lzip_index.h new file mode 100644 index 0000000..442512d --- /dev/null +++ b/lzip_index.h @@ -0,0 +1,91 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ + +#ifndef INT64_MAX +#define INT64_MAX 0x7FFFFFFFFFFFFFFFLL +#endif + + +class Block + { + long long pos_, size_; // pos + size <= INT64_MAX + +public: + Block( const long long p, const long long s ) : pos_( p ), size_( s ) {} + + long long pos() const { return pos_; } + long long size() const { return size_; } + long long end() const { return pos_ + size_; } + + void pos( const long long p ) { pos_ = p; } + void size( const long long s ) { size_ = s; } + }; + + +class Lzip_index + { + struct Member + { + Block dblock, mblock; // data block, member block + unsigned dictionary_size; + + Member( const long long dp, const long long ds, + const long long mp, const long long ms, const unsigned dict_size ) + : dblock( dp, ds ), mblock( mp, ms ), dictionary_size( dict_size ) {} + }; + + std::vector< Member > member_vector; + std::string error_; + const long long insize; + int retval_; + unsigned dictionary_size_; // largest dictionary size in the file + + bool check_header_error( const Lzip_header & header ); + void set_errno_error( const char * const msg ); + void set_num_error( const char * const msg, unsigned long long num ); + bool read_header( const int fd, Lzip_header & header, const long long pos ); + bool skip_trailing_data( const int fd, unsigned long long & pos, + const bool ignore_trailing, const bool loose_trailing ); + +public: + Lzip_index( const int infd, const bool ignore_trailing, + const bool loose_trailing ); + + long members() const { return member_vector.size(); } + const std::string & error() const { return error_; } + int retval() const { return retval_; } + unsigned dictionary_size() const { return dictionary_size_; } + + long long udata_size() const + { if( member_vector.empty() ) return 0; + return member_vector.back().dblock.end(); } + + long long cdata_size() const + { if( member_vector.empty() ) return 0; + return member_vector.back().mblock.end(); } + + // total size including trailing data (if any) + long long file_size() const + { if( insize >= 0 ) return insize; else return 0; } + + const Block & dblock( const long i ) const + { return member_vector[i].dblock; } + const Block & mblock( const long i ) const + { return member_vector[i].mblock; } + unsigned dictionary_size( const long i ) const + { return member_vector[i].dictionary_size; } + }; @@ -0,0 +1,1071 @@ +/* Lzip - LZMA lossless data compressor + Copyright (C) 2008-2022 Antonio Diaz Diaz. + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. +*/ +/* + Exit status: 0 for a normal exit, 1 for environmental problems + (file not found, invalid flags, I/O errors, etc), 2 to indicate a + corrupt or invalid input file, 3 for an internal consistency error + (e.g., bug) which caused lzip to panic. +*/ + +#define _FILE_OFFSET_BITS 64 + +#include <algorithm> +#include <cctype> +#include <cerrno> +#include <climits> +#include <csignal> +#include <cstdio> +#include <cstdlib> +#include <cstring> +#include <new> +#include <string> +#include <vector> +#include <fcntl.h> +#include <stdint.h> +#include <unistd.h> +#include <utime.h> +#include <sys/stat.h> +#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__ +#include <io.h> +#if defined __MSVCRT__ +#define fchmod(x,y) 0 +#define fchown(x,y,z) 0 +#define strtoull std::strtoul +#define SIGHUP SIGTERM +#define S_ISSOCK(x) 0 +#ifndef S_IRGRP +#define S_IRGRP 0 +#define S_IWGRP 0 +#define S_IROTH 0 +#define S_IWOTH 0 +#endif +#endif +#if defined __DJGPP__ +#define S_ISSOCK(x) 0 +#define S_ISVTX 0 +#endif +#endif + +#include "arg_parser.h" +#include "lzip.h" +#include "decoder.h" +#include "encoder_base.h" +#include "encoder.h" +#include "fast_encoder.h" + +#ifndef O_BINARY +#define O_BINARY 0 +#endif + +#if CHAR_BIT != 8 +#error "Environments where CHAR_BIT != 8 are not supported." +#endif + +#if ( defined SIZE_MAX && SIZE_MAX < UINT_MAX ) || \ + ( defined SSIZE_MAX && SSIZE_MAX < INT_MAX ) +#error "Environments where 'size_t' is narrower than 'int' are not supported." +#endif + +int verbosity = 0; + +namespace { + +const char * const program_name = "lzip"; +const char * const program_year = "2022"; +const char * invocation_name = program_name; // default value + +const struct { const char * from; const char * to; } known_extensions[] = { + { ".lz", "" }, + { ".tlz", ".tar" }, + { 0, 0 } }; + +struct Lzma_options + { + int dictionary_size; // 4 KiB .. 512 MiB + int match_len_limit; // 5 .. 273 + }; + +enum Mode { m_compress, m_decompress, m_list, m_test }; + +/* Variables used in signal handler context. + They are not declared volatile because the handler never returns. */ +std::string output_filename; +int outfd = -1; +bool delete_output_on_interrupt = false; + + +void show_help() + { + std::printf( "Lzip is a lossless data compressor with a user interface similar to the one\n" + "of gzip or bzip2. Lzip uses a simplified form of the 'Lempel-Ziv-Markov\n" + "chain-Algorithm' (LZMA) stream format and provides a 3 factor integrity\n" + "checking to maximize interoperability and optimize safety. Lzip can compress\n" + "about as fast as gzip (lzip -0) or compress most files more than bzip2\n" + "(lzip -9). Decompression speed is intermediate between gzip and bzip2.\n" + "Lzip is better than gzip and bzip2 from a data recovery perspective. Lzip\n" + "has been designed, written, and tested with great care to replace gzip and\n" + "bzip2 as the standard general-purpose compressed format for unix-like\n" + "systems.\n" + "\nUsage: %s [options] [files]\n", invocation_name ); + std::printf( "\nOptions:\n" + " -h, --help display this help and exit\n" + " -V, --version output version information and exit\n" + " -a, --trailing-error exit with error status if trailing data\n" + " -b, --member-size=<bytes> set member size limit in bytes\n" + " -c, --stdout write to standard output, keep input files\n" + " -d, --decompress decompress\n" + " -f, --force overwrite existing output files\n" + " -F, --recompress force re-compression of compressed files\n" + " -k, --keep keep (don't delete) input files\n" + " -l, --list print (un)compressed file sizes\n" + " -m, --match-length=<bytes> set match length limit in bytes [36]\n" + " -o, --output=<file> write to <file>, keep input files\n" + " -q, --quiet suppress all messages\n" + " -s, --dictionary-size=<bytes> set dictionary size limit in bytes [8 MiB]\n" + " -S, --volume-size=<bytes> set volume size limit in bytes\n" + " -t, --test test compressed file integrity\n" + " -v, --verbose be verbose (a 2nd -v gives more)\n" + " -0 .. -9 set compression level [default 6]\n" + " --fast alias for -0\n" + " --best alias for -9\n" + " --loose-trailing allow trailing data seeming corrupt header\n" + "\nIf no file names are given, or if a file is '-', lzip compresses or\n" + "decompresses from standard input to standard output.\n" + "Numbers may be followed by a multiplier: k = kB = 10^3 = 1000,\n" + "Ki = KiB = 2^10 = 1024, M = 10^6, Mi = 2^20, G = 10^9, Gi = 2^30, etc...\n" + "Dictionary sizes 12 to 29 are interpreted as powers of two, meaning 2^12\n" + "to 2^29 bytes.\n" + "\nThe bidimensional parameter space of LZMA can't be mapped to a linear\n" + "scale optimal for all files. If your files are large, very repetitive,\n" + "etc, you may need to use the options --dictionary-size and --match-length\n" + "directly to achieve optimal performance.\n" + "\nTo extract all the files from archive 'foo.tar.lz', use the commands\n" + "'tar -xf foo.tar.lz' or 'lzip -cd foo.tar.lz | tar -xf -'.\n" + "\nExit status: 0 for a normal exit, 1 for environmental problems (file\n" + "not found, invalid flags, I/O errors, etc), 2 to indicate a corrupt or\n" + "invalid input file, 3 for an internal consistency error (e.g., bug) which\n" + "caused lzip to panic.\n" + "\nThe ideas embodied in lzip are due to (at least) the following people:\n" + "Abraham Lempel and Jacob Ziv (for the LZ algorithm), Andrey Markov (for the\n" + "definition of Markov chains), G.N.N. Martin (for the definition of range\n" + "encoding), Igor Pavlov (for putting all the above together in LZMA), and\n" + "Julian Seward (for bzip2's CLI).\n" + "\nReport bugs to lzip-bug@nongnu.org\n" + "Lzip home page: http://www.nongnu.org/lzip/lzip.html\n" ); + } + + +void show_version() + { + std::printf( "%s %s\n", program_name, PROGVERSION ); + std::printf( "Copyright (C) %s Antonio Diaz Diaz.\n", program_year ); + std::printf( "License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl.html>\n" + "This is free software: you are free to change and redistribute it.\n" + "There is NO WARRANTY, to the extent permitted by law.\n" ); + } + +} // end namespace + +void Pretty_print::operator()( const char * const msg ) const + { + if( verbosity < 0 ) return; + if( first_post ) + { + first_post = false; + std::fputs( padded_name.c_str(), stderr ); + if( !msg ) std::fflush( stderr ); + } + if( msg ) std::fprintf( stderr, "%s\n", msg ); + } + + +const char * bad_version( const unsigned version ) + { + static char buf[80]; + snprintf( buf, sizeof buf, "Version %u member format not supported.", + version ); + return buf; + } + + +const char * format_ds( const unsigned dictionary_size ) + { + enum { bufsize = 16, factor = 1024 }; + static char buf[bufsize]; + const char * const prefix[8] = + { "Ki", "Mi", "Gi", "Ti", "Pi", "Ei", "Zi", "Yi" }; + const char * p = ""; + const char * np = " "; + unsigned num = dictionary_size; + bool exact = ( num % factor == 0 ); + + for( int i = 0; i < 8 && ( num > 9999 || ( exact && num >= factor ) ); ++i ) + { num /= factor; if( num % factor != 0 ) exact = false; + p = prefix[i]; np = ""; } + snprintf( buf, bufsize, "%s%4u %sB", np, num, p ); + return buf; + } + + +void show_header( const unsigned dictionary_size ) + { + std::fprintf( stderr, "dict %s, ", format_ds( dictionary_size ) ); + } + +namespace { + +// separate large numbers >= 100_000 in groups of 3 digits using '_' +const char * format_num3( unsigned long long num ) + { + const char * const si_prefix = "kMGTPEZY"; + const char * const binary_prefix = "KMGTPEZY"; + enum { buffers = 8, bufsize = 4 * sizeof (long long) }; + static char buffer[buffers][bufsize]; // circle of static buffers for printf + static int current = 0; + + char * const buf = buffer[current++]; current %= buffers; + char * p = buf + bufsize - 1; // fill the buffer backwards + *p = 0; // terminator + if( num > 1024 ) + { + char prefix = 0; // try binary first, then si + for( int i = 0; i < 8 && num >= 1024 && num % 1024 == 0; ++i ) + { num /= 1024; prefix = binary_prefix[i]; } + if( prefix ) *(--p) = 'i'; + else + for( int i = 0; i < 8 && num >= 1000 && num % 1000 == 0; ++i ) + { num /= 1000; prefix = si_prefix[i]; } + if( prefix ) *(--p) = prefix; + } + const bool split = num >= 100000; + + for( int i = 0; ; ) + { + *(--p) = num % 10 + '0'; num /= 10; if( num == 0 ) break; + if( split && ++i >= 3 ) { i = 0; *(--p) = '_'; } + } + return p; + } + + +unsigned long long getnum( const char * const arg, + const char * const option_name, + const unsigned long long llimit, + const unsigned long long ulimit ) + { + char * tail; + errno = 0; + unsigned long long result = strtoull( arg, &tail, 0 ); + if( tail == arg ) + { + if( verbosity >= 0 ) + std::fprintf( stderr, "%s: Bad or missing numerical argument in " + "option '%s'.\n", program_name, option_name ); + std::exit( 1 ); + } + + if( !errno && tail[0] ) + { + const unsigned factor = ( tail[1] == 'i' ) ? 1024 : 1000; + int exponent = 0; // 0 = bad multiplier + switch( tail[0] ) + { + case 'Y': exponent = 8; break; + case 'Z': exponent = 7; break; + case 'E': exponent = 6; break; + case 'P': exponent = 5; break; + case 'T': exponent = 4; break; + case 'G': exponent = 3; break; + case 'M': exponent = 2; break; + case 'K': if( factor == 1024 ) exponent = 1; break; + case 'k': if( factor == 1000 ) exponent = 1; break; + } + if( exponent <= 0 ) + { + if( verbosity >= 0 ) + std::fprintf( stderr, "%s: Bad multiplier in numerical argument of " + "option '%s'.\n", program_name, option_name ); + std::exit( 1 ); + } + for( int i = 0; i < exponent; ++i ) + { + if( ulimit / factor >= result ) result *= factor; + else { errno = ERANGE; break; } + } + } + if( !errno && ( result < llimit || result > ulimit ) ) errno = ERANGE; + if( errno ) + { + if( verbosity >= 0 ) + std::fprintf( stderr, "%s: Numerical argument out of limits [%s,%s] " + "in option '%s'.\n", program_name, format_num3( llimit ), + format_num3( ulimit ), option_name ); + std::exit( 1 ); + } + return result; + } + + +int get_dict_size( const char * const arg, const char * const option_name ) + { + char * tail; + const long bits = std::strtol( arg, &tail, 0 ); + if( bits >= min_dictionary_bits && + bits <= max_dictionary_bits && *tail == 0 ) + return 1 << bits; + return getnum( arg, option_name, min_dictionary_size, max_dictionary_size ); + } + + +void set_mode( Mode & program_mode, const Mode new_mode ) + { + if( program_mode != m_compress && program_mode != new_mode ) + { + show_error( "Only one operation can be specified.", 0, true ); + std::exit( 1 ); + } + program_mode = new_mode; + } + + +int extension_index( const std::string & name ) + { + for( int eindex = 0; known_extensions[eindex].from; ++eindex ) + { + const std::string ext( known_extensions[eindex].from ); + if( name.size() > ext.size() && + name.compare( name.size() - ext.size(), ext.size(), ext ) == 0 ) + return eindex; + } + return -1; + } + + +void set_c_outname( const std::string & name, const bool filenames_given, + const bool force_ext, const bool multifile ) + { + /* zupdate < 1.9 depends on lzip adding the extension '.lz' to name when + reading from standard input. */ + output_filename = name; + if( multifile ) output_filename += "00001"; + if( force_ext || multifile || + ( !filenames_given && extension_index( output_filename ) < 0 ) ) + output_filename += known_extensions[0].from; + } + + +void set_d_outname( const std::string & name, const int eindex ) + { + if( eindex >= 0 ) + { + const std::string from( known_extensions[eindex].from ); + if( name.size() > from.size() ) + { + output_filename.assign( name, 0, name.size() - from.size() ); + output_filename += known_extensions[eindex].to; + return; + } + } + output_filename = name; output_filename += ".out"; + if( verbosity >= 1 ) + std::fprintf( stderr, "%s: Can't guess original name for '%s' -- using '%s'\n", + program_name, name.c_str(), output_filename.c_str() ); + } + +} // end namespace + +int open_instream( const char * const name, struct stat * const in_statsp, + const bool one_to_one, const bool reg_only ) + { + int infd = open( name, O_RDONLY | O_BINARY ); + if( infd < 0 ) + show_file_error( name, "Can't open input file", errno ); + else + { + const int i = fstat( infd, in_statsp ); + const mode_t mode = in_statsp->st_mode; + const bool can_read = ( i == 0 && !reg_only && + ( S_ISBLK( mode ) || S_ISCHR( mode ) || + S_ISFIFO( mode ) || S_ISSOCK( mode ) ) ); + if( i != 0 || ( !S_ISREG( mode ) && ( !can_read || one_to_one ) ) ) + { + if( verbosity >= 0 ) + std::fprintf( stderr, "%s: Input file '%s' is not a regular file%s.\n", + program_name, name, ( can_read && one_to_one ) ? + ",\n and neither '-c' nor '-o' were specified" : "" ); + close( infd ); + infd = -1; + } + } + return infd; + } + +namespace { + +int open_instream2( const char * const name, struct stat * const in_statsp, + const Mode program_mode, const int eindex, + const bool one_to_one, const bool recompress ) + { + if( program_mode == m_compress && !recompress && eindex >= 0 ) + { + if( verbosity >= 0 ) + std::fprintf( stderr, "%s: Input file '%s' already has '%s' suffix.\n", + program_name, name, known_extensions[eindex].from ); + return -1; + } + return open_instream( name, in_statsp, one_to_one, false ); + } + + +bool open_outstream( const bool force, const bool protect ) + { + const mode_t usr_rw = S_IRUSR | S_IWUSR; + const mode_t all_rw = usr_rw | S_IRGRP | S_IWGRP | S_IROTH | S_IWOTH; + const mode_t outfd_mode = protect ? usr_rw : all_rw; + int flags = O_CREAT | O_WRONLY | O_BINARY; + if( force ) flags |= O_TRUNC; else flags |= O_EXCL; + + outfd = open( output_filename.c_str(), flags, outfd_mode ); + if( outfd >= 0 ) delete_output_on_interrupt = true; + else if( verbosity >= 0 ) + { + if( errno == EEXIST ) + std::fprintf( stderr, "%s: Output file '%s' already exists, skipping.\n", + program_name, output_filename.c_str() ); + else + std::fprintf( stderr, "%s: Can't create output file '%s': %s\n", + program_name, output_filename.c_str(), std::strerror( errno ) ); + } + return ( outfd >= 0 ); + } + + +void set_signals( void (*action)(int) ) + { + std::signal( SIGHUP, action ); + std::signal( SIGINT, action ); + std::signal( SIGTERM, action ); + } + + +void cleanup_and_fail( const int retval ) + { + set_signals( SIG_IGN ); // ignore signals + if( delete_output_on_interrupt ) + { + delete_output_on_interrupt = false; + if( verbosity >= 0 ) + std::fprintf( stderr, "%s: Deleting output file '%s', if it exists.\n", + program_name, output_filename.c_str() ); + if( outfd >= 0 ) { close( outfd ); outfd = -1; } + if( std::remove( output_filename.c_str() ) != 0 && errno != ENOENT ) + show_error( "WARNING: deletion of output file (apparently) failed." ); + } + std::exit( retval ); + } + + +extern "C" void signal_handler( int ) + { + show_error( "Control-C or similar caught, quitting." ); + cleanup_and_fail( 1 ); + } + + +bool check_tty_in( const char * const input_filename, const int infd, + const Mode program_mode, int & retval ) + { + if( ( program_mode == m_decompress || program_mode == m_test ) && + isatty( infd ) ) // for example /dev/tty + { show_file_error( input_filename, + "I won't read compressed data from a terminal." ); + close( infd ); set_retval( retval, 2 ); + if( program_mode != m_test ) cleanup_and_fail( retval ); + return false; } + return true; + } + +bool check_tty_out( const Mode program_mode ) + { + if( program_mode == m_compress && isatty( outfd ) ) + { show_file_error( output_filename.size() ? + output_filename.c_str() : "(stdout)", + "I won't write compressed data to a terminal." ); + return false; } + return true; + } + + +// Set permissions, owner, and times. +void close_and_set_permissions( const struct stat * const in_statsp ) + { + bool warning = false; + if( in_statsp ) + { + const mode_t mode = in_statsp->st_mode; + // fchown will in many cases return with EPERM, which can be safely ignored. + if( fchown( outfd, in_statsp->st_uid, in_statsp->st_gid ) == 0 ) + { if( fchmod( outfd, mode ) != 0 ) warning = true; } + else + if( errno != EPERM || + fchmod( outfd, mode & ~( S_ISUID | S_ISGID | S_ISVTX ) ) != 0 ) + warning = true; + } + if( close( outfd ) != 0 ) + { + show_error( "Error closing output file", errno ); + cleanup_and_fail( 1 ); + } + outfd = -1; + delete_output_on_interrupt = false; + if( in_statsp ) + { + struct utimbuf t; + t.actime = in_statsp->st_atime; + t.modtime = in_statsp->st_mtime; + if( utime( output_filename.c_str(), &t ) != 0 ) warning = true; + } + if( warning && verbosity >= 1 ) + show_error( "Can't change output file attributes." ); + } + + +bool next_filename() + { + const unsigned name_len = output_filename.size(); + const unsigned ext_len = std::strlen( known_extensions[0].from ); + if( name_len >= ext_len + 5 ) // "*00001.lz" + for( int i = name_len - ext_len - 1, j = 0; j < 5; --i, ++j ) + { + if( output_filename[i] < '9' ) { ++output_filename[i]; return true; } + else output_filename[i] = '0'; + } + return false; + } + + +int compress( const unsigned long long cfile_size, + const unsigned long long member_size, + const unsigned long long volume_size, const int infd, + const Lzma_options & encoder_options, const Pretty_print & pp, + const struct stat * const in_statsp, const bool zero ) + { + int retval = 0; + LZ_encoder_base * encoder = 0; // polymorphic encoder + if( verbosity >= 1 ) pp(); + + if( zero ) + encoder = new FLZ_encoder( infd, outfd ); + else + { + Lzip_header header; + if( header.dictionary_size( encoder_options.dictionary_size ) && + encoder_options.match_len_limit >= min_match_len_limit && + encoder_options.match_len_limit <= max_match_len ) + encoder = new LZ_encoder( header.dictionary_size(), + encoder_options.match_len_limit, infd, outfd ); + else internal_error( "invalid argument to encoder." ); + } + + unsigned long long in_size = 0, out_size = 0, partial_volume_size = 0; + while( true ) // encode one member per iteration + { + const unsigned long long size = ( volume_size > 0 ) ? + std::min( member_size, volume_size - partial_volume_size ) : member_size; + show_cprogress( cfile_size, in_size, encoder, &pp ); // init + if( !encoder->encode_member( size ) ) + { pp( "Encoder error." ); retval = 1; break; } + in_size += encoder->data_position(); + out_size += encoder->member_position(); + if( encoder->data_finished() ) break; + if( volume_size > 0 ) + { + partial_volume_size += encoder->member_position(); + if( partial_volume_size >= volume_size - min_dictionary_size ) + { + partial_volume_size = 0; + if( delete_output_on_interrupt ) + { + close_and_set_permissions( in_statsp ); + if( !next_filename() ) + { pp( "Too many volume files." ); retval = 1; break; } + if( !open_outstream( true, in_statsp ) ) { retval = 1; break; } + } + } + } + encoder->reset(); + } + + if( retval == 0 && verbosity >= 1 ) + { + if( in_size == 0 || out_size == 0 ) + std::fputs( " no data compressed.\n", stderr ); + else + std::fprintf( stderr, "%6.3f:1, %5.2f%% ratio, %5.2f%% saved, " + "%llu in, %llu out.\n", + (double)in_size / out_size, + ( 100.0 * out_size ) / in_size, + 100.0 - ( ( 100.0 * out_size ) / in_size ), + in_size, out_size ); + } + delete encoder; + return retval; + } + + +unsigned char xdigit( const unsigned value ) + { + if( value <= 9 ) return '0' + value; + if( value <= 15 ) return 'A' + value - 10; + return 0; + } + + +bool show_trailing_data( const uint8_t * const data, const int size, + const Pretty_print & pp, const bool all, + const int ignore_trailing ) // -1 = show + { + if( verbosity >= 4 || ignore_trailing <= 0 ) + { + std::string msg; + if( !all ) msg = "first bytes of "; + msg += "trailing data = "; + for( int i = 0; i < size; ++i ) + { + msg += xdigit( data[i] >> 4 ); + msg += xdigit( data[i] & 0x0F ); + msg += ' '; + } + msg += '\''; + for( int i = 0; i < size; ++i ) + { if( std::isprint( data[i] ) ) msg += data[i]; else msg += '.'; } + msg += '\''; + pp( msg.c_str() ); + if( ignore_trailing == 0 ) show_file_error( pp.name(), trailing_msg ); + } + return ( ignore_trailing > 0 ); + } + + +int decompress( const unsigned long long cfile_size, const int infd, + const Pretty_print & pp, const bool ignore_trailing, + const bool loose_trailing, const bool testing ) + { + unsigned long long partial_file_pos = 0; + Range_decoder rdec( infd ); + int retval = 0; + + for( bool first_member = true; ; first_member = false ) + { + Lzip_header header; + rdec.reset_member_position(); + const int size = rdec.read_data( header.data, Lzip_header::size ); + if( rdec.finished() ) // End Of File + { + if( first_member ) + { show_file_error( pp.name(), "File ends unexpectedly at member header." ); + retval = 2; } + else if( header.verify_prefix( size ) ) + { pp( "Truncated header in multimember file." ); + show_trailing_data( header.data, size, pp, true, -1 ); + retval = 2; } + else if( size > 0 && !show_trailing_data( header.data, size, pp, + true, ignore_trailing ) ) + retval = 2; + break; + } + if( !header.verify_magic() ) + { + if( first_member ) + { show_file_error( pp.name(), bad_magic_msg ); retval = 2; } + else if( !loose_trailing && header.verify_corrupt() ) + { pp( corrupt_mm_msg ); + show_trailing_data( header.data, size, pp, false, -1 ); + retval = 2; } + else if( !show_trailing_data( header.data, size, pp, false, ignore_trailing ) ) + retval = 2; + break; + } + if( !header.verify_version() ) + { pp( bad_version( header.version() ) ); retval = 2; break; } + const unsigned dictionary_size = header.dictionary_size(); + if( !isvalid_ds( dictionary_size ) ) + { pp( bad_dict_msg ); retval = 2; break; } + + if( verbosity >= 2 || ( verbosity == 1 && first_member ) ) pp(); + + LZ_decoder decoder( rdec, dictionary_size, outfd ); + show_dprogress( cfile_size, partial_file_pos, &rdec, &pp ); // init + const int result = decoder.decode_member( pp ); + partial_file_pos += rdec.member_position(); + if( result != 0 ) + { + if( verbosity >= 0 && result <= 2 ) + { + pp(); + std::fprintf( stderr, "%s at pos %llu\n", ( result == 2 ) ? + "File ends unexpectedly" : "Decoder error", + partial_file_pos ); + } + retval = 2; break; + } + if( verbosity >= 2 ) + { std::fputs( testing ? "ok\n" : "done\n", stderr ); pp.reset(); } + } + if( verbosity == 1 && retval == 0 ) + std::fputs( testing ? "ok\n" : "done\n", stderr ); + return retval; + } + +} // end namespace + + +void show_error( const char * const msg, const int errcode, const bool help ) + { + if( verbosity < 0 ) return; + if( msg && msg[0] ) + std::fprintf( stderr, "%s: %s%s%s\n", program_name, msg, + ( errcode > 0 ) ? ": " : "", + ( errcode > 0 ) ? std::strerror( errcode ) : "" ); + if( help ) + std::fprintf( stderr, "Try '%s --help' for more information.\n", + invocation_name ); + } + + +void show_file_error( const char * const filename, const char * const msg, + const int errcode ) + { + if( verbosity >= 0 ) + std::fprintf( stderr, "%s: %s: %s%s%s\n", program_name, filename, msg, + ( errcode > 0 ) ? ": " : "", + ( errcode > 0 ) ? std::strerror( errcode ) : "" ); + } + + +void internal_error( const char * const msg ) + { + if( verbosity >= 0 ) + std::fprintf( stderr, "%s: internal error: %s\n", program_name, msg ); + std::exit( 3 ); + } + + +void show_cprogress( const unsigned long long cfile_size, + const unsigned long long partial_size, + const Matchfinder_base * const m, + const Pretty_print * const p ) + { + static unsigned long long csize = 0; // file_size / 100 + static unsigned long long psize = 0; + static const Matchfinder_base * mb = 0; + static const Pretty_print * pp = 0; + static bool enabled = true; + + if( !enabled ) return; + if( p ) // initialize static vars + { + if( verbosity < 2 || !isatty( STDERR_FILENO ) ) { enabled = false; return; } + csize = cfile_size; psize = partial_size; mb = m; pp = p; + } + if( mb && pp ) + { + const unsigned long long pos = psize + mb->data_position(); + if( csize > 0 ) + std::fprintf( stderr, "%4llu%% %.1f MB\r", pos / csize, pos / 1000000.0 ); + else + std::fprintf( stderr, " %.1f MB\r", pos / 1000000.0 ); + pp->reset(); (*pp)(); // restore cursor position + } + } + + +void show_dprogress( const unsigned long long cfile_size, + const unsigned long long partial_size, + const Range_decoder * const d, + const Pretty_print * const p ) + { + static unsigned long long csize = 0; // file_size / 100 + static unsigned long long psize = 0; + static const Range_decoder * rdec = 0; + static const Pretty_print * pp = 0; + static int counter = 0; + static bool enabled = true; + + if( !enabled ) return; + if( p ) // initialize static vars + { + if( verbosity < 2 || !isatty( STDERR_FILENO ) ) { enabled = false; return; } + csize = cfile_size; psize = partial_size; rdec = d; pp = p; counter = 0; + } + if( rdec && pp && --counter <= 0 ) + { + const unsigned long long pos = psize + rdec->member_position(); + counter = 7; // update display every 114688 bytes + if( csize > 0 ) + std::fprintf( stderr, "%4llu%% %.1f MB\r", pos / csize, pos / 1000000.0 ); + else + std::fprintf( stderr, " %.1f MB\r", pos / 1000000.0 ); + pp->reset(); (*pp)(); // restore cursor position + } + } + + +int main( const int argc, const char * const argv[] ) + { + /* Mapping from gzip/bzip2 style 1..9 compression modes + to the corresponding LZMA compression modes. */ + const Lzma_options option_mapping[] = + { + { 1 << 16, 16 }, // -0 + { 1 << 20, 5 }, // -1 + { 3 << 19, 6 }, // -2 + { 1 << 21, 8 }, // -3 + { 3 << 20, 12 }, // -4 + { 1 << 22, 20 }, // -5 + { 1 << 23, 36 }, // -6 + { 1 << 24, 68 }, // -7 + { 3 << 23, 132 }, // -8 + { 1 << 25, 273 } }; // -9 + Lzma_options encoder_options = option_mapping[6]; // default = "-6" + const unsigned long long max_member_size = 0x0008000000000000ULL; /* 2 PiB */ + const unsigned long long max_volume_size = 0x4000000000000000ULL; /* 4 EiB */ + unsigned long long member_size = max_member_size; + unsigned long long volume_size = 0; + std::string default_output_filename; + Mode program_mode = m_compress; + bool force = false; + bool ignore_trailing = true; + bool keep_input_files = false; + bool loose_trailing = false; + bool recompress = false; + bool to_stdout = false; + bool zero = false; + if( argc > 0 ) invocation_name = argv[0]; + + enum { opt_lt = 256 }; + const Arg_parser::Option options[] = + { + { '0', "fast", Arg_parser::no }, + { '1', 0, Arg_parser::no }, + { '2', 0, Arg_parser::no }, + { '3', 0, Arg_parser::no }, + { '4', 0, Arg_parser::no }, + { '5', 0, Arg_parser::no }, + { '6', 0, Arg_parser::no }, + { '7', 0, Arg_parser::no }, + { '8', 0, Arg_parser::no }, + { '9', "best", Arg_parser::no }, + { 'a', "trailing-error", Arg_parser::no }, + { 'b', "member-size", Arg_parser::yes }, + { 'c', "stdout", Arg_parser::no }, + { 'd', "decompress", Arg_parser::no }, + { 'f', "force", Arg_parser::no }, + { 'F', "recompress", Arg_parser::no }, + { 'h', "help", Arg_parser::no }, + { 'k', "keep", Arg_parser::no }, + { 'l', "list", Arg_parser::no }, + { 'm', "match-length", Arg_parser::yes }, + { 'n', "threads", Arg_parser::yes }, + { 'o', "output", Arg_parser::yes }, + { 'q', "quiet", Arg_parser::no }, + { 's', "dictionary-size", Arg_parser::yes }, + { 'S', "volume-size", Arg_parser::yes }, + { 't', "test", Arg_parser::no }, + { 'v', "verbose", Arg_parser::no }, + { 'V', "version", Arg_parser::no }, + { opt_lt, "loose-trailing", Arg_parser::no }, + { 0, 0, Arg_parser::no } }; + + const Arg_parser parser( argc, argv, options ); + if( parser.error().size() ) // bad option + { show_error( parser.error().c_str(), 0, true ); return 1; } + + int argind = 0; + for( ; argind < parser.arguments(); ++argind ) + { + const int code = parser.code( argind ); + if( !code ) break; // no more options + const char * const pn = parser.parsed_name( argind ).c_str(); + const std::string & sarg = parser.argument( argind ); + const char * const arg = sarg.c_str(); + switch( code ) + { + case '0': case '1': case '2': case '3': case '4': + case '5': case '6': case '7': case '8': case '9': + zero = ( code == '0' ); + encoder_options = option_mapping[code-'0']; break; + case 'a': ignore_trailing = false; break; + case 'b': member_size = getnum( arg, pn, 100000, max_member_size ); break; + case 'c': to_stdout = true; break; + case 'd': set_mode( program_mode, m_decompress ); break; + case 'f': force = true; break; + case 'F': recompress = true; break; + case 'h': show_help(); return 0; + case 'k': keep_input_files = true; break; + case 'l': set_mode( program_mode, m_list ); break; + case 'm': encoder_options.match_len_limit = + getnum( arg, pn, min_match_len_limit, max_match_len ); + zero = false; break; + case 'n': break; + case 'o': if( sarg == "-" ) to_stdout = true; + else { default_output_filename = sarg; } break; + case 'q': verbosity = -1; break; + case 's': encoder_options.dictionary_size = get_dict_size( arg, pn ); + zero = false; break; + case 'S': volume_size = getnum( arg, pn, 100000, max_volume_size ); break; + case 't': set_mode( program_mode, m_test ); break; + case 'v': if( verbosity < 4 ) ++verbosity; break; + case 'V': show_version(); return 0; + case opt_lt: loose_trailing = true; break; + default : internal_error( "uncaught option." ); + } + } // end process options + +#if defined __MSVCRT__ || defined __OS2__ || defined __DJGPP__ + setmode( STDIN_FILENO, O_BINARY ); + setmode( STDOUT_FILENO, O_BINARY ); +#endif + + std::vector< std::string > filenames; + bool filenames_given = false; + for( ; argind < parser.arguments(); ++argind ) + { + filenames.push_back( parser.argument( argind ) ); + if( filenames.back() != "-" ) filenames_given = true; + } + if( filenames.empty() ) filenames.push_back("-"); + + if( program_mode == m_list ) + return list_files( filenames, ignore_trailing, loose_trailing ); + + if( program_mode == m_compress ) + { + if( volume_size > 0 && !to_stdout && default_output_filename.size() && + filenames.size() > 1 ) + { show_error( "Only can compress one file when using '-o' and '-S'.", + 0, true ); return 1; } + dis_slots.init(); + prob_prices.init(); + } + else volume_size = 0; + if( program_mode == m_test ) to_stdout = false; // apply overrides + if( program_mode == m_test || to_stdout ) default_output_filename.clear(); + + if( to_stdout && program_mode != m_test ) // check tty only once + { outfd = STDOUT_FILENO; if( !check_tty_out( program_mode ) ) return 1; } + else outfd = -1; + + const bool to_file = !to_stdout && program_mode != m_test && + default_output_filename.size(); + if( !to_stdout && program_mode != m_test && ( filenames_given || to_file ) ) + set_signals( signal_handler ); + + Pretty_print pp( filenames ); + + int failed_tests = 0; + int retval = 0; + const bool one_to_one = !to_stdout && program_mode != m_test && !to_file; + bool stdin_used = false; + for( unsigned i = 0; i < filenames.size(); ++i ) + { + std::string input_filename; + int infd; + struct stat in_stats; + + pp.set_name( filenames[i] ); + if( filenames[i] == "-" ) + { + if( stdin_used ) continue; else stdin_used = true; + infd = STDIN_FILENO; + if( !check_tty_in( pp.name(), infd, program_mode, retval ) ) continue; + if( one_to_one ) { outfd = STDOUT_FILENO; output_filename.clear(); } + } + else + { + const int eindex = extension_index( input_filename = filenames[i] ); + infd = open_instream2( input_filename.c_str(), &in_stats, program_mode, + eindex, one_to_one, recompress ); + if( infd < 0 ) { set_retval( retval, 1 ); continue; } + if( !check_tty_in( pp.name(), infd, program_mode, retval ) ) continue; + if( one_to_one ) // open outfd after verifying infd + { + if( program_mode == m_compress ) + set_c_outname( input_filename, true, true, volume_size > 0 ); + else set_d_outname( input_filename, eindex ); + if( !open_outstream( force, true ) ) + { close( infd ); set_retval( retval, 1 ); continue; } + } + } + + if( one_to_one && !check_tty_out( program_mode ) ) + { set_retval( retval, 1 ); return retval; } // don't delete a tty + + if( to_file && outfd < 0 ) // open outfd after verifying infd + { + if( program_mode == m_compress ) set_c_outname( default_output_filename, + filenames_given, false, volume_size > 0 ); + else output_filename = default_output_filename; + if( !open_outstream( force, false ) || !check_tty_out( program_mode ) ) + return 1; // check tty only once and don't try to delete a tty + } + + const struct stat * const in_statsp = + ( input_filename.size() && one_to_one ) ? &in_stats : 0; + const unsigned long long cfile_size = + ( input_filename.size() && S_ISREG( in_stats.st_mode ) ) ? + ( in_stats.st_size + 99 ) / 100 : 0; + int tmp; + try { + if( program_mode == m_compress ) + tmp = compress( cfile_size, member_size, volume_size, infd, + encoder_options, pp, in_statsp, zero ); + else + tmp = decompress( cfile_size, infd, pp, ignore_trailing, + loose_trailing, program_mode == m_test ); + } + catch( std::bad_alloc & ) + { pp( ( program_mode == m_compress ) ? + "Not enough memory. Try a smaller dictionary size." : + "Not enough memory." ); tmp = 1; } + catch( Error & e ) { pp(); show_error( e.msg, errno ); tmp = 1; } + if( close( infd ) != 0 ) + { show_file_error( pp.name(), "Error closing input file", errno ); + set_retval( tmp, 1 ); } + set_retval( retval, tmp ); + if( tmp ) + { if( program_mode != m_test ) cleanup_and_fail( retval ); + else ++failed_tests; } + + if( delete_output_on_interrupt && one_to_one ) + close_and_set_permissions( in_statsp ); + if( input_filename.size() && !keep_input_files && one_to_one && + ( program_mode != m_compress || volume_size == 0 ) ) + std::remove( input_filename.c_str() ); + } + if( delete_output_on_interrupt ) close_and_set_permissions( 0 ); // -o + else if( outfd >= 0 && close( outfd ) != 0 ) // -c + { + show_error( "Error closing stdout", errno ); + set_retval( retval, 1 ); + } + if( failed_tests > 0 && verbosity >= 1 && filenames.size() > 1 ) + std::fprintf( stderr, "%s: warning: %d %s failed the test.\n", + program_name, failed_tests, + ( failed_tests == 1 ) ? "file" : "files" ); + return retval; + } diff --git a/testsuite/check.sh b/testsuite/check.sh new file mode 100755 index 0000000..916ed4e --- /dev/null +++ b/testsuite/check.sh @@ -0,0 +1,441 @@ +#! /bin/sh +# check script for Lzip - LZMA lossless data compressor +# Copyright (C) 2008-2022 Antonio Diaz Diaz. +# +# This script is free software: you have unlimited permission +# to copy, distribute, and modify it. + +LC_ALL=C +export LC_ALL +objdir=`pwd` +testdir=`cd "$1" ; pwd` +LZIP="${objdir}"/lzip +framework_failure() { echo "failure in testing framework" ; exit 1 ; } + +if [ ! -f "${LZIP}" ] || [ ! -x "${LZIP}" ] ; then + echo "${LZIP}: cannot execute" + exit 1 +fi + +[ -e "${LZIP}" ] 2> /dev/null || + { + echo "$0: a POSIX shell is required to run the tests" + echo "Try bash -c \"$0 $1 $2\"" + exit 1 + } + +if [ -d tmp ] ; then rm -rf tmp ; fi +mkdir tmp +cd "${objdir}"/tmp || framework_failure + +cat "${testdir}"/test.txt > in || framework_failure +in_lz="${testdir}"/test.txt.lz +in_em="${testdir}"/test_em.txt.lz +fox_lz="${testdir}"/fox.lz +fail=0 +test_failed() { fail=1 ; printf " $1" ; [ -z "$2" ] || printf "($2)" ; } + +printf "testing lzip-%s..." "$2" + +"${LZIP}" -fkqm4 in +[ $? = 1 ] || test_failed $LINENO +[ ! -e in.lz ] || test_failed $LINENO +"${LZIP}" -fkqm274 in +[ $? = 1 ] || test_failed $LINENO +[ ! -e in.lz ] || test_failed $LINENO +for i in bad_size -1 0 4095 513MiB 1G 1T 1P 1E 1Z 1Y 10KB ; do + "${LZIP}" -fkqs $i in + [ $? = 1 ] || test_failed $LINENO $i + [ ! -e in.lz ] || test_failed $LINENO $i +done +"${LZIP}" -lq in +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -tq in +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -tq < in +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -cdq in +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -cdq < in +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -dq -o in < "${in_lz}" +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" -dq -o in "${in_lz}" +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" -dq -o out nx_file.lz +[ $? = 1 ] || test_failed $LINENO +[ ! -e out ] || test_failed $LINENO +"${LZIP}" -q -o out.lz nx_file +[ $? = 1 ] || test_failed $LINENO +[ ! -e out.lz ] || test_failed $LINENO +"${LZIP}" -qf -S100k -o out in in +[ $? = 1 ] || test_failed $LINENO +# these are for code coverage +"${LZIP}" -lt "${in_lz}" 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" -cdl "${in_lz}" > out 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" -cdt "${in_lz}" > out 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" -t -- nx_file.lz 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" -t "" < /dev/null 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" --help > /dev/null || test_failed $LINENO +"${LZIP}" -n1 -V > /dev/null || test_failed $LINENO +"${LZIP}" -m 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" -z 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" --bad_option 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" --t 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" --test=2 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" --output= 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" --output 2> /dev/null +[ $? = 1 ] || test_failed $LINENO +printf "LZIP\001-.............................." | "${LZIP}" -t 2> /dev/null +printf "LZIP\002-.............................." | "${LZIP}" -t 2> /dev/null +printf "LZIP\001+.............................." | "${LZIP}" -t 2> /dev/null +rm -f out || framework_failure + +printf "\ntesting decompression..." + +for i in "${in_lz}" "${in_em}" ; do + "${LZIP}" -lq "$i" || test_failed $LINENO "$i" + "${LZIP}" -t "$i" || test_failed $LINENO "$i" + "${LZIP}" -d "$i" -o copy || test_failed $LINENO "$i" + cmp in copy || test_failed $LINENO "$i" + "${LZIP}" -cd "$i" > copy || test_failed $LINENO "$i" + cmp in copy || test_failed $LINENO "$i" + "${LZIP}" -d "$i" -o - > copy || test_failed $LINENO "$i" + cmp in copy || test_failed $LINENO "$i" + "${LZIP}" -d < "$i" > copy || test_failed $LINENO "$i" + cmp in copy || test_failed $LINENO "$i" + rm -f copy || framework_failure +done + +lines=$("${LZIP}" -tvv "${in_em}" 2>&1 | wc -l) || test_failed $LINENO +[ "${lines}" -eq 8 ] || test_failed $LINENO "${lines}" + +lines=$("${LZIP}" -lvv "${in_em}" | wc -l) || test_failed $LINENO +[ "${lines}" -eq 11 ] || test_failed $LINENO "${lines}" + +"${LZIP}" -cd "${fox_lz}" > fox || test_failed $LINENO +cat "${in_lz}" > copy.lz || framework_failure +"${LZIP}" -dk copy.lz || test_failed $LINENO +cmp in copy || test_failed $LINENO +cat fox > copy || framework_failure +cat "${in_lz}" > out.lz || framework_failure +rm -f out || framework_failure +"${LZIP}" -d copy.lz out.lz 2> /dev/null # skip copy, decompress out +[ $? = 1 ] || test_failed $LINENO +cmp fox copy || test_failed $LINENO +cmp in out || test_failed $LINENO +"${LZIP}" -df copy.lz || test_failed $LINENO +[ ! -e copy.lz ] || test_failed $LINENO +cmp in copy || test_failed $LINENO +rm -f copy out || framework_failure + +cat "${in_lz}" > copy.lz || framework_failure +"${LZIP}" -d -S100k copy.lz || test_failed $LINENO # ignore -S +[ ! -e copy.lz ] || test_failed $LINENO +cmp in copy || test_failed $LINENO + +printf "to be overwritten" > copy || framework_failure +"${LZIP}" -df -o copy < "${in_lz}" || test_failed $LINENO +cmp in copy || test_failed $LINENO +rm -f out copy || framework_failure +"${LZIP}" -d -o ./- "${in_lz}" || test_failed $LINENO +cmp in ./- || test_failed $LINENO +rm -f ./- || framework_failure +"${LZIP}" -d -o ./- < "${in_lz}" || test_failed $LINENO +cmp in ./- || test_failed $LINENO +rm -f ./- || framework_failure + +cat "${in_lz}" > anyothername || framework_failure +"${LZIP}" -dv - anyothername - < "${in_lz}" > copy 2> /dev/null || + test_failed $LINENO +cmp in copy || test_failed $LINENO +cmp in anyothername.out || test_failed $LINENO +rm -f copy anyothername.out || framework_failure + +"${LZIP}" -lq in "${in_lz}" +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -lq nx_file.lz "${in_lz}" +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" -tq in "${in_lz}" +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -tq nx_file.lz "${in_lz}" +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" -cdq in "${in_lz}" > copy +[ $? = 2 ] || test_failed $LINENO +cat copy in | cmp in - || test_failed $LINENO # copy must be empty +"${LZIP}" -cdq nx_file.lz "${in_lz}" > copy +[ $? = 1 ] || test_failed $LINENO +cmp in copy || test_failed $LINENO +rm -f copy || framework_failure +cat "${in_lz}" > copy.lz || framework_failure +for i in 1 2 3 4 5 6 7 ; do + printf "g" >> copy.lz || framework_failure + "${LZIP}" -alvv copy.lz "${in_lz}" > /dev/null 2>&1 + [ $? = 2 ] || test_failed $LINENO $i + "${LZIP}" -atvvvv copy.lz "${in_lz}" 2> /dev/null + [ $? = 2 ] || test_failed $LINENO $i +done +"${LZIP}" -dq in copy.lz +[ $? = 2 ] || test_failed $LINENO +[ -e copy.lz ] || test_failed $LINENO +[ ! -e copy ] || test_failed $LINENO +[ ! -e in.out ] || test_failed $LINENO +"${LZIP}" -dq nx_file.lz copy.lz +[ $? = 1 ] || test_failed $LINENO +[ ! -e copy.lz ] || test_failed $LINENO +[ ! -e nx_file ] || test_failed $LINENO +cmp in copy || test_failed $LINENO + +cat in in > in2 || framework_failure +"${LZIP}" -lq "${in_lz}" "${in_lz}" || test_failed $LINENO +"${LZIP}" -t "${in_lz}" "${in_lz}" || test_failed $LINENO +"${LZIP}" -cd "${in_lz}" "${in_lz}" -o out > copy2 || test_failed $LINENO +[ ! -e out ] || test_failed $LINENO # override -o +cmp in2 copy2 || test_failed $LINENO +rm -f copy2 || framework_failure +"${LZIP}" -d "${in_lz}" "${in_lz}" -o copy2 || test_failed $LINENO +cmp in2 copy2 || test_failed $LINENO +rm -f copy2 || framework_failure + +cat "${in_lz}" "${in_lz}" > copy2.lz || framework_failure +printf "\ngarbage" >> copy2.lz || framework_failure +"${LZIP}" -tvvvv copy2.lz 2> /dev/null || test_failed $LINENO +"${LZIP}" -alq copy2.lz +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -atq copy2.lz +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -atq < copy2.lz +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -adkq copy2.lz +[ $? = 2 ] || test_failed $LINENO +[ ! -e copy2 ] || test_failed $LINENO +"${LZIP}" -adkq -o copy2 < copy2.lz +[ $? = 2 ] || test_failed $LINENO +[ ! -e copy2 ] || test_failed $LINENO +printf "to be overwritten" > copy2 || framework_failure +"${LZIP}" -df copy2.lz || test_failed $LINENO +cmp in2 copy2 || test_failed $LINENO +rm -f copy2 || framework_failure + +printf "\ntesting compression..." + +"${LZIP}" -c -0 in in in -S100k -o out3.lz > copy2.lz || test_failed $LINENO +[ ! -e out3.lz ] || test_failed $LINENO # override -o and -S +"${LZIP}" -0f in in --output=copy2.lz || test_failed $LINENO +"${LZIP}" -d copy2.lz -o out2 || test_failed $LINENO +cmp in2 out2 || test_failed $LINENO +rm -f in2 out2 copy2.lz || framework_failure + +"${LZIP}" -cf "${in_lz}" > out 2> /dev/null # /dev/null is a tty on OS/2 +[ $? = 1 ] || test_failed $LINENO +"${LZIP}" -Fvvm36 -o - "${in_lz}" > out 2> /dev/null || test_failed $LINENO +"${LZIP}" -cd out | "${LZIP}" -d > copy || test_failed $LINENO +cmp in copy || test_failed $LINENO + +"${LZIP}" -0 -o ./- in || test_failed $LINENO +"${LZIP}" -cd ./- | cmp in - || test_failed $LINENO +rm -f ./- || framework_failure +"${LZIP}" -0 -o ./- < in || test_failed $LINENO # add .lz +[ ! -e ./- ] || test_failed $LINENO +"${LZIP}" -cd -- -.lz | cmp in - || test_failed $LINENO +rm -f ./-.lz || framework_failure + +for i in s4Ki 0 1 2 3 4 5 6 7 8 9 ; do + "${LZIP}" -k -$i in || test_failed $LINENO $i + mv -f in.lz copy.lz || test_failed $LINENO $i + printf "garbage" >> copy.lz || framework_failure + "${LZIP}" -df copy.lz || test_failed $LINENO $i + cmp in copy || test_failed $LINENO $i + + "${LZIP}" -$i in -c > out || test_failed $LINENO $i + "${LZIP}" -$i in -o o_out || test_failed $LINENO $i # don't add .lz + [ ! -e o_out.lz ] || test_failed $LINENO + cmp out o_out || test_failed $LINENO $i + rm -f o_out || framework_failure + printf "g" >> out || framework_failure + "${LZIP}" -cd out > copy || test_failed $LINENO $i + cmp in copy || test_failed $LINENO $i + + "${LZIP}" -$i < in > out || test_failed $LINENO $i + "${LZIP}" -d < out > copy || test_failed $LINENO $i + cmp in copy || test_failed $LINENO $i + + rm -f out || framework_failure + printf "to be overwritten" > out.lz || framework_failure + "${LZIP}" -f -$i -o out < in || test_failed $LINENO $i # add .lz + [ ! -e out ] || test_failed $LINENO + "${LZIP}" -df -o copy < out.lz || test_failed $LINENO $i + cmp in copy || test_failed $LINENO $i +done +rm -f out out.lz || framework_failure + +cat in in in in in in in in > in8 || framework_failure +"${LZIP}" -1s12 -S100k in8 || test_failed $LINENO +"${LZIP}" -t in800001.lz in800002.lz || test_failed $LINENO +"${LZIP}" -cd in800001.lz in800002.lz | cmp in8 - || test_failed $LINENO +[ ! -e in800003.lz ] || test_failed $LINENO +rm -f in800001.lz in800002.lz || framework_failure +"${LZIP}" -1s12 -S100k -o out.lz in8 || test_failed $LINENO +# ignore -S +"${LZIP}" -d out.lz00001.lz out.lz00002.lz -S100k -o out || test_failed $LINENO +cmp in8 out || test_failed $LINENO +"${LZIP}" -t out.lz00001.lz out.lz00002.lz || test_failed $LINENO +[ ! -e out.lz00003.lz ] || test_failed $LINENO +rm -f out out.lz00001.lz out.lz00002.lz || framework_failure +"${LZIP}" -1ks4Ki -b100000 in8 || test_failed $LINENO +"${LZIP}" -t in8.lz || test_failed $LINENO +"${LZIP}" -cd in8.lz -o out | cmp in8 - || test_failed $LINENO # override -o +[ ! -e out ] || test_failed $LINENO +rm -f in8 || framework_failure +"${LZIP}" -0 -S100k -o out < in8.lz || test_failed $LINENO +"${LZIP}" -t out00001.lz out00002.lz || test_failed $LINENO +"${LZIP}" -cd out00001.lz out00002.lz | cmp in8.lz - || test_failed $LINENO +[ ! -e out00003.lz ] || test_failed $LINENO +rm -f out00001.lz || framework_failure +"${LZIP}" -1 -S100k -o out < in8.lz || test_failed $LINENO +"${LZIP}" -t out00001.lz out00002.lz || test_failed $LINENO +"${LZIP}" -cd out00001.lz out00002.lz | cmp in8.lz - || test_failed $LINENO +[ ! -e out00003.lz ] || test_failed $LINENO +rm -f out00001.lz out00002.lz || framework_failure +"${LZIP}" -0 -F -S100k in8.lz || test_failed $LINENO +"${LZIP}" -t in8.lz00001.lz in8.lz00002.lz || test_failed $LINENO +"${LZIP}" -cd in8.lz00001.lz in8.lz00002.lz | cmp in8.lz - || test_failed $LINENO +[ ! -e in8.lz00003.lz ] || test_failed $LINENO +rm -f in8.lz00001.lz in8.lz00002.lz || framework_failure +"${LZIP}" -0kF -b100k in8.lz || test_failed $LINENO +"${LZIP}" -t in8.lz.lz || test_failed $LINENO +"${LZIP}" -cd in8.lz.lz | cmp in8.lz - || test_failed $LINENO +rm -f in8.lz in8.lz.lz || framework_failure + +printf "\ntesting bad input..." + +headers='LZIp LZiP LZip LzIP LzIp LziP lZIP lZIp lZiP lzIP' +body='\001\014\000\203\377\373\377\377\300\000\000\000\000\000\000\000\000\000\000\000\000\000\000\000$\000\000\000\000\000\000\000' +cat "${in_lz}" > int.lz +printf "LZIP${body}" >> int.lz +if "${LZIP}" -tq int.lz ; then + for header in ${headers} ; do + printf "${header}${body}" > int.lz # first member + "${LZIP}" -lq int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -tq int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -tq < int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -cdq int.lz > /dev/null + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -lq --loose-trailing int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -tq --loose-trailing int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -tq --loose-trailing < int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -cdq --loose-trailing int.lz > /dev/null + [ $? = 2 ] || test_failed $LINENO ${header} + cat "${in_lz}" > int.lz + printf "${header}${body}" >> int.lz # trailing data + "${LZIP}" -lq int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -tq int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -tq < int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -cdq int.lz > /dev/null + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -lq --loose-trailing int.lz || + test_failed $LINENO ${header} + "${LZIP}" -t --loose-trailing int.lz || + test_failed $LINENO ${header} + "${LZIP}" -t --loose-trailing < int.lz || + test_failed $LINENO ${header} + "${LZIP}" -cd --loose-trailing int.lz > /dev/null || + test_failed $LINENO ${header} + "${LZIP}" -lq --loose-trailing --trailing-error int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -tq --loose-trailing --trailing-error int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -tq --loose-trailing --trailing-error < int.lz + [ $? = 2 ] || test_failed $LINENO ${header} + "${LZIP}" -cdq --loose-trailing --trailing-error int.lz > /dev/null + [ $? = 2 ] || test_failed $LINENO ${header} + done +else + printf "\nwarning: skipping header test: 'printf' does not work on your system." +fi +rm -f int.lz || framework_failure + +for i in fox_v2.lz fox_s11.lz fox_de20.lz \ + fox_bcrc.lz fox_crc0.lz fox_das46.lz fox_mes81.lz ; do + "${LZIP}" -tq "${testdir}"/$i + [ $? = 2 ] || test_failed $LINENO $i +done + +for i in fox_bcrc.lz fox_crc0.lz fox_das46.lz fox_mes81.lz ; do + "${LZIP}" -cdq "${testdir}"/$i > out + [ $? = 2 ] || test_failed $LINENO $i + cmp fox out || test_failed $LINENO $i +done +rm -f fox out || framework_failure + +cat "${in_lz}" "${in_lz}" > in2.lz || framework_failure +cat "${in_lz}" "${in_lz}" "${in_lz}" > in3.lz || framework_failure +if dd if=in3.lz of=trunc.lz bs=14752 count=1 2> /dev/null && + [ -e trunc.lz ] && cmp in2.lz trunc.lz > /dev/null 2>&1 ; then + for i in 6 20 14734 14753 14754 14755 14756 14757 14758 ; do + dd if=in3.lz of=trunc.lz bs=$i count=1 2> /dev/null + "${LZIP}" -lq trunc.lz + [ $? = 2 ] || test_failed $LINENO $i + "${LZIP}" -tq trunc.lz + [ $? = 2 ] || test_failed $LINENO $i + "${LZIP}" -tq < trunc.lz + [ $? = 2 ] || test_failed $LINENO $i + "${LZIP}" -cdq trunc.lz > out + [ $? = 2 ] || test_failed $LINENO $i + "${LZIP}" -dq < trunc.lz > out + [ $? = 2 ] || test_failed $LINENO $i + done +else + printf "\nwarning: skipping truncation test: 'dd' does not work on your system." +fi +rm -f in2.lz in3.lz trunc.lz out || framework_failure + +cat "${in_lz}" > ingin.lz || framework_failure +printf "g" >> ingin.lz || framework_failure +cat "${in_lz}" >> ingin.lz || framework_failure +"${LZIP}" -lq ingin.lz +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -atq ingin.lz +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -atq < ingin.lz +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -acdq ingin.lz > out +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -adq < ingin.lz > out +[ $? = 2 ] || test_failed $LINENO +"${LZIP}" -t ingin.lz || test_failed $LINENO +"${LZIP}" -t < ingin.lz || test_failed $LINENO +"${LZIP}" -cd ingin.lz > copy || test_failed $LINENO +cmp in copy || test_failed $LINENO +"${LZIP}" -d < ingin.lz > copy || test_failed $LINENO +cmp in copy || test_failed $LINENO +rm -f copy ingin.lz out || framework_failure + +echo +if [ ${fail} = 0 ] ; then + echo "tests completed successfully." + cd "${objdir}" && rm -r tmp +else + echo "tests failed." +fi +exit ${fail} diff --git a/testsuite/fox.lz b/testsuite/fox.lz Binary files differnew file mode 100644 index 0000000..509da82 --- /dev/null +++ b/testsuite/fox.lz diff --git a/testsuite/fox_bcrc.lz b/testsuite/fox_bcrc.lz Binary files differnew file mode 100644 index 0000000..8f6a7c4 --- /dev/null +++ b/testsuite/fox_bcrc.lz diff --git a/testsuite/fox_crc0.lz b/testsuite/fox_crc0.lz Binary files differnew file mode 100644 index 0000000..1abe926 --- /dev/null +++ b/testsuite/fox_crc0.lz diff --git a/testsuite/fox_das46.lz b/testsuite/fox_das46.lz Binary files differnew file mode 100644 index 0000000..43ed9f9 --- /dev/null +++ b/testsuite/fox_das46.lz diff --git a/testsuite/fox_de20.lz b/testsuite/fox_de20.lz Binary files differnew file mode 100644 index 0000000..10949d8 --- /dev/null +++ b/testsuite/fox_de20.lz diff --git a/testsuite/fox_mes81.lz b/testsuite/fox_mes81.lz Binary files differnew file mode 100644 index 0000000..d50ef2e --- /dev/null +++ b/testsuite/fox_mes81.lz diff --git a/testsuite/fox_s11.lz b/testsuite/fox_s11.lz Binary files differnew file mode 100644 index 0000000..dca909c --- /dev/null +++ b/testsuite/fox_s11.lz diff --git a/testsuite/fox_v2.lz b/testsuite/fox_v2.lz Binary files differnew file mode 100644 index 0000000..8620981 --- /dev/null +++ b/testsuite/fox_v2.lz diff --git a/testsuite/test.txt b/testsuite/test.txt new file mode 100644 index 0000000..9196a3a --- /dev/null +++ b/testsuite/test.txt @@ -0,0 +1,676 @@ + GNU GENERAL PUBLIC LICENSE + Version 2, June 1991 + + Copyright (C) 1989, 1991 Free Software Foundation, Inc., + 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The licenses for most software are designed to take away your +freedom to share and change it. By contrast, the GNU General Public +License is intended to guarantee your freedom to share and change free +software--to make sure the software is free for all its users. This +General Public License applies to most of the Free Software +Foundation's software and to any other program whose authors commit to +using it. (Some other Free Software Foundation software is covered by +the GNU Lesser General Public License instead.) You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +this service if you wish), that you receive source code or can get it +if you want it, that you can change the software or use pieces of it +in new free programs; and that you know you can do these things. + + To protect your rights, we need to make restrictions that forbid +anyone to deny you these rights or to ask you to surrender the rights. +These restrictions translate to certain responsibilities for you if you +distribute copies of the software, or if you modify it. + + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must give the recipients all the rights that +you have. You must make sure that they, too, receive or can get the +source code. And you must show them these terms so they know their +rights. + + We protect your rights with two steps: (1) copyright the software, and +(2) offer you this license which gives you legal permission to copy, +distribute and/or modify the software. + + Also, for each author's protection and ours, we want to make certain +that everyone understands that there is no warranty for this free +software. If the software is modified by someone else and passed on, we +want its recipients to know that what they have is not the original, so +that any problems introduced by others will not reflect on the original +authors' reputations. + + Finally, any free program is threatened constantly by software +patents. We wish to avoid the danger that redistributors of a free +program will individually obtain patent licenses, in effect making the +program proprietary. To prevent this, we have made it clear that any +patent must be licensed for everyone's free use or not licensed at all. + + The precise terms and conditions for copying, distribution and +modification follow. + + GNU GENERAL PUBLIC LICENSE + TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION + + 0. This License applies to any program or other work which contains +a notice placed by the copyright holder saying it may be distributed +under the terms of this General Public License. The "Program", below, +refers to any such program or work, and a "work based on the Program" +means either the Program or any derivative work under copyright law: +that is to say, a work containing the Program or a portion of it, +either verbatim or with modifications and/or translated into another +language. (Hereinafter, translation is included without limitation in +the term "modification".) Each licensee is addressed as "you". + +Activities other than copying, distribution and modification are not +covered by this License; they are outside its scope. The act of +running the Program is not restricted, and the output from the Program +is covered only if its contents constitute a work based on the +Program (independent of having been made by running the Program). +Whether that is true depends on what the Program does. + + 1. You may copy and distribute verbatim copies of the Program's +source code as you receive it, in any medium, provided that you +conspicuously and appropriately publish on each copy an appropriate +copyright notice and disclaimer of warranty; keep intact all the +notices that refer to this License and to the absence of any warranty; +and give any other recipients of the Program a copy of this License +along with the Program. + +You may charge a fee for the physical act of transferring a copy, and +you may at your option offer warranty protection in exchange for a fee. + + 2. You may modify your copy or copies of the Program or any portion +of it, thus forming a work based on the Program, and copy and +distribute such modifications or work under the terms of Section 1 +above, provided that you also meet all of these conditions: + + a) You must cause the modified files to carry prominent notices + stating that you changed the files and the date of any change. + + b) You must cause any work that you distribute or publish, that in + whole or in part contains or is derived from the Program or any + part thereof, to be licensed as a whole at no charge to all third + parties under the terms of this License. + + c) If the modified program normally reads commands interactively + when run, you must cause it, when started running for such + interactive use in the most ordinary way, to print or display an + announcement including an appropriate copyright notice and a + notice that there is no warranty (or else, saying that you provide + a warranty) and that users may redistribute the program under + these conditions, and telling the user how to view a copy of this + License. (Exception: if the Program itself is interactive but + does not normally print such an announcement, your work based on + the Program is not required to print an announcement.) + +These requirements apply to the modified work as a whole. If +identifiable sections of that work are not derived from the Program, +and can be reasonably considered independent and separate works in +themselves, then this License, and its terms, do not apply to those +sections when you distribute them as separate works. But when you +distribute the same sections as part of a whole which is a work based +on the Program, the distribution of the whole must be on the terms of +this License, whose permissions for other licensees extend to the +entire whole, and thus to each and every part regardless of who wrote it. + +Thus, it is not the intent of this section to claim rights or contest +your rights to work written entirely by you; rather, the intent is to +exercise the right to control the distribution of derivative or +collective works based on the Program. + +In addition, mere aggregation of another work not based on the Program +with the Program (or with a work based on the Program) on a volume of +a storage or distribution medium does not bring the other work under +the scope of this License. + + 3. You may copy and distribute the Program (or a work based on it, +under Section 2) in object code or executable form under the terms of +Sections 1 and 2 above provided that you also do one of the following: + + a) Accompany it with the complete corresponding machine-readable + source code, which must be distributed under the terms of Sections + 1 and 2 above on a medium customarily used for software interchange; or, + + b) Accompany it with a written offer, valid for at least three + years, to give any third party, for a charge no more than your + cost of physically performing source distribution, a complete + machine-readable copy of the corresponding source code, to be + distributed under the terms of Sections 1 and 2 above on a medium + customarily used for software interchange; or, + + c) Accompany it with the information you received as to the offer + to distribute corresponding source code. (This alternative is + allowed only for noncommercial distribution and only if you + received the program in object code or executable form with such + an offer, in accord with Subsection b above.) + +The source code for a work means the preferred form of the work for +making modifications to it. For an executable work, complete source +code means all the source code for all modules it contains, plus any +associated interface definition files, plus the scripts used to +control compilation and installation of the executable. However, as a +special exception, the source code distributed need not include +anything that is normally distributed (in either source or binary +form) with the major components (compiler, kernel, and so on) of the +operating system on which the executable runs, unless that component +itself accompanies the executable. + +If distribution of executable or object code is made by offering +access to copy from a designated place, then offering equivalent +access to copy the source code from the same place counts as +distribution of the source code, even though third parties are not +compelled to copy the source along with the object code. + + 4. You may not copy, modify, sublicense, or distribute the Program +except as expressly provided under this License. Any attempt +otherwise to copy, modify, sublicense or distribute the Program is +void, and will automatically terminate your rights under this License. +However, parties who have received copies, or rights, from you under +this License will not have their licenses terminated so long as such +parties remain in full compliance. + + 5. You are not required to accept this License, since you have not +signed it. However, nothing else grants you permission to modify or +distribute the Program or its derivative works. These actions are +prohibited by law if you do not accept this License. Therefore, by +modifying or distributing the Program (or any work based on the +Program), you indicate your acceptance of this License to do so, and +all its terms and conditions for copying, distributing or modifying +the Program or works based on it. + + 6. Each time you redistribute the Program (or any work based on the +Program), the recipient automatically receives a license from the +original licensor to copy, distribute or modify the Program subject to +these terms and conditions. You may not impose any further +restrictions on the recipients' exercise of the rights granted herein. +You are not responsible for enforcing compliance by third parties to +this License. + + 7. If, as a consequence of a court judgment or allegation of patent +infringement or for any other reason (not limited to patent issues), +conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot +distribute so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you +may not distribute the Program at all. For example, if a patent +license would not permit royalty-free redistribution of the Program by +all those who receive copies directly or indirectly through you, then +the only way you could satisfy both it and this License would be to +refrain entirely from distribution of the Program. + +If any portion of this section is held invalid or unenforceable under +any particular circumstance, the balance of the section is intended to +apply and the section as a whole is intended to apply in other +circumstances. + +It is not the purpose of this section to induce you to infringe any +patents or other property right claims or to contest validity of any +such claims; this section has the sole purpose of protecting the +integrity of the free software distribution system, which is +implemented by public license practices. Many people have made +generous contributions to the wide range of software distributed +through that system in reliance on consistent application of that +system; it is up to the author/donor to decide if he or she is willing +to distribute software through any other system and a licensee cannot +impose that choice. + +This section is intended to make thoroughly clear what is believed to +be a consequence of the rest of this License. + + 8. If the distribution and/or use of the Program is restricted in +certain countries either by patents or by copyrighted interfaces, the +original copyright holder who places the Program under this License +may add an explicit geographical distribution limitation excluding +those countries, so that distribution is permitted only in or among +countries not thus excluded. In such case, this License incorporates +the limitation as if written in the body of this License. + + 9. The Free Software Foundation may publish revised and/or new versions +of the General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + +Each version is given a distinguishing version number. If the Program +specifies a version number of this License which applies to it and "any +later version", you have the option of following the terms and conditions +either of that version or of any later version published by the Free +Software Foundation. If the Program does not specify a version number of +this License, you may choose any version ever published by the Free Software +Foundation. + + 10. If you wish to incorporate parts of the Program into other free +programs whose distribution conditions are different, write to the author +to ask for permission. For software which is copyrighted by the Free +Software Foundation, write to the Free Software Foundation; we sometimes +make exceptions for this. Our decision will be guided by the two goals +of preserving the free status of all derivatives of our free software and +of promoting the sharing and reuse of software generally. + + NO WARRANTY + + 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY +FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN +OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES +PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED +OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF +MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS +TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE +PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, +REPAIR OR CORRECTION. + + 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR +REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, +INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING +OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED +TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY +YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER +PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE +POSSIBILITY OF SUCH DAMAGES. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +convey the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. + + <one line to give the program's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 2 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. + +Also add information on how to contact you by electronic and paper mail. + +If the program is interactive, make it output a short notice like this +when it starts in an interactive mode: + + Gnomovision version 69, Copyright (C) <year> <name of author> + Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, the commands you use may +be called something other than `show w' and `show c'; they could even be +mouse-clicks or menu items--whatever suits your program. + +You should also get your employer (if you work as a programmer) or your +school, if any, to sign a "copyright disclaimer" for the program, if +necessary. Here is a sample; alter the names: + + Yoyodyne, Inc., hereby disclaims all copyright interest in the program + `Gnomovision' (which makes passes at compilers) written by James Hacker. + + <signature of Ty Coon>, 1 April 1989 + Ty Coon, President of Vice + +This General Public License does not permit incorporating your program into +proprietary programs. If your program is a subroutine library, you may +consider it more useful to permit linking proprietary applications with the +library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. + GNU GENERAL PUBLIC LICENSE
+ Version 2, June 1991
+
+ Copyright (C) 1989, 1991 Free Software Foundation, Inc.,
+ 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
+ Everyone is permitted to copy and distribute verbatim copies
+ of this license document, but changing it is not allowed.
+
+ Preamble
+
+ The licenses for most software are designed to take away your
+freedom to share and change it. By contrast, the GNU General Public
+License is intended to guarantee your freedom to share and change free
+software--to make sure the software is free for all its users. This
+General Public License applies to most of the Free Software
+Foundation's software and to any other program whose authors commit to
+using it. (Some other Free Software Foundation software is covered by
+the GNU Lesser General Public License instead.) You can apply it to
+your programs, too.
+
+ When we speak of free software, we are referring to freedom, not
+price. Our General Public Licenses are designed to make sure that you
+have the freedom to distribute copies of free software (and charge for
+this service if you wish), that you receive source code or can get it
+if you want it, that you can change the software or use pieces of it
+in new free programs; and that you know you can do these things.
+
+ To protect your rights, we need to make restrictions that forbid
+anyone to deny you these rights or to ask you to surrender the rights.
+These restrictions translate to certain responsibilities for you if you
+distribute copies of the software, or if you modify it.
+
+ For example, if you distribute copies of such a program, whether
+gratis or for a fee, you must give the recipients all the rights that
+you have. You must make sure that they, too, receive or can get the
+source code. And you must show them these terms so they know their
+rights.
+
+ We protect your rights with two steps: (1) copyright the software, and
+(2) offer you this license which gives you legal permission to copy,
+distribute and/or modify the software.
+
+ Also, for each author's protection and ours, we want to make certain
+that everyone understands that there is no warranty for this free
+software. If the software is modified by someone else and passed on, we
+want its recipients to know that what they have is not the original, so
+that any problems introduced by others will not reflect on the original
+authors' reputations.
+
+ Finally, any free program is threatened constantly by software
+patents. We wish to avoid the danger that redistributors of a free
+program will individually obtain patent licenses, in effect making the
+program proprietary. To prevent this, we have made it clear that any
+patent must be licensed for everyone's free use or not licensed at all.
+
+ The precise terms and conditions for copying, distribution and
+modification follow.
+
+ GNU GENERAL PUBLIC LICENSE
+ TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
+
+ 0. This License applies to any program or other work which contains
+a notice placed by the copyright holder saying it may be distributed
+under the terms of this General Public License. The "Program", below,
+refers to any such program or work, and a "work based on the Program"
+means either the Program or any derivative work under copyright law:
+that is to say, a work containing the Program or a portion of it,
+either verbatim or with modifications and/or translated into another
+language. (Hereinafter, translation is included without limitation in
+the term "modification".) Each licensee is addressed as "you".
+
+Activities other than copying, distribution and modification are not
+covered by this License; they are outside its scope. The act of
+running the Program is not restricted, and the output from the Program
+is covered only if its contents constitute a work based on the
+Program (independent of having been made by running the Program).
+Whether that is true depends on what the Program does.
+
+ 1. You may copy and distribute verbatim copies of the Program's
+source code as you receive it, in any medium, provided that you
+conspicuously and appropriately publish on each copy an appropriate
+copyright notice and disclaimer of warranty; keep intact all the
+notices that refer to this License and to the absence of any warranty;
+and give any other recipients of the Program a copy of this License
+along with the Program.
+
+You may charge a fee for the physical act of transferring a copy, and
+you may at your option offer warranty protection in exchange for a fee.
+
+ 2. You may modify your copy or copies of the Program or any portion
+of it, thus forming a work based on the Program, and copy and
+distribute such modifications or work under the terms of Section 1
+above, provided that you also meet all of these conditions:
+
+ a) You must cause the modified files to carry prominent notices
+ stating that you changed the files and the date of any change.
+
+ b) You must cause any work that you distribute or publish, that in
+ whole or in part contains or is derived from the Program or any
+ part thereof, to be licensed as a whole at no charge to all third
+ parties under the terms of this License.
+
+ c) If the modified program normally reads commands interactively
+ when run, you must cause it, when started running for such
+ interactive use in the most ordinary way, to print or display an
+ announcement including an appropriate copyright notice and a
+ notice that there is no warranty (or else, saying that you provide
+ a warranty) and that users may redistribute the program under
+ these conditions, and telling the user how to view a copy of this
+ License. (Exception: if the Program itself is interactive but
+ does not normally print such an announcement, your work based on
+ the Program is not required to print an announcement.)
+
+These requirements apply to the modified work as a whole. If
+identifiable sections of that work are not derived from the Program,
+and can be reasonably considered independent and separate works in
+themselves, then this License, and its terms, do not apply to those
+sections when you distribute them as separate works. But when you
+distribute the same sections as part of a whole which is a work based
+on the Program, the distribution of the whole must be on the terms of
+this License, whose permissions for other licensees extend to the
+entire whole, and thus to each and every part regardless of who wrote it.
+
+Thus, it is not the intent of this section to claim rights or contest
+your rights to work written entirely by you; rather, the intent is to
+exercise the right to control the distribution of derivative or
+collective works based on the Program.
+
+In addition, mere aggregation of another work not based on the Program
+with the Program (or with a work based on the Program) on a volume of
+a storage or distribution medium does not bring the other work under
+the scope of this License.
+
+ 3. You may copy and distribute the Program (or a work based on it,
+under Section 2) in object code or executable form under the terms of
+Sections 1 and 2 above provided that you also do one of the following:
+
+ a) Accompany it with the complete corresponding machine-readable
+ source code, which must be distributed under the terms of Sections
+ 1 and 2 above on a medium customarily used for software interchange; or,
+
+ b) Accompany it with a written offer, valid for at least three
+ years, to give any third party, for a charge no more than your
+ cost of physically performing source distribution, a complete
+ machine-readable copy of the corresponding source code, to be
+ distributed under the terms of Sections 1 and 2 above on a medium
+ customarily used for software interchange; or,
+
+ c) Accompany it with the information you received as to the offer
+ to distribute corresponding source code. (This alternative is
+ allowed only for noncommercial distribution and only if you
+ received the program in object code or executable form with such
+ an offer, in accord with Subsection b above.)
+
+The source code for a work means the preferred form of the work for
+making modifications to it. For an executable work, complete source
+code means all the source code for all modules it contains, plus any
+associated interface definition files, plus the scripts used to
+control compilation and installation of the executable. However, as a
+special exception, the source code distributed need not include
+anything that is normally distributed (in either source or binary
+form) with the major components (compiler, kernel, and so on) of the
+operating system on which the executable runs, unless that component
+itself accompanies the executable.
+
+If distribution of executable or object code is made by offering
+access to copy from a designated place, then offering equivalent
+access to copy the source code from the same place counts as
+distribution of the source code, even though third parties are not
+compelled to copy the source along with the object code.
+
+ 4. You may not copy, modify, sublicense, or distribute the Program
+except as expressly provided under this License. Any attempt
+otherwise to copy, modify, sublicense or distribute the Program is
+void, and will automatically terminate your rights under this License.
+However, parties who have received copies, or rights, from you under
+this License will not have their licenses terminated so long as such
+parties remain in full compliance.
+
+ 5. You are not required to accept this License, since you have not
+signed it. However, nothing else grants you permission to modify or
+distribute the Program or its derivative works. These actions are
+prohibited by law if you do not accept this License. Therefore, by
+modifying or distributing the Program (or any work based on the
+Program), you indicate your acceptance of this License to do so, and
+all its terms and conditions for copying, distributing or modifying
+the Program or works based on it.
+
+ 6. Each time you redistribute the Program (or any work based on the
+Program), the recipient automatically receives a license from the
+original licensor to copy, distribute or modify the Program subject to
+these terms and conditions. You may not impose any further
+restrictions on the recipients' exercise of the rights granted herein.
+You are not responsible for enforcing compliance by third parties to
+this License.
+
+ 7. If, as a consequence of a court judgment or allegation of patent
+infringement or for any other reason (not limited to patent issues),
+conditions are imposed on you (whether by court order, agreement or
+otherwise) that contradict the conditions of this License, they do not
+excuse you from the conditions of this License. If you cannot
+distribute so as to satisfy simultaneously your obligations under this
+License and any other pertinent obligations, then as a consequence you
+may not distribute the Program at all. For example, if a patent
+license would not permit royalty-free redistribution of the Program by
+all those who receive copies directly or indirectly through you, then
+the only way you could satisfy both it and this License would be to
+refrain entirely from distribution of the Program.
+
+If any portion of this section is held invalid or unenforceable under
+any particular circumstance, the balance of the section is intended to
+apply and the section as a whole is intended to apply in other
+circumstances.
+
+It is not the purpose of this section to induce you to infringe any
+patents or other property right claims or to contest validity of any
+such claims; this section has the sole purpose of protecting the
+integrity of the free software distribution system, which is
+implemented by public license practices. Many people have made
+generous contributions to the wide range of software distributed
+through that system in reliance on consistent application of that
+system; it is up to the author/donor to decide if he or she is willing
+to distribute software through any other system and a licensee cannot
+impose that choice.
+
+This section is intended to make thoroughly clear what is believed to
+be a consequence of the rest of this License.
+
+ 8. If the distribution and/or use of the Program is restricted in
+certain countries either by patents or by copyrighted interfaces, the
+original copyright holder who places the Program under this License
+may add an explicit geographical distribution limitation excluding
+those countries, so that distribution is permitted only in or among
+countries not thus excluded. In such case, this License incorporates
+the limitation as if written in the body of this License.
+
+ 9. The Free Software Foundation may publish revised and/or new versions
+of the General Public License from time to time. Such new versions will
+be similar in spirit to the present version, but may differ in detail to
+address new problems or concerns.
+
+Each version is given a distinguishing version number. If the Program
+specifies a version number of this License which applies to it and "any
+later version", you have the option of following the terms and conditions
+either of that version or of any later version published by the Free
+Software Foundation. If the Program does not specify a version number of
+this License, you may choose any version ever published by the Free Software
+Foundation.
+
+ 10. If you wish to incorporate parts of the Program into other free
+programs whose distribution conditions are different, write to the author
+to ask for permission. For software which is copyrighted by the Free
+Software Foundation, write to the Free Software Foundation; we sometimes
+make exceptions for this. Our decision will be guided by the two goals
+of preserving the free status of all derivatives of our free software and
+of promoting the sharing and reuse of software generally.
+
+ NO WARRANTY
+
+ 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY
+FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN
+OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES
+PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED
+OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF
+MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS
+TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE
+PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING,
+REPAIR OR CORRECTION.
+
+ 12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
+WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR
+REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES,
+INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING
+OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED
+TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY
+YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER
+PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE
+POSSIBILITY OF SUCH DAMAGES.
+
+ END OF TERMS AND CONDITIONS
+
+ How to Apply These Terms to Your New Programs
+
+ If you develop a new program, and you want it to be of the greatest
+possible use to the public, the best way to achieve this is to make it
+free software which everyone can redistribute and change under these terms.
+
+ To do so, attach the following notices to the program. It is safest
+to attach them to the start of each source file to most effectively
+convey the exclusion of warranty; and each file should have at least
+the "copyright" line and a pointer to where the full notice is found.
+
+ <one line to give the program's name and a brief idea of what it does.>
+ Copyright (C) <year> <name of author>
+
+ This program is free software: you can redistribute it and/or modify
+ it under the terms of the GNU General Public License as published by
+ the Free Software Foundation, either version 2 of the License, or
+ (at your option) any later version.
+
+ This program is distributed in the hope that it will be useful,
+ but WITHOUT ANY WARRANTY; without even the implied warranty of
+ MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ GNU General Public License for more details.
+
+ You should have received a copy of the GNU General Public License
+ along with this program. If not, see <http://www.gnu.org/licenses/>.
+
+Also add information on how to contact you by electronic and paper mail.
+
+If the program is interactive, make it output a short notice like this
+when it starts in an interactive mode:
+
+ Gnomovision version 69, Copyright (C) <year> <name of author>
+ Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
+ This is free software, and you are welcome to redistribute it
+ under certain conditions; type `show c' for details.
+
+The hypothetical commands `show w' and `show c' should show the appropriate
+parts of the General Public License. Of course, the commands you use may
+be called something other than `show w' and `show c'; they could even be
+mouse-clicks or menu items--whatever suits your program.
+
+You should also get your employer (if you work as a programmer) or your
+school, if any, to sign a "copyright disclaimer" for the program, if
+necessary. Here is a sample; alter the names:
+
+ Yoyodyne, Inc., hereby disclaims all copyright interest in the program
+ `Gnomovision' (which makes passes at compilers) written by James Hacker.
+
+ <signature of Ty Coon>, 1 April 1989
+ Ty Coon, President of Vice
+
+This General Public License does not permit incorporating your program into
+proprietary programs. If your program is a subroutine library, you may
+consider it more useful to permit linking proprietary applications with the
+library. If this is what you want to do, use the GNU Lesser General
+Public License instead of this License.
diff --git a/testsuite/test.txt.lz b/testsuite/test.txt.lz Binary files differnew file mode 100644 index 0000000..22cea6e --- /dev/null +++ b/testsuite/test.txt.lz diff --git a/testsuite/test_em.txt.lz b/testsuite/test_em.txt.lz Binary files differnew file mode 100644 index 0000000..7e96250 --- /dev/null +++ b/testsuite/test_em.txt.lz |