From 2f15376ba464cf08e710c3353bdacc4f503e11b4 Mon Sep 17 00:00:00 2001 From: Daniel Baumann Date: Wed, 23 Jan 2019 18:42:07 +0100 Subject: Merging upstream version 0.9. Signed-off-by: Daniel Baumann --- doc/tarlz.1 | 25 ++++++---- doc/tarlz.info | 153 ++++++++++++++++++++++++++++++++++++++++----------------- doc/tarlz.texi | 136 ++++++++++++++++++++++++++++++++++++-------------- 3 files changed, 222 insertions(+), 92 deletions(-) (limited to 'doc') diff --git a/doc/tarlz.1 b/doc/tarlz.1 index 906fee0..b83a7e6 100644 --- a/doc/tarlz.1 +++ b/doc/tarlz.1 @@ -1,18 +1,20 @@ .\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.46.1. -.TH TARLZ "1" "December 2018" "tarlz 0.8" "User Commands" +.TH TARLZ "1" "January 2019" "tarlz 0.9" "User Commands" .SH NAME tarlz \- creates tar archives with multimember lzip compression .SH SYNOPSIS .B tarlz [\fI\,options\/\fR] [\fI\,files\/\fR] .SH DESCRIPTION -Tarlz is a small and simple implementation of the tar archiver. By default -tarlz creates, lists and extracts archives in a simplified posix pax format -compressed with lzip on a per file basis. Each tar member is compressed in -its own lzip member, as well as the end\-of\-file blocks. This method is fully -backward compatible with standard tar tools like GNU tar, which treat the -resulting multimember tar.lz archive like any other tar.lz archive. Tarlz -can append files to the end of such compressed archives. +Tarlz is a combined implementation of the tar archiver and the lzip +compressor. By default tarlz creates, lists and extracts archives in a +simplified posix pax format compressed with lzip on a per file basis. Each +tar member is compressed in its own lzip member, as well as the end\-of\-file +blocks. This method adds an indexed lzip layer on top of the tar archive, +making it possible to decode the archive safely in parallel. 
The resulting +multimember tar.lz archive is fully backward compatible with standard tar +tools like GNU tar, which treat it like any other tar.lz archive. Tarlz can +append files to the end of such compressed archives. .PP The tarlz file format is a safe posix\-style backup format. In case of corruption, tarlz can extract all the undamaged members from the tar.lz @@ -40,6 +42,9 @@ change to directory \fB\-f\fR, \fB\-\-file=\fR use archive file .TP +\fB\-n\fR, \fB\-\-threads=\fR +set number of decompression threads [2] +.TP \fB\-q\fR, \fB\-\-quiet\fR suppress all messages .TP @@ -97,8 +102,8 @@ Report bugs to lzip\-bug@nongnu.org .br Tarlz home page: http://www.nongnu.org/lzip/tarlz.html .SH COPYRIGHT -Copyright \(co 2018 Antonio Diaz Diaz. -Using lzlib 1.11\-rc2 +Copyright \(co 2019 Antonio Diaz Diaz. +Using lzlib 1.11 License GPLv2+: GNU GPL version 2 or later .br This is free software: you are free to change and redistribute it. diff --git a/doc/tarlz.info b/doc/tarlz.info index d6d17d0..7f90766 100644 --- a/doc/tarlz.info +++ b/doc/tarlz.info @@ -11,7 +11,7 @@ File: tarlz.info, Node: Top, Next: Introduction, Up: (dir) Tarlz Manual ************ -This manual is for Tarlz (version 0.8, 16 December 2018). +This manual is for Tarlz (version 0.9, 22 January 2019). * Menu: @@ -19,12 +19,13 @@ This manual is for Tarlz (version 0.8, 16 December 2018). * Invoking tarlz:: Command line interface * File format:: Detailed format of the compressed archive * Amendments to pax format:: The reasons for the differences with pax +* Multi-threaded tar:: Limitations of parallel tar decoding * Examples:: A small tutorial with examples * Problems:: Reporting bugs * Concept index:: Index of concepts - Copyright (C) 2013-2018 Antonio Diaz Diaz. + Copyright (C) 2013-2019 Antonio Diaz Diaz. This manual is free documentation: you have unlimited permission to copy, distribute and modify it. 
@@ -35,12 +36,14 @@ File: tarlz.info, Node: Introduction, Next: Invoking tarlz, Prev: Top, Up: T 1 Introduction ************** -Tarlz is a small and simple implementation of the tar archiver. By -default tarlz creates, lists and extracts archives in a simplified -posix pax format compressed with lzip on a per file basis. Each tar -member is compressed in its own lzip member, as well as the end-of-file -blocks. This method is fully backward compatible with standard tar tools -like GNU tar, which treat the resulting multimember tar.lz archive like +Tarlz is a combined implementation of the tar archiver and the lzip +compressor. By default tarlz creates, lists and extracts archives in a +simplified posix pax format compressed with lzip on a per file basis. +Each tar member is compressed in its own lzip member, as well as the +end-of-file blocks. This method adds an indexed lzip layer on top of +the tar archive, making it possible to decode the archive safely in +parallel. The resulting multimember tar.lz archive is fully backward +compatible with standard tar tools like GNU tar, which treat it like any other tar.lz archive. Tarlz can append files to the end of such compressed archives. @@ -52,7 +55,7 @@ less efficient than compressing the whole tar archive, but it has the following advantages: * The resulting multimember tar.lz archive can be decompressed in - parallel with plzip, multiplying the decompression speed. + parallel, multiplying the decompression speed. * New members can be appended to the archive (by removing the EOF member) just like to an uncompressed tar archive. @@ -74,10 +77,6 @@ with standard tar tools. *Note crc32::. Tarlz does not understand other tar formats like 'gnu', 'oldgnu', 'star' or 'v7'. - Tarlz is intended as a showcase project for the maintainers of real -tar programs to evaluate the format and perhaps implement it in their -tools. 
-  File: tarlz.info, Node: Invoking tarlz, Next: File format, Prev: Introduction, Up: Top @@ -141,6 +140,21 @@ archive 'foo'. Use archive file ARCHIVE. '-' used as an ARCHIVE argument reads from standard input or writes to standard output. +'-n N' +'--threads=N' + Set the number of decompression threads, overriding the system's + default. Valid values range from 0 to "as many as your system can + support". A value of 0 disables threads entirely. If this option + is not used, tarlz tries to detect the number of processors in the + system and use it as default value. 'tarlz --help' shows the + system's default value. This option currently only has effect when + listing the contents of a multimember compressed archive. *Note + Multi-threaded tar::. + + Note that the number of usable threads is limited during + decompression to the number of lzip members in the tar.lz archive, + which you can find by running 'lzip -lv archive.tar.lz'. + '-q' '--quiet' Quiet operation. Suppress all messages. @@ -288,6 +302,11 @@ following sequence: * Zero or more blocks that contain the contents of the file. + Each tar member must be contiguously stored in a lzip member for the +parallel decoding operations like '--list' to work. If any tar member +is split over two or more lzip members, the archive must be decoded +sequentially. *Note Multi-threaded tar::. + At the end of the archive file there are two 512-byte blocks filled with binary zeros, interpreted as an end-of-archive indicator. These EOF blocks are either compressed in a separate lzip member or compressed @@ -417,19 +436,12 @@ record is used to store the linkname. The mode field provides 12 access permission bits. 
The following table shows the symbolic name of each bit and its octal value: -Bit Name Bit value -S_ISUID 04000 -S_ISGID 02000 -S_ISVTX 01000 -S_IRUSR 00400 -S_IWUSR 00200 -S_IXUSR 00100 -S_IRGRP 00040 -S_IWGRP 00020 -S_IXGRP 00010 -S_IROTH 00004 -S_IWOTH 00002 -S_IXOTH 00001 +Bit Name Value Bit Name Value Bit Name Value +--------------------------------------------------- +S_ISUID 04000 S_ISGID 02000 S_ISVTX 01000 +S_IRUSR 00400 S_IWUSR 00200 S_IXUSR 00100 +S_IRGRP 00040 S_IWGRP 00020 S_IXGRP 00010 +S_IROTH 00004 S_IWOTH 00002 S_IXOTH 00001 The uid and gid fields are the user and group ID of the owner and group of the file, respectively. @@ -485,12 +497,16 @@ file archived: The magic field contains the ASCII null-terminated string "ustar". The version field contains the characters "00" (0x30,0x30). The fields -uname, and gname are null-terminated character strings. Each numeric -field contains a leading zero-filled, null-terminated octal number using -digits from the ISO/IEC 646:1991 (ASCII) standard. +uname, and gname are null-terminated character strings except when all +characters in the array contain non-null characters including the last +character. Each numeric field contains a leading space- or zero-filled, +optionally null-terminated octal number using digits from the ISO/IEC +646:1991 (ASCII) standard. Tarlz is able to decode numeric fields 1 +byte larger than standard ustar by not requiring a terminating null +character.  -File: tarlz.info, Node: Amendments to pax format, Next: Examples, Prev: File format, Up: Top +File: tarlz.info, Node: Amendments to pax format, Next: Multi-threaded tar, Prev: File format, Up: Top 4 The reasons for the differences with pax ****************************************** @@ -508,7 +524,7 @@ and the concrete reasons to implement them. The posix pax format has a serious flaw. The metadata stored in pax extended records are not protected by any kind of check sequence. 
Corruption in a long filename may cause the extraction of the file in -the wrong place without warning. Corruption in a long file size may +the wrong place without warning. Corruption in a large file size may cause the truncation of the file or the appending of garbage to the file, both followed by a spurious warning about a corrupt header far from the place of the undetected corruption. @@ -573,9 +589,57 @@ prevents accidental double UTF-8 conversions. If the need arises this behavior will be adjusted with a command line option in the future.  -File: tarlz.info, Node: Examples, Next: Problems, Prev: Amendments to pax format, Up: Top +File: tarlz.info, Node: Multi-threaded tar, Next: Examples, Prev: Amendments to pax format, Up: Top + +5 Limitations of parallel tar decoding +************************************** + +Safely decoding an arbitrary tar archive in parallel is impossible. For +example, if a tar archive containing another tar archive is decoded +starting from some position other than the beginning, there is no way +to know if the first header found there belongs to the outer tar +archive or to the inner tar archive. Tar is a format inherently serial; +it was designed for tapes. + + In the case of compressed tar archives, the start of each compressed +block determines one point through which the tar archive can be decoded +in parallel. Therefore, in tar.lz archives the decoding operations +can't be parallelized if the tar members are not aligned with the lzip +members. Tar archives compressed with plzip can't be decoded in +parallel because tar and plzip do not have a way to align both sets of +members. Certainly one can decompress one such archive with a +multi-threaded tool like plzip, but the increase in speed is not as +large as it could be because plzip must serialize the decompressed data +and pass them to tar, which decodes them sequentially, one tar member +at a time. 
+ + On the other hand, if the tar.lz archive is created with a tool like +tarlz, which can guarantee the alignment between tar members and lzip +members because it controls both archiving and compression, then the +lzip format becomes an indexed layer on top of the tar archive which +makes possible decoding it safely in parallel. + + Tarlz is able to automatically decode aligned and unaligned +multimember tar.lz archives, keeping backwards compatibility. If tarlz +finds a member misalignment during multi-threaded decoding, it switches +to single-threaded mode and continues decoding the archive. Currently +only the '--list' option is able to do multi-threaded decoding. + + If the files in the archive are large, multi-threaded '--list' on a +regular tar.lz archive can be hundreds of times faster than sequential +'--list' because, in addition to using several processors, it only +needs to decompress part of each lzip member. See the following example +listing the Silesia corpus on a dual core machine: + + tarlz -9 -cf silesia.tar.lz silesia + time lzip -cd silesia.tar.lz | tar -tf - (5.032s) + time plzip -cd silesia.tar.lz | tar -tf - (3.256s) + time tarlz -tf silesia.tar.lz (0.020s) + + +File: tarlz.info, Node: Examples, Next: Problems, Prev: Multi-threaded tar, Up: Top -5 A small tutorial with examples +6 A small tutorial with examples ******************************** Example 1: Create a multimember compressed archive 'archive.tar.lz' @@ -633,7 +697,7 @@ Example 8: Copy the contents of directory 'sourcedir' to the directory  File: tarlz.info, Node: Problems, Next: Concept index, Prev: Examples, Up: Top -6 Reporting bugs +7 Reporting bugs **************** There are probably bugs in tarlz. 
There are certainly errors and @@ -670,16 +734,17 @@ Concept index  Tag Table: Node: Top223 -Node: Introduction946 -Node: Invoking tarlz3084 -Node: File format9606 -Ref: key_crc3214138 -Node: Amendments to pax format19215 -Ref: crc3219729 -Ref: flawed-compat20753 -Node: Examples23126 -Node: Problems24802 -Node: Concept index25328 +Node: Introduction1012 +Node: Invoking tarlz3124 +Node: File format10384 +Ref: key_crc3215169 +Node: Amendments to pax format20586 +Ref: crc3221110 +Ref: flawed-compat22135 +Node: Multi-threaded tar24508 +Node: Examples27012 +Node: Problems28682 +Node: Concept index29208  End Tag Table diff --git a/doc/tarlz.texi b/doc/tarlz.texi index 4c6d16a..d9bdc14 100644 --- a/doc/tarlz.texi +++ b/doc/tarlz.texi @@ -6,8 +6,8 @@ @finalout @c %**end of header -@set UPDATED 16 December 2018 -@set VERSION 0.8 +@set UPDATED 22 January 2019 +@set VERSION 0.9 @dircategory Data Compression @direntry @@ -39,13 +39,14 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}). * Invoking tarlz:: Command line interface * File format:: Detailed format of the compressed archive * Amendments to pax format:: The reasons for the differences with pax +* Multi-threaded tar:: Limitations of parallel tar decoding * Examples:: A small tutorial with examples * Problems:: Reporting bugs * Concept index:: Index of concepts @end menu @sp 1 -Copyright @copyright{} 2013-2018 Antonio Diaz Diaz. +Copyright @copyright{} 2013-2019 Antonio Diaz Diaz. This manual is free documentation: you have unlimited permission to copy, distribute and modify it. @@ -55,18 +56,20 @@ to copy, distribute and modify it. @chapter Introduction @cindex introduction -@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a small and simple -implementation of the tar archiver. By default tarlz creates, lists and -extracts archives in a simplified posix pax format compressed with -@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} on a per file basis. 
Each -tar member is compressed in its own lzip member, as well as the end-of-file -blocks. This method is fully backward compatible with standard tar tools -like GNU tar, which treat the resulting multimember tar.lz archive like any -other tar.lz archive. Tarlz can append files to the end of such compressed -archives. - -Tarlz can create tar archives with four levels of compression -granularity; per file, per directory, appendable solid, and solid. +@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a combined +implementation of the tar archiver and the +@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} compressor. By default +tarlz creates, lists and extracts archives in a simplified posix pax format +compressed with lzip on a per file basis. Each tar member is compressed in +its own lzip member, as well as the end-of-file blocks. This method adds an +indexed lzip layer on top of the tar archive, making it possible to decode +the archive safely in parallel. The resulting multimember tar.lz archive is +fully backward compatible with standard tar tools like GNU tar, which treat +it like any other tar.lz archive. Tarlz can append files to the end of such +compressed archives. + +Tarlz can create tar archives with four levels of compression granularity; +per file, per directory, appendable solid, and solid. @noindent Of course, compressing each file (or each directory) individually is @@ -76,7 +79,7 @@ following advantages: @itemize @bullet @item The resulting multimember tar.lz archive can be decompressed in -parallel with plzip, multiplying the decompression speed. +parallel, multiplying the decompression speed. @item New members can be appended to the archive (by removing the EOF @@ -102,9 +105,6 @@ standard tar tools. @xref{crc32}. Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu}, @samp{star} or @samp{v7}. 
-Tarlz is intended as a showcase project for the maintainers of real tar -programs to evaluate the format and perhaps implement it in their tools. - @node Invoking tarlz @chapter Invoking tarlz @@ -174,6 +174,20 @@ previous @code{-C} option. Use archive file @var{archive}. @samp{-} used as an @var{archive} argument reads from standard input or writes to standard output. +@item -n @var{n} +@itemx --threads=@var{n} +Set the number of decompression threads, overriding the system's default. +Valid values range from 0 to "as many as your system can support". A value +of 0 disables threads entirely. If this option is not used, tarlz tries to +detect the number of processors in the system and use it as default value. +@w{@samp{tarlz --help}} shows the system's default value. This option +currently only has effect when listing the contents of a multimember +compressed archive. @xref{Multi-threaded tar}. + +Note that the number of usable threads is limited during decompression to +the number of lzip members in the tar.lz archive, which you can find by +running @w{@code{lzip -lv archive.tar.lz}}. + @item -q @itemx --quiet Quiet operation. Suppress all messages. @@ -335,6 +349,11 @@ associated fields in this header block for this file. Zero or more blocks that contain the contents of the file. @end itemize +Each tar member must be contiguously stored in a lzip member for the +parallel decoding operations like @code{--list} to work. If any tar member +is split over two or more lzip members, the archive must be decoded +sequentially. @xref{Multi-threaded tar}. + At the end of the archive file there are two 512-byte blocks filled with binary zeros, interpreted as an end-of-archive indicator. These EOF blocks are either compressed in a separate lzip member or compressed @@ -481,20 +500,12 @@ is used to store the linkname. The mode field provides 12 access permission bits. 
The following table shows the symbolic name of each bit and its octal value: -@multitable {Bit Name} {Bit value} -@item Bit Name @tab Bit value -@item S_ISUID @tab 04000 -@item S_ISGID @tab 02000 -@item S_ISVTX @tab 01000 -@item S_IRUSR @tab 00400 -@item S_IWUSR @tab 00200 -@item S_IXUSR @tab 00100 -@item S_IRGRP @tab 00040 -@item S_IWGRP @tab 00020 -@item S_IXGRP @tab 00010 -@item S_IROTH @tab 00004 -@item S_IWOTH @tab 00002 -@item S_IXOTH @tab 00001 +@multitable {Bit Name} {Value} {Bit Name} {Value} {Bit Name} {Value} +@headitem Bit Name @tab Value @tab Bit Name @tab Value @tab Bit Name @tab Value +@item S_ISUID @tab 04000 @tab S_ISGID @tab 02000 @tab S_ISVTX @tab 01000 +@item S_IRUSR @tab 00400 @tab S_IWUSR @tab 00200 @tab S_IXUSR @tab 00100 +@item S_IRGRP @tab 00040 @tab S_IWGRP @tab 00020 @tab S_IXGRP @tab 00010 +@item S_IROTH @tab 00004 @tab S_IWOTH @tab 00002 @tab S_IXOTH @tab 00001 @end multitable The uid and gid fields are the user and group ID of the owner and group @@ -551,10 +562,13 @@ regular file (type 0). @end table The magic field contains the ASCII null-terminated string "ustar". The -version field contains the characters "00" (0x30,0x30). The fields -uname, and gname are null-terminated character strings. Each numeric -field contains a leading zero-filled, null-terminated octal number using -digits from the ISO/IEC 646:1991 (ASCII) standard. +version field contains the characters "00" (0x30,0x30). The fields uname, +and gname are null-terminated character strings except when all characters +in the array contain non-null characters including the last character. Each +numeric field contains a leading space- or zero-filled, optionally +null-terminated octal number using digits from the ISO/IEC 646:1991 (ASCII) +standard. Tarlz is able to decode numeric fields 1 byte larger than standard +ustar by not requiring a terminating null character. @node Amendments to pax format @@ -574,7 +588,7 @@ concrete reasons to implement them. 
The posix pax format has a serious flaw. The metadata stored in pax extended records are not protected by any kind of check sequence. Corruption in a long filename may cause the extraction of the file in the wrong place -without warning. Corruption in a long file size may cause the truncation of +without warning. Corruption in a large file size may cause the truncation of the file or the appending of garbage to the file, both followed by a spurious warning about a corrupt header far from the place of the undetected corruption. @@ -636,6 +650,52 @@ double UTF-8 conversions. If the need arises this behavior will be adjusted with a command line option in the future. +@node Multi-threaded tar +@chapter Limitations of parallel tar decoding + +Safely decoding an arbitrary tar archive in parallel is impossible. For +example, if a tar archive containing another tar archive is decoded starting +from some position other than the beginning, there is no way to know if the +first header found there belongs to the outer tar archive or to the inner +tar archive. Tar is a format inherently serial; it was designed for tapes. + +In the case of compressed tar archives, the start of each compressed block +determines one point through which the tar archive can be decoded in +parallel. Therefore, in tar.lz archives the decoding operations can't be +parallelized if the tar members are not aligned with the lzip members. Tar +archives compressed with plzip can't be decoded in parallel because tar and +plzip do not have a way to align both sets of members. Certainly one can +decompress one such archive with a multi-threaded tool like plzip, but the +increase in speed is not as large as it could be because plzip must +serialize the decompressed data and pass them to tar, which decodes them +sequentially, one tar member at a time. 
+ +On the other hand, if the tar.lz archive is created with a tool like tarlz, +which can guarantee the alignment between tar members and lzip members +because it controls both archiving and compression, then the lzip format +becomes an indexed layer on top of the tar archive which makes possible +decoding it safely in parallel. + +Tarlz is able to automatically decode aligned and unaligned multimember +tar.lz archives, keeping backwards compatibility. If tarlz finds a member +misalignment during multi-threaded decoding, it switches to single-threaded +mode and continues decoding the archive. Currently only the @code{--list} +option is able to do multi-threaded decoding. + +If the files in the archive are large, multi-threaded @code{--list} on a +regular tar.lz archive can be hundreds of times faster than sequential +@code{--list} because, in addition to using several processors, it only +needs to decompress part of each lzip member. See the following example +listing the Silesia corpus on a dual core machine: + +@example +tarlz -9 -cf silesia.tar.lz silesia +time lzip -cd silesia.tar.lz | tar -tf - (5.032s) +time plzip -cd silesia.tar.lz | tar -tf - (3.256s) +time tarlz -tf silesia.tar.lz (0.020s) +@end example + + @node Examples @chapter A small tutorial with examples @cindex examples
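Note on the numeric-field wording changed by this patch: the new text says each numeric field holds "a leading space- or zero-filled, optionally null-terminated octal number", and that tarlz accepts fields one byte larger than standard ustar by not requiring the terminating null. The sketch below illustrates that rule in Python. It is not tarlz's code. The function name parse_octal_field is hypothetical, and treating an empty field as 0 and tolerating trailing spaces are assumptions made for the example only.

```python
def parse_octal_field(field: bytes) -> int:
    """Decode a ustar-style numeric header field (illustrative sketch only)."""
    # Keep only the bytes before an optional terminating NUL.
    digits = field.split(b"\0", 1)[0]
    # Drop space padding (leading per the manual; trailing is an assumption).
    digits = digits.strip(b" ")
    if not digits:
        return 0  # assumption: an empty field is read as zero
    # Iterating over bytes yields ints, so this checks each byte value.
    if any(c not in b"01234567" for c in digits):
        raise ValueError(f"not an octal field: {field!r}")
    return int(digits.decode("ascii"), 8)


if __name__ == "__main__":
    # Standard 8-byte mode field: zero-filled and NUL-terminated.
    print(oct(parse_octal_field(b"0000644\0")))
    # Space-filled variant of the same value.
    print(oct(parse_octal_field(b"    644\0")))
    # Field using all 8 bytes for digits, with no terminating NUL.
    print(oct(parse_octal_field(b"77777777")))
```

Splitting on the first NUL before stripping the padding means the terminated and unterminated layouts go through the same code path, which is the point of the relaxed rule described in the patch.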