From 2f15376ba464cf08e710c3353bdacc4f503e11b4 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Wed, 23 Jan 2019 18:42:07 +0100
Subject: Merging upstream version 0.9.

Signed-off-by: Daniel Baumann
---
 doc/tarlz.texi | 136 +++++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 98 insertions(+), 38 deletions(-)

(limited to 'doc/tarlz.texi')

diff --git a/doc/tarlz.texi b/doc/tarlz.texi
index 4c6d16a..d9bdc14 100644
--- a/doc/tarlz.texi
+++ b/doc/tarlz.texi
@@ -6,8 +6,8 @@
 @finalout
 @c %**end of header

-@set UPDATED 16 December 2018
-@set VERSION 0.8
+@set UPDATED 22 January 2019
+@set VERSION 0.9

 @dircategory Data Compression
 @direntry
@@ -39,13 +39,14 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
 * Invoking tarlz:: Command line interface
 * File format:: Detailed format of the compressed archive
 * Amendments to pax format:: The reasons for the differences with pax
+* Multi-threaded tar:: Limitations of parallel tar decoding
 * Examples:: A small tutorial with examples
 * Problems:: Reporting bugs
 * Concept index:: Index of concepts
 @end menu

 @sp 1
-Copyright @copyright{} 2013-2018 Antonio Diaz Diaz.
+Copyright @copyright{} 2013-2019 Antonio Diaz Diaz.

 This manual is free documentation: you have unlimited permission
 to copy, distribute and modify it.
@@ -55,18 +56,20 @@ to copy, distribute and modify it.
 @chapter Introduction
 @cindex introduction

-@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a small and simple
-implementation of the tar archiver. By default tarlz creates, lists and
-extracts archives in a simplified posix pax format compressed with
-@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} on a per file basis. Each
-tar member is compressed in its own lzip member, as well as the end-of-file
-blocks. This method is fully backward compatible with standard tar tools
-like GNU tar, which treat the resulting multimember tar.lz archive like any
-other tar.lz archive.
Tarlz can append files to the end of such compressed
-archives.
-
-Tarlz can create tar archives with four levels of compression
-granularity; per file, per directory, appendable solid, and solid.
+@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a combined
+implementation of the tar archiver and the
+@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} compressor. By default
+tarlz creates, lists and extracts archives in a simplified posix pax format
+compressed with lzip on a per file basis. Each tar member is compressed in
+its own lzip member, as well as the end-of-file blocks. This method adds an
+indexed lzip layer on top of the tar archive, making it possible to decode
+the archive safely in parallel. The resulting multimember tar.lz archive is
+fully backward compatible with standard tar tools like GNU tar, which treat
+it like any other tar.lz archive. Tarlz can append files to the end of such
+compressed archives.
+
+Tarlz can create tar archives with four levels of compression granularity;
+per file, per directory, appendable solid, and solid.

 @noindent
 Of course, compressing each file (or each directory) individually is
@@ -76,7 +79,7 @@ following advantages:
 @itemize @bullet
 @item
 The resulting multimember tar.lz archive can be decompressed in
-parallel with plzip, multiplying the decompression speed.
+parallel, multiplying the decompression speed.

 @item
 New members can be appended to the archive (by removing the EOF
@@ -102,9 +105,6 @@ standard tar tools. @xref{crc32}.
 Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu},
 @samp{star} or @samp{v7}.

-Tarlz is intended as a showcase project for the maintainers of real tar
-programs to evaluate the format and perhaps implement it in their tools.
-

 @node Invoking tarlz
 @chapter Invoking tarlz
@@ -174,6 +174,20 @@ previous @code{-C} option.
 Use archive file @var{archive}. @samp{-} used as an @var{archive}
 argument reads from standard input or writes to standard output.
+@item -n @var{n}
+@itemx --threads=@var{n}
+Set the number of decompression threads, overriding the system's default.
+Valid values range from 0 to "as many as your system can support". A value
+of 0 disables threads entirely. If this option is not used, tarlz tries to
+detect the number of processors in the system and use it as default value.
+@w{@samp{tarlz --help}} shows the system's default value. This option
+currently only has effect when listing the contents of a multimember
+compressed archive. @xref{Multi-threaded tar}.
+
+Note that the number of usable threads is limited during decompression to
+the number of lzip members in the tar.lz archive, which you can find by
+running @w{@code{lzip -lv archive.tar.lz}}.
+
 @item -q
 @itemx --quiet
 Quiet operation. Suppress all messages.
@@ -335,6 +349,11 @@ associated fields in this header block for this file.
 Zero or more blocks that contain the contents of the file.
 @end itemize

+Each tar member must be contiguously stored in a lzip member for the
+parallel decoding operations like @code{--list} to work. If any tar member
+is split over two or more lzip members, the archive must be decoded
+sequentially. @xref{Multi-threaded tar}.
+
 At the end of the archive file there are two 512-byte blocks filled
 with binary zeros, interpreted as an end-of-archive indicator. These EOF
 blocks are either compressed in a separate lzip member or compressed
@@ -481,20 +500,12 @@ is used to store the linkname.
 The mode field provides 12 access permission bits.
The following table shows the symbolic name of each bit and its octal value:

-@multitable {Bit Name} {Bit value}
-@item Bit Name @tab Bit value
-@item S_ISUID @tab 04000
-@item S_ISGID @tab 02000
-@item S_ISVTX @tab 01000
-@item S_IRUSR @tab 00400
-@item S_IWUSR @tab 00200
-@item S_IXUSR @tab 00100
-@item S_IRGRP @tab 00040
-@item S_IWGRP @tab 00020
-@item S_IXGRP @tab 00010
-@item S_IROTH @tab 00004
-@item S_IWOTH @tab 00002
-@item S_IXOTH @tab 00001
+@multitable {Bit Name} {Value} {Bit Name} {Value} {Bit Name} {Value}
+@headitem Bit Name @tab Value @tab Bit Name @tab Value @tab Bit Name @tab Value
+@item S_ISUID @tab 04000 @tab S_ISGID @tab 02000 @tab S_ISVTX @tab 01000
+@item S_IRUSR @tab 00400 @tab S_IWUSR @tab 00200 @tab S_IXUSR @tab 00100
+@item S_IRGRP @tab 00040 @tab S_IWGRP @tab 00020 @tab S_IXGRP @tab 00010
+@item S_IROTH @tab 00004 @tab S_IWOTH @tab 00002 @tab S_IXOTH @tab 00001
 @end multitable

 The uid and gid fields are the user and group ID of the owner and group
@@ -551,10 +562,13 @@ regular file (type 0).
 @end table

 The magic field contains the ASCII null-terminated string "ustar". The
-version field contains the characters "00" (0x30,0x30). The fields
-uname, and gname are null-terminated character strings. Each numeric
-field contains a leading zero-filled, null-terminated octal number using
-digits from the ISO/IEC 646:1991 (ASCII) standard.
+version field contains the characters "00" (0x30,0x30). The fields uname,
+and gname are null-terminated character strings except when all characters
+in the array contain non-null characters including the last character. Each
+numeric field contains a leading space- or zero-filled, optionally
+null-terminated octal number using digits from the ISO/IEC 646:1991 (ASCII)
+standard. Tarlz is able to decode numeric fields 1 byte larger than standard
+ustar by not requiring a terminating null character.


 @node Amendments to pax format
@@ -574,7 +588,7 @@ concrete reasons to implement them.
 The posix pax format has a serious flaw. The metadata stored in pax extended
 records are not protected by any kind of check sequence. Corruption in a
 long filename may cause the extraction of the file in the wrong place
-without warning. Corruption in a long file size may cause the truncation of
+without warning. Corruption in a large file size may cause the truncation of
 the file or the appending of garbage to the file, both followed by a
 spurious warning about a corrupt header far from the place of the undetected
 corruption.
@@ -636,6 +650,52 @@ double UTF-8 conversions. If the need arises this behavior will be adjusted
 with a command line option in the future.


+@node Multi-threaded tar
+@chapter Limitations of parallel tar decoding
+
+Safely decoding an arbitrary tar archive in parallel is impossible. For
+example, if a tar archive containing another tar archive is decoded starting
+from some position other than the beginning, there is no way to know if the
+first header found there belongs to the outer tar archive or to the inner
+tar archive. Tar is a format inherently serial; it was designed for tapes.
+
+In the case of compressed tar archives, the start of each compressed block
+determines one point through which the tar archive can be decoded in
+parallel. Therefore, in tar.lz archives the decoding operations can't be
+parallelized if the tar members are not aligned with the lzip members. Tar
+archives compressed with plzip can't be decoded in parallel because tar and
+plzip do not have a way to align both sets of members. Certainly one can
+decompress one such archive with a multi-threaded tool like plzip, but the
+increase in speed is not as large as it could be because plzip must
+serialize the decompressed data and pass them to tar, which decodes them
+sequentially, one tar member at a time.
+
+On the other hand, if the tar.lz archive is created with a tool like tarlz,
+which can guarantee the alignment between tar members and lzip members
+because it controls both archiving and compression, then the lzip format
+becomes an indexed layer on top of the tar archive which makes possible
+decoding it safely in parallel.
+
+Tarlz is able to automatically decode aligned and unaligned multimember
+tar.lz archives, keeping backwards compatibility. If tarlz finds a member
+misalignment during multi-threaded decoding, it switches to single-threaded
+mode and continues decoding the archive. Currently only the @code{--list}
+option is able to do multi-threaded decoding.
+
+If the files in the archive are large, multi-threaded @code{--list} on a
+regular tar.lz archive can be hundreds of times faster than sequential
+@code{--list} because, in addition to using several processors, it only
+needs to decompress part of each lzip member. See the following example
+listing the Silesia corpus on a dual core machine:
+
+@example
+tarlz -9 -cf silesia.tar.lz silesia
+time lzip -cd silesia.tar.lz | tar -tf - (5.032s)
+time plzip -cd silesia.tar.lz | tar -tf - (3.256s)
+time tarlz -tf silesia.tar.lz (0.020s)
+@end example
+
+
 @node Examples
 @chapter A small tutorial with examples
 @cindex examples
--
cgit v1.2.3
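
The patch changes the description of ustar numeric fields to "a leading space- or zero-filled, optionally null-terminated octal number". A minimal Python sketch of a reader that follows that wording is given here for illustration only. It is not tarlz source code. The helper name parse_octal_field, the choice to return 0 for an empty field, and the acceptance of a trailing space as terminator are assumptions made for this example.

```python
def parse_octal_field(field: bytes) -> int:
    """Decode a ustar numeric field (hypothetical helper, not tarlz code)."""
    # Assumption: a trailing NUL or space terminator is optional, so a field
    # may use every byte for digits. Strip terminators and leading space fill.
    digits = field.rstrip(b"\0 ").lstrip(b" ")
    if not digits:
        return 0  # assumption: treat an all-blank field as zero
    # Only ASCII octal digits may remain once fill and terminator are gone.
    if any(c not in b"01234567" for c in digits):
        raise ValueError(f"not an octal field: {field!r}")
    return int(digits.decode("ascii"), 8)


if __name__ == "__main__":
    print(oct(parse_octal_field(b"0000644\0")))  # zero-filled, NUL-terminated
    print(oct(parse_octal_field(b"    644\0")))  # space-filled
    print(oct(parse_octal_field(b"77777777")))   # 8 digits, no terminator
```

The third call shows the case the patch mentions: an 8-byte field with no terminating null holds one more octal digit than a field that reserves its last byte for the terminator.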