From 2f15376ba464cf08e710c3353bdacc4f503e11b4 Mon Sep 17 00:00:00 2001
From: Daniel Baumann
Date: Wed, 23 Jan 2019 18:42:07 +0100
Subject: Merging upstream version 0.9.

Signed-off-by: Daniel Baumann
---
 doc/tarlz.info | 153 ++++++++++++++++++++++++++++++++++++++++-----------------
 1 file changed, 109 insertions(+), 44 deletions(-)

diff --git a/doc/tarlz.info b/doc/tarlz.info
index d6d17d0..7f90766 100644
--- a/doc/tarlz.info
+++ b/doc/tarlz.info
@@ -11,7 +11,7 @@ File: tarlz.info, Node: Top, Next: Introduction, Up: (dir)
 Tarlz Manual
 ************
 
-This manual is for Tarlz (version 0.8, 16 December 2018).
+This manual is for Tarlz (version 0.9, 22 January 2019).
 
 * Menu:
 
@@ -19,12 +19,13 @@ This manual is for Tarlz (version 0.8, 16 December 2018).
 * Invoking tarlz::            Command line interface
 * File format::               Detailed format of the compressed archive
 * Amendments to pax format::  The reasons for the differences with pax
+* Multi-threaded tar::        Limitations of parallel tar decoding
 * Examples::                  A small tutorial with examples
 * Problems::                  Reporting bugs
 * Concept index::             Index of concepts
 
 
-   Copyright (C) 2013-2018 Antonio Diaz Diaz.
+   Copyright (C) 2013-2019 Antonio Diaz Diaz.
 
    This manual is free documentation: you have unlimited permission to
 copy, distribute and modify it.
@@ -35,12 +36,14 @@ File: tarlz.info, Node: Introduction, Next: Invoking tarlz, Prev: Top, Up: T
 1 Introduction
 **************
 
-Tarlz is a small and simple implementation of the tar archiver. By
-default tarlz creates, lists and extracts archives in a simplified
-posix pax format compressed with lzip on a per file basis. Each tar
-member is compressed in its own lzip member, as well as the end-of-file
-blocks. This method is fully backward compatible with standard tar tools
-like GNU tar, which treat the resulting multimember tar.lz archive like
+Tarlz is a combined implementation of the tar archiver and the lzip
+compressor. By default tarlz creates, lists and extracts archives in a
+simplified posix pax format compressed with lzip on a per file basis.
+Each tar member is compressed in its own lzip member, as well as the
+end-of-file blocks. This method adds an indexed lzip layer on top of
+the tar archive, making it possible to decode the archive safely in
+parallel. The resulting multimember tar.lz archive is fully backward
+compatible with standard tar tools like GNU tar, which treat it like
 any other tar.lz archive. Tarlz can append files to the end of such
 compressed archives.
 
@@ -52,7 +55,7 @@ less efficient than compressing the whole tar archive, but it has the
 following advantages:
 
    * The resulting multimember tar.lz archive can be decompressed in
-     parallel with plzip, multiplying the decompression speed.
+     parallel, multiplying the decompression speed.
 
    * New members can be appended to the archive (by removing the EOF
      member) just like to an uncompressed tar archive.
@@ -74,10 +77,6 @@ with standard tar tools. *Note crc32::.
    Tarlz does not understand other tar formats like 'gnu', 'oldgnu',
 'star' or 'v7'.
 
-   Tarlz is intended as a showcase project for the maintainers of real
-tar programs to evaluate the format and perhaps implement it in their
-tools.
-
 
 File: tarlz.info, Node: Invoking tarlz, Next: File format, Prev: Introduction, Up: Top
 
@@ -141,6 +140,21 @@ archive 'foo'.
      Use archive file ARCHIVE. '-' used as an ARCHIVE argument reads
      from standard input or writes to standard output.
 
+'-n N'
+'--threads=N'
+     Set the number of decompression threads, overriding the system's
+     default. Valid values range from 0 to "as many as your system can
+     support". A value of 0 disables threads entirely. If this option
+     is not used, tarlz tries to detect the number of processors in the
+     system and use it as default value. 'tarlz --help' shows the
+     system's default value. This option currently only has effect when
+     listing the contents of a multimember compressed archive. *Note
+     Multi-threaded tar::.
+
+     Note that the number of usable threads is limited during
+     decompression to the number of lzip members in the tar.lz archive,
+     which you can find by running 'lzip -lv archive.tar.lz'.
+
 '-q'
 '--quiet'
      Quiet operation. Suppress all messages.
@@ -288,6 +302,11 @@ following sequence:
 
    * Zero or more blocks that contain the contents of the file.
 
+   Each tar member must be contiguously stored in a lzip member for the
+parallel decoding operations like '--list' to work. If any tar member
+is split over two or more lzip members, the archive must be decoded
+sequentially. *Note Multi-threaded tar::.
+
    At the end of the archive file there are two 512-byte blocks filled
 with binary zeros, interpreted as an end-of-archive indicator. These EOF
 blocks are either compressed in a separate lzip member or compressed
@@ -417,19 +436,12 @@ record is used to store the linkname.
 The mode field provides 12 access permission bits. The following table
 shows the symbolic name of each bit and its octal value:
 
-Bit Name   Bit value
-S_ISUID    04000
-S_ISGID    02000
-S_ISVTX    01000
-S_IRUSR    00400
-S_IWUSR    00200
-S_IXUSR    00100
-S_IRGRP    00040
-S_IWGRP    00020
-S_IXGRP    00010
-S_IROTH    00004
-S_IWOTH    00002
-S_IXOTH    00001
+Bit Name   Value   Bit Name   Value   Bit Name   Value
+---------------------------------------------------
+S_ISUID    04000   S_ISGID    02000   S_ISVTX    01000
+S_IRUSR    00400   S_IWUSR    00200   S_IXUSR    00100
+S_IRGRP    00040   S_IWGRP    00020   S_IXGRP    00010
+S_IROTH    00004   S_IWOTH    00002   S_IXOTH    00001
 
    The uid and gid fields are the user and group ID of the owner and
 group of the file, respectively.
@@ -485,12 +497,16 @@ file archived:
 
    The magic field contains the ASCII null-terminated string "ustar".
 The version field contains the characters "00" (0x30,0x30). The fields
-uname, and gname are null-terminated character strings. Each numeric
-field contains a leading zero-filled, null-terminated octal number using
-digits from the ISO/IEC 646:1991 (ASCII) standard.
+uname, and gname are null-terminated character strings except when all
+characters in the array contain non-null characters including the last
+character. Each numeric field contains a leading space- or zero-filled,
+optionally null-terminated octal number using digits from the ISO/IEC
+646:1991 (ASCII) standard. Tarlz is able to decode numeric fields 1
+byte larger than standard ustar by not requiring a terminating null
+character.
 
 
-File: tarlz.info, Node: Amendments to pax format, Next: Examples, Prev: File format, Up: Top
+File: tarlz.info, Node: Amendments to pax format, Next: Multi-threaded tar, Prev: File format, Up: Top
 
 4 The reasons for the differences with pax
 ******************************************
@@ -508,7 +524,7 @@ and the concrete reasons to implement them.
 The posix pax format has a serious flaw. The metadata stored in pax
 extended records are not protected by any kind of check sequence.
 Corruption in a long filename may cause the extraction of the file in
-the wrong place without warning. Corruption in a long file size may
+the wrong place without warning. Corruption in a large file size may
 cause the truncation of the file or the appending of garbage to the
 file, both followed by a spurious warning about a corrupt header far
 from the place of the undetected corruption.
@@ -573,9 +589,57 @@ prevents accidental double UTF-8 conversions. If the need arises this
 behavior will be adjusted with a command line option in the future.
 
 
-File: tarlz.info, Node: Examples, Next: Problems, Prev: Amendments to pax format, Up: Top
+File: tarlz.info, Node: Multi-threaded tar, Next: Examples, Prev: Amendments to pax format, Up: Top
+
+5 Limitations of parallel tar decoding
+**************************************
+
+Safely decoding an arbitrary tar archive in parallel is impossible. For
+example, if a tar archive containing another tar archive is decoded
+starting from some position other than the beginning, there is no way
+to know if the first header found there belongs to the outer tar
+archive or to the inner tar archive. Tar is a format inherently serial;
+it was designed for tapes.
+
+   In the case of compressed tar archives, the start of each compressed
+block determines one point through which the tar archive can be decoded
+in parallel. Therefore, in tar.lz archives the decoding operations
+can't be parallelized if the tar members are not aligned with the lzip
+members. Tar archives compressed with plzip can't be decoded in
+parallel because tar and plzip do not have a way to align both sets of
+members. Certainly one can decompress one such archive with a
+multi-threaded tool like plzip, but the increase in speed is not as
+large as it could be because plzip must serialize the decompressed data
+and pass them to tar, which decodes them sequentially, one tar member
+at a time.
+
+   On the other hand, if the tar.lz archive is created with a tool like
+tarlz, which can guarantee the alignment between tar members and lzip
+members because it controls both archiving and compression, then the
+lzip format becomes an indexed layer on top of the tar archive which
+makes possible decoding it safely in parallel.
+
+   Tarlz is able to automatically decode aligned and unaligned
+multimember tar.lz archives, keeping backwards compatibility. If tarlz
+finds a member misalignment during multi-threaded decoding, it switches
+to single-threaded mode and continues decoding the archive. Currently
+only the '--list' option is able to do multi-threaded decoding.
+
+   If the files in the archive are large, multi-threaded '--list' on a
+regular tar.lz archive can be hundreds of times faster than sequential
+'--list' because, in addition to using several processors, it only
+needs to decompress part of each lzip member. See the following example
+listing the Silesia corpus on a dual core machine:
+
+     tarlz -9 -cf silesia.tar.lz silesia
+     time lzip -cd silesia.tar.lz | tar -tf -    (5.032s)
+     time plzip -cd silesia.tar.lz | tar -tf -   (3.256s)
+     time tarlz -tf silesia.tar.lz               (0.020s)
+
+
+File: tarlz.info, Node: Examples, Next: Problems, Prev: Multi-threaded tar, Up: Top
 
-5 A small tutorial with examples
+6 A small tutorial with examples
 ********************************
 
 Example 1: Create a multimember compressed archive 'archive.tar.lz'
@@ -633,7 +697,7 @@ Example 8: Copy the contents of directory 'sourcedir' to the directory
 
 File: tarlz.info, Node: Problems, Next: Concept index, Prev: Examples, Up: Top
 
-6 Reporting bugs
+7 Reporting bugs
 ****************
 
 There are probably bugs in tarlz. There are certainly errors and
@@ -670,16 +734,17 @@ Concept index
 
 Tag Table:
 Node: Top223
-Node: Introduction946
-Node: Invoking tarlz3084
-Node: File format9606
-Ref: key_crc3214138
-Node: Amendments to pax format19215
-Ref: crc3219729
-Ref: flawed-compat20753
-Node: Examples23126
-Node: Problems24802
-Node: Concept index25328
+Node: Introduction1012
+Node: Invoking tarlz3124
+Node: File format10384
+Ref: key_crc3215169
+Node: Amendments to pax format20586
+Ref: crc3221110
+Ref: flawed-compat22135
+Node: Multi-threaded tar24508
+Node: Examples27012
+Node: Problems28682
+Node: Concept index29208
 
 End Tag Table
 
-- 
cgit v1.2.3
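
Reader's note (not part of the patch): the File format hunk above relaxes
the description of ustar numeric fields to "space- or zero-filled,
optionally null-terminated" octal numbers, and reflows the table of mode
bits. The sketch below illustrates one way such a field could be parsed and
a mode value mapped to the symbolic names from that table. It is written in
Python for brevity and is not taken from the tarlz sources; the function
names parse_octal_field and mode_bit_names are hypothetical, and treating an
empty field as 0 is an assumption.

```python
# Illustrative sketch only; names are hypothetical, not tarlz's own code.

# Symbolic names and octal values as listed in the manual's mode table.
MODE_BITS = [
    ("S_ISUID", 0o4000), ("S_ISGID", 0o2000), ("S_ISVTX", 0o1000),
    ("S_IRUSR", 0o0400), ("S_IWUSR", 0o0200), ("S_IXUSR", 0o0100),
    ("S_IRGRP", 0o0040), ("S_IWGRP", 0o0020), ("S_IXGRP", 0o0010),
    ("S_IROTH", 0o0004), ("S_IWOTH", 0o0002), ("S_IXOTH", 0o0001),
]


def parse_octal_field(field: bytes) -> int:
    """Parse a space- or zero-filled, optionally null-terminated octal field."""
    # Keep only what precedes the first NUL, if there is one. A field with
    # no terminating NUL is accepted and uses its full width.
    text = field.split(b"\0", 1)[0].strip(b" ")
    if not text:
        return 0  # assumption: an empty field is read as zero
    if any(c not in b"01234567" for c in text):
        raise ValueError("non-octal digit in numeric field")
    return int(text.decode("ascii"), 8)


def mode_bit_names(mode: int) -> list:
    """Return the symbolic names of the permission bits set in mode."""
    return [name for name, value in MODE_BITS if mode & value]


if __name__ == "__main__":
    mode = parse_octal_field(b"00004755")  # 8 bytes, no terminating NUL
    print(oct(mode), mode_bit_names(mode))
```

Accepting a field without the terminating null is what lets a decoder read
values one digit longer than a strictly terminated field of the same width
could hold, which matches the behavior the patch describes for tarlz.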