3 files changed, 222 insertions, 92 deletions
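
The numeric-field rule documented in this change (a leading space- or
zero-filled octal number whose terminating null is optional) can be
illustrated with a minimal Python sketch. This is not tarlz's code: the
function name parse_octal_field is hypothetical, and tolerating trailing
spaces is an assumption of the sketch rather than something the manual
states.

```python
def parse_octal_field(field: bytes) -> int:
    """Decode a ustar-style numeric header field (illustrative sketch only)."""
    # The terminating NUL is optional: stop at the first NUL if there is one,
    # otherwise use every byte of the field.
    end = field.find(b"\0")
    digits = field if end < 0 else field[:end]
    # Leading fill may be spaces or zeros. Stripping trailing spaces as well
    # is an assumption made here for leniency.
    digits = digits.strip(b" ")
    if not digits:
        return 0  # an empty or all-NUL field is treated as zero in this sketch
    # Iterating over bytes yields ints, which can be tested against a bytes set.
    if any(c not in b"01234567" for c in digits):
        raise ValueError("non-octal digit in numeric field")
    return int(digits.decode("ascii"), 8)
```

With this sketch, an 8-byte field holding eight octal digits and no
terminator decodes the same way as a 7-digit field followed by a null,
which is the "1 byte larger" case the manual describes.
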
diff --git a/doc/tarlz.1 b/doc/tarlz.1
index 906fee0..b83a7e6 100644
--- a/doc/tarlz.1
+++ b/doc/tarlz.1
@@ -1,18 +1,20 @@
 .\" DO NOT MODIFY THIS FILE!  It was generated by help2man 1.46.1.
-.TH TARLZ "1" "December 2018" "tarlz 0.8" "User Commands"
+.TH TARLZ "1" "January 2019" "tarlz 0.9" "User Commands"
 .SH NAME
 tarlz \- creates tar archives with multimember lzip compression
 .SH SYNOPSIS
 .B tarlz
 [\fI\,options\/\fR] [\fI\,files\/\fR]
 .SH DESCRIPTION
-Tarlz is a small and simple implementation of the tar archiver. By default
-tarlz creates, lists and extracts archives in a simplified posix pax format
-compressed with lzip on a per file basis. Each tar member is compressed in
-its own lzip member, as well as the end\-of\-file blocks. This method is fully
-backward compatible with standard tar tools like GNU tar, which treat the
-resulting multimember tar.lz archive like any other tar.lz archive. Tarlz
-can append files to the end of such compressed archives.
+Tarlz is a combined implementation of the tar archiver and the lzip
+compressor. By default tarlz creates, lists and extracts archives in a
+simplified posix pax format compressed with lzip on a per file basis. Each
+tar member is compressed in its own lzip member, as well as the end\-of\-file
+blocks. This method adds an indexed lzip layer on top of the tar archive,
+making it possible to decode the archive safely in parallel. The resulting
+multimember tar.lz archive is fully backward compatible with standard tar
+tools like GNU tar, which treat it like any other tar.lz archive. Tarlz can
+append files to the end of such compressed archives.
 .PP
 The tarlz file format is a safe posix\-style backup format. In case of
 corruption, tarlz can extract all the undamaged members from the tar.lz
@@ -40,6 +42,9 @@ change to directory <dir>
 \fB\-f\fR, \fB\-\-file=\fR<archive>
 use archive file <archive>
 .TP
+\fB\-n\fR, \fB\-\-threads=\fR<n>
+set number of decompression threads [2]
+.TP
 \fB\-q\fR, \fB\-\-quiet\fR
 suppress all messages
 .TP
@@ -97,8 +102,8 @@ Report bugs to lzip\-bug@nongnu.org
 .br
 Tarlz home page: http://www.nongnu.org/lzip/tarlz.html
 .SH COPYRIGHT
-Copyright \(co 2018 Antonio Diaz Diaz.
-Using lzlib 1.11\-rc2
+Copyright \(co 2019 Antonio Diaz Diaz.
+Using lzlib 1.11
 License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl.html>
 .br
 This is free software: you are free to change and redistribute it.
diff --git a/doc/tarlz.info b/doc/tarlz.info
index d6d17d0..7f90766 100644
--- a/doc/tarlz.info
+++ b/doc/tarlz.info
@@ -11,7 +11,7 @@ File: tarlz.info,  Node: Top,  Next: Introduction,  Up: (dir)
 Tarlz Manual
 ************
 
-This manual is for Tarlz (version 0.8, 16 December 2018).
+This manual is for Tarlz (version 0.9, 22 January 2019).
 
 * Menu:
 
@@ -19,12 +19,13 @@ This manual is for Tarlz (version 0.8, 16 December 2018).
 * Invoking tarlz::            Command line interface
 * File format::               Detailed format of the compressed archive
 * Amendments to pax format::  The reasons for the differences with pax
+* Multi-threaded tar::        Limitations of parallel tar decoding
 * Examples::                  A small tutorial with examples
 * Problems::                  Reporting bugs
 * Concept index::             Index of concepts
 
 
-   Copyright (C) 2013-2018 Antonio Diaz Diaz.
+   Copyright (C) 2013-2019 Antonio Diaz Diaz.
 
    This manual is free documentation: you have unlimited permission to
 copy, distribute and modify it.
@@ -35,12 +36,14 @@ File: tarlz.info,  Node: Introduction,  Next: Invoking tarlz,  Prev: Top,  Up: T
 1 Introduction
 **************
 
-Tarlz is a small and simple implementation of the tar archiver. By
-default tarlz creates, lists and extracts archives in a simplified
-posix pax format compressed with lzip on a per file basis. Each tar
-member is compressed in its own lzip member, as well as the end-of-file
-blocks. This method is fully backward compatible with standard tar tools
-like GNU tar, which treat the resulting multimember tar.lz archive like
+Tarlz is a combined implementation of the tar archiver and the lzip
+compressor. By default tarlz creates, lists and extracts archives in a
+simplified posix pax format compressed with lzip on a per file basis.
+Each tar member is compressed in its own lzip member, as well as the
+end-of-file blocks. This method adds an indexed lzip layer on top of
+the tar archive, making it possible to decode the archive safely in
+parallel. The resulting multimember tar.lz archive is fully backward
+compatible with standard tar tools like GNU tar, which treat it like
 any other tar.lz archive. Tarlz can append files to the end of such
 compressed archives.
 
@@ -52,7 +55,7 @@ less efficient than compressing the whole tar archive, but it has the
 following advantages:
 
    * The resulting multimember tar.lz archive can be decompressed in
-     parallel with plzip, multiplying the decompression speed.
+     parallel, multiplying the decompression speed.
 
    * New members can be appended to the archive (by removing the EOF
      member) just like to an uncompressed tar archive.
@@ -74,10 +77,6 @@ with standard tar tools. *Note crc32::.
    Tarlz does not understand other tar formats like 'gnu', 'oldgnu',
 'star' or 'v7'.
 
-   Tarlz is intended as a showcase project for the maintainers of real
-tar programs to evaluate the format and perhaps implement it in their
-tools.
-
 
 File: tarlz.info,  Node: Invoking tarlz,  Next: File format,  Prev: Introduction,  Up: Top
 
@@ -141,6 +140,21 @@ archive 'foo'.
      Use archive file ARCHIVE. '-' used as an ARCHIVE argument reads
      from standard input or writes to standard output.
 
+'-n N'
+'--threads=N'
+     Set the number of decompression threads, overriding the system's
+     default.  Valid values range from 0 to "as many as your system can
+     support". A value of 0 disables threads entirely. If this option
+     is not used, tarlz tries to detect the number of processors in the
+     system and use it as the default value.  'tarlz --help' shows the
+     system's default value. This option currently only has effect when
+     listing the contents of a multimember compressed archive. *Note
+     Multi-threaded tar::.
+
+     Note that the number of usable threads is limited during
+     decompression to the number of lzip members in the tar.lz archive,
+     which you can find by running 'lzip -lv archive.tar.lz'.
+
 '-q'
 '--quiet'
      Quiet operation. Suppress all messages.
@@ -288,6 +302,11 @@ following sequence:
 
    * Zero or more blocks that contain the contents of the file.
 
+   Each tar member must be contiguously stored in an lzip member for the
+parallel decoding operations like '--list' to work. If any tar member
+is split over two or more lzip members, the archive must be decoded
+sequentially. *Note Multi-threaded tar::.
+
    At the end of the archive file there are two 512-byte blocks filled
 with binary zeros, interpreted as an end-of-archive indicator. These EOF
 blocks are either compressed in a separate lzip member or compressed
@@ -417,19 +436,12 @@ record is used to store the linkname.
    The mode field provides 12 access permission bits. The following
 table shows the symbolic name of each bit and its octal value:
 
-Bit Name   Bit value
-S_ISUID    04000
-S_ISGID    02000
-S_ISVTX    01000
-S_IRUSR    00400
-S_IWUSR    00200
-S_IXUSR    00100
-S_IRGRP    00040
-S_IWGRP    00020
-S_IXGRP    00010
-S_IROTH    00004
-S_IWOTH    00002
-S_IXOTH    00001
+Bit Name   Value   Bit Name   Value   Bit Name   Value
+---------------------------------------------------
+S_ISUID    04000   S_ISGID    02000   S_ISVTX    01000
+S_IRUSR    00400   S_IWUSR    00200   S_IXUSR    00100
+S_IRGRP    00040   S_IWGRP    00020   S_IXGRP    00010
+S_IROTH    00004   S_IWOTH    00002   S_IXOTH    00001
 
    The uid and gid fields are the user and group ID of the owner and
 group of the file, respectively.
@@ -485,12 +497,16 @@ file archived:
 
    The magic field contains the ASCII null-terminated string "ustar".
 The version field contains the characters "00" (0x30,0x30). The fields
-uname, and gname are null-terminated character strings. Each numeric
-field contains a leading zero-filled, null-terminated octal number using
-digits from the ISO/IEC 646:1991 (ASCII) standard.
+uname and gname are null-terminated character strings, except when
+every byte of the array, including the last one, is non-null. Each
+numeric field contains a leading space- or zero-filled, optionally
+null-terminated octal number using digits from the ISO/IEC 646:1991
+(ASCII) standard. By not requiring a terminating null character, tarlz
+is able to decode numeric fields 1 byte larger than standard
+ustar.
 
 
-File: tarlz.info,  Node: Amendments to pax format,  Next: Examples,  Prev: File format,  Up: Top
+File: tarlz.info,  Node: Amendments to pax format,  Next: Multi-threaded tar,  Prev: File format,  Up: Top
 
 4 The reasons for the differences with pax
 ******************************************
@@ -508,7 +524,7 @@ and the concrete reasons to implement them.
 The posix pax format has a serious flaw. The metadata stored in pax
 extended records are not protected by any kind of check sequence.
 Corruption in a long filename may cause the extraction of the file in
-the wrong place without warning. Corruption in a long file size may
+the wrong place without warning. Corruption in a large file size may
 cause the truncation of the file or the appending of garbage to the
 file, both followed by a spurious warning about a corrupt header far
 from the place of the undetected corruption.
@@ -573,9 +589,57 @@ prevents accidental double UTF-8 conversions. If the need arises this
 behavior will be adjusted with a command line option in the future.
 
 
-File: tarlz.info,  Node: Examples,  Next: Problems,  Prev: Amendments to pax format,  Up: Top
+File: tarlz.info,  Node: Multi-threaded tar,  Next: Examples,  Prev: Amendments to pax format,  Up: Top
+
+5 Limitations of parallel tar decoding
+**************************************
+
+Safely decoding an arbitrary tar archive in parallel is impossible. For
+example, if a tar archive containing another tar archive is decoded
+starting from some position other than the beginning, there is no way
+to know if the first header found there belongs to the outer tar
+archive or to the inner tar archive. Tar is an inherently serial format;
+it was designed for tapes.
+
+   In the case of compressed tar archives, the start of each compressed
+block determines one point through which the tar archive can be decoded
+in parallel. Therefore, in tar.lz archives the decoding operations
+can't be parallelized if the tar members are not aligned with the lzip
+members. Tar archives compressed with plzip can't be decoded in
+parallel because tar and plzip do not have a way to align both sets of
+members. Certainly one can decompress one such archive with a
+multi-threaded tool like plzip, but the increase in speed is not as
+large as it could be because plzip must serialize the decompressed data
+and pass them to tar, which decodes them sequentially, one tar member
+at a time.
+
+   On the other hand, if the tar.lz archive is created with a tool like
+tarlz, which can guarantee the alignment between tar members and lzip
+members because it controls both archiving and compression, then the
+lzip format becomes an indexed layer on top of the tar archive, which
+makes it possible to decode the archive safely in parallel.
+
+   Tarlz is able to automatically decode aligned and unaligned
+multimember tar.lz archives, keeping backwards compatibility. If tarlz
+finds a member misalignment during multi-threaded decoding, it switches
+to single-threaded mode and continues decoding the archive. Currently
+only the '--list' option is able to do multi-threaded decoding.
+
+   If the files in the archive are large, multi-threaded '--list' on a
+regular tar.lz archive can be hundreds of times faster than sequential
+'--list' because, in addition to using several processors, it only
+needs to decompress part of each lzip member. See the following example
+listing the Silesia corpus on a dual core machine:
+
+     tarlz -9 -cf silesia.tar.lz silesia
+     time lzip -cd silesia.tar.lz | tar -tf -            (5.032s)
+     time plzip -cd silesia.tar.lz | tar -tf -           (3.256s)
+     time tarlz -tf silesia.tar.lz                       (0.020s)
+
+
+File: tarlz.info,  Node: Examples,  Next: Problems,  Prev: Multi-threaded tar,  Up: Top
 
-5 A small tutorial with examples
+6 A small tutorial with examples
 ********************************
 
 Example 1: Create a multimember compressed archive 'archive.tar.lz'
@@ -633,7 +697,7 @@ Example 8: Copy the contents of directory 'sourcedir' to the directory
 
 File: tarlz.info,  Node: Problems,  Next: Concept index,  Prev: Examples,  Up: Top
 
-6 Reporting bugs
+7 Reporting bugs
 ****************
 
 There are probably bugs in tarlz. There are certainly errors and
@@ -670,16 +734,17 @@ Concept index
 
 Tag Table:
 Node: Top223
-Node: Introduction946
-Node: Invoking tarlz3084
-Node: File format9606
-Ref: key_crc3214138
-Node: Amendments to pax format19215
-Ref: crc3219729
-Ref: flawed-compat20753
-Node: Examples23126
-Node: Problems24802
-Node: Concept index25328
+Node: Introduction1012
+Node: Invoking tarlz3124
+Node: File format10384
+Ref: key_crc3215169
+Node: Amendments to pax format20586
+Ref: crc3221110
+Ref: flawed-compat22135
+Node: Multi-threaded tar24508
+Node: Examples27012
+Node: Problems28682
+Node: Concept index29208
 
 End Tag Table
 
diff --git a/doc/tarlz.texi b/doc/tarlz.texi
index 4c6d16a..d9bdc14 100644
--- a/doc/tarlz.texi
+++ b/doc/tarlz.texi
@@ -6,8 +6,8 @@
 @finalout
 @c %**end of header
 
-@set UPDATED 16 December 2018
-@set VERSION 0.8
+@set UPDATED 22 January 2019
+@set VERSION 0.9
 
 @dircategory Data Compression
 @direntry
@@ -39,13 +39,14 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
 * Invoking tarlz::            Command line interface
 * File format::               Detailed format of the compressed archive
 * Amendments to pax format::  The reasons for the differences with pax
+* Multi-threaded tar::        Limitations of parallel tar decoding
 * Examples::                  A small tutorial with examples
 * Problems::                  Reporting bugs
 * Concept index::             Index of concepts
 @end menu
 
 @sp 1
-Copyright @copyright{} 2013-2018 Antonio Diaz Diaz.
+Copyright @copyright{} 2013-2019 Antonio Diaz Diaz.
 
 This manual is free documentation: you have unlimited permission
 to copy, distribute and modify it.
@@ -55,18 +56,20 @@ to copy, distribute and modify it.
 @chapter Introduction
 @cindex introduction
 
-@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a small and simple
-implementation of the tar archiver. By default tarlz creates, lists and
-extracts archives in a simplified posix pax format compressed with
-@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} on a per file basis. Each
-tar member is compressed in its own lzip member, as well as the end-of-file
-blocks. This method is fully backward compatible with standard tar tools
-like GNU tar, which treat the resulting multimember tar.lz archive like any
-other tar.lz archive. Tarlz can append files to the end of such compressed
-archives.
-
-Tarlz can create tar archives with four levels of compression
-granularity; per file, per directory, appendable solid, and solid.
+@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a combined
+implementation of the tar archiver and the
+@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} compressor. By default
+tarlz creates, lists and extracts archives in a simplified posix pax format
+compressed with lzip on a per file basis. Each tar member is compressed in
+its own lzip member, as well as the end-of-file blocks. This method adds an
+indexed lzip layer on top of the tar archive, making it possible to decode
+the archive safely in parallel. The resulting multimember tar.lz archive is
+fully backward compatible with standard tar tools like GNU tar, which treat
+it like any other tar.lz archive. Tarlz can append files to the end of such
+compressed archives.
+
+Tarlz can create tar archives with four levels of compression granularity;
+per file, per directory, appendable solid, and solid.
 
 @noindent
 Of course, compressing each file (or each directory) individually is
@@ -76,7 +79,7 @@ following advantages:
 @itemize @bullet
 @item
 The resulting multimember tar.lz archive can be decompressed in
-parallel with plzip, multiplying the decompression speed.
+parallel, multiplying the decompression speed.
 
 @item
 New members can be appended to the archive (by removing the EOF
@@ -102,9 +105,6 @@ standard tar tools. @xref{crc32}.
 Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu},
 @samp{star} or @samp{v7}.
 
-Tarlz is intended as a showcase project for the maintainers of real tar
-programs to evaluate the format and perhaps implement it in their tools.
-
 
 @node Invoking tarlz
 @chapter Invoking tarlz
@@ -174,6 +174,20 @@ previous @code{-C} option.
 Use archive file @var{archive}. @samp{-} used as an @var{archive}
 argument reads from standard input or writes to standard output.
 
+@item -n @var{n}
+@itemx --threads=@var{n}
+Set the number of decompression threads, overriding the system's default.
+Valid values range from 0 to "as many as your system can support". A value
+of 0 disables threads entirely. If this option is not used, tarlz tries to
+detect the number of processors in the system and use it as the default value.
+@w{@samp{tarlz --help}} shows the system's default value. This option
+currently only has effect when listing the contents of a multimember
+compressed archive. @xref{Multi-threaded tar}.
+
+Note that the number of usable threads is limited during decompression to
+the number of lzip members in the tar.lz archive, which you can find by
+running @w{@code{lzip -lv archive.tar.lz}}.
+
 @item -q
 @itemx --quiet
 Quiet operation. Suppress all messages.
@@ -335,6 +349,11 @@ associated fields in this header block for this file.
 Zero or more blocks that contain the contents of the file.
 @end itemize
 
+Each tar member must be contiguously stored in an lzip member for the
+parallel decoding operations like @code{--list} to work. If any tar member
+is split over two or more lzip members, the archive must be decoded
+sequentially. @xref{Multi-threaded tar}.
+
 At the end of the archive file there are two 512-byte blocks filled with
 binary zeros, interpreted as an end-of-archive indicator. These EOF
 blocks are either compressed in a separate lzip member or compressed
@@ -481,20 +500,12 @@ is used to store the linkname.
 The mode field provides 12 access permission bits. The following table
 shows the symbolic name of each bit and its octal value:
 
-@multitable {Bit Name} {Bit value}
-@item Bit Name @tab Bit value
-@item S_ISUID @tab 04000
-@item S_ISGID @tab 02000
-@item S_ISVTX @tab 01000
-@item S_IRUSR @tab 00400
-@item S_IWUSR @tab 00200
-@item S_IXUSR @tab 00100
-@item S_IRGRP @tab 00040
-@item S_IWGRP @tab 00020
-@item S_IXGRP @tab 00010
-@item S_IROTH @tab 00004
-@item S_IWOTH @tab 00002
-@item S_IXOTH @tab 00001
+@multitable {Bit Name} {Value} {Bit Name} {Value} {Bit Name} {Value}
+@headitem Bit Name @tab Value @tab Bit Name @tab Value @tab Bit Name @tab Value
+@item S_ISUID @tab 04000 @tab S_ISGID @tab 02000 @tab S_ISVTX @tab 01000
+@item S_IRUSR @tab 00400 @tab S_IWUSR @tab 00200 @tab S_IXUSR @tab 00100
+@item S_IRGRP @tab 00040 @tab S_IWGRP @tab 00020 @tab S_IXGRP @tab 00010
+@item S_IROTH @tab 00004 @tab S_IWOTH @tab 00002 @tab S_IXOTH @tab 00001
 @end multitable
 
 The uid and gid fields are the user and group ID of the owner and group
@@ -551,10 +562,13 @@ regular file (type 0).
 @end table
 
 The magic field contains the ASCII null-terminated string "ustar". The
-version field contains the characters "00" (0x30,0x30). The fields
-uname, and gname are null-terminated character strings. Each numeric
-field contains a leading zero-filled, null-terminated octal number using
-digits from the ISO/IEC 646:1991 (ASCII) standard.
+version field contains the characters "00" (0x30,0x30). The fields uname
+and gname are null-terminated character strings, except when every byte of
+the array, including the last one, is non-null. Each numeric field contains
+a leading space- or zero-filled, optionally null-terminated octal number
+using digits from the ISO/IEC 646:1991 (ASCII) standard. By not requiring a
+terminating null character, tarlz is able to decode numeric fields 1 byte
+larger than standard ustar.
 
 
 @node Amendments to pax format
@@ -574,7 +588,7 @@ concrete reasons to implement them.
 The posix pax format has a serious flaw. The metadata stored in pax extended
 records are not protected by any kind of check sequence. Corruption in a
 long filename may cause the extraction of the file in the wrong place
-without warning. Corruption in a long file size may cause the truncation of
+without warning. Corruption in a large file size may cause the truncation of
 the file or the appending of garbage to the file, both followed by a
 spurious warning about a corrupt header far from the place of the undetected
 corruption.
@@ -636,6 +650,52 @@ double UTF-8 conversions. If the need arises this behavior will be adjusted
 with a command line option in the future.
 
 
+@node Multi-threaded tar
+@chapter Limitations of parallel tar decoding
+
+Safely decoding an arbitrary tar archive in parallel is impossible. For
+example, if a tar archive containing another tar archive is decoded starting
+from some position other than the beginning, there is no way to know if the
+first header found there belongs to the outer tar archive or to the inner
+tar archive. Tar is an inherently serial format; it was designed for tapes.
+
+In the case of compressed tar archives, the start of each compressed block
+determines one point through which the tar archive can be decoded in
+parallel. Therefore, in tar.lz archives the decoding operations can't be
+parallelized if the tar members are not aligned with the lzip members. Tar
+archives compressed with plzip can't be decoded in parallel because tar and
+plzip do not have a way to align both sets of members. Certainly one can
+decompress one such archive with a multi-threaded tool like plzip, but the
+increase in speed is not as large as it could be because plzip must
+serialize the decompressed data and pass them to tar, which decodes them
+sequentially, one tar member at a time.
+
+On the other hand, if the tar.lz archive is created with a tool like tarlz,
+which can guarantee the alignment between tar members and lzip members
+because it controls both archiving and compression, then the lzip format
+becomes an indexed layer on top of the tar archive, which makes it possible
+to decode the archive safely in parallel.
+
+Tarlz is able to automatically decode aligned and unaligned multimember
+tar.lz archives, keeping backwards compatibility. If tarlz finds a member
+misalignment during multi-threaded decoding, it switches to single-threaded
+mode and continues decoding the archive. Currently only the @code{--list}
+option is able to do multi-threaded decoding.
+
+If the files in the archive are large, multi-threaded @code{--list} on a
+regular tar.lz archive can be hundreds of times faster than sequential
+@code{--list} because, in addition to using several processors, it only
+needs to decompress part of each lzip member. See the following example
+listing the Silesia corpus on a dual core machine:
+
+@example
+tarlz -9 -cf silesia.tar.lz silesia
+time lzip -cd silesia.tar.lz | tar -tf -            (5.032s)
+time plzip -cd silesia.tar.lz | tar -tf -           (3.256s)
+time tarlz -tf silesia.tar.lz                       (0.020s)
+@end example
+
+
 @node Examples
 @chapter A small tutorial with examples
 @cindex examples