1 files changed, 458 insertions, 64 deletions
diff --git a/doc/tarlz.texi b/doc/tarlz.texi
index db51dd6..4c6d16a 100644
--- a/doc/tarlz.texi
+++ b/doc/tarlz.texi
@@ -6,8 +6,8 @@
 @finalout
 @c %**end of header
 
-@set UPDATED 23 April 2018
-@set VERSION 0.4
+@set UPDATED 16 December 2018
+@set VERSION 0.8
 
 @dircategory Data Compression
 @direntry
@@ -35,11 +35,13 @@
 This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}).
 
 @menu
-* Introduction::           Purpose and features of tarlz
-* Invoking tarlz::         Command line interface
-* Examples::               A small tutorial with examples
-* Problems::               Reporting bugs
-* Concept index::          Index of concepts
+* Introduction::              Purpose and features of tarlz
+* Invoking tarlz::            Command line interface
+* File format::               Detailed format of the compressed archive
+* Amendments to pax format::  The reasons for the differences with pax
+* Examples::                  A small tutorial with examples
+* Problems::                  Reporting bugs
+* Concept index::             Index of concepts
 @end menu
 
 @sp 1
@@ -53,43 +55,19 @@ to copy, distribute and modify it.
 @chapter Introduction
 @cindex introduction
 
-Tarlz is a small and simple implementation of the tar archiver. By
-default tarlz creates, lists and extracts archives in the 'ustar' format
-compressed with lzip on a per file basis. Tarlz can append files to the
-end of such compressed archives.
-
-Each tar member is compressed in its own lzip member, as well as the
-end-of-file blocks. This same method works for any tar format (gnu,
-ustar, posix) and is fully backward compatible with standard tar tools
-like GNU tar, which treat the resulting multimember tar.lz archive like
-any other tar.lz archive.
+@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a small and simple
+implementation of the tar archiver. By default tarlz creates, lists and
+extracts archives in a simplified posix pax format compressed with
+@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} on a per file basis. Each
+tar member is compressed in its own lzip member, as well as the end-of-file
+blocks. This method is fully backward compatible with standard tar tools
+like GNU tar, which treat the resulting multimember tar.lz archive like any
+other tar.lz archive. Tarlz can append files to the end of such compressed
+archives.
 
 Tarlz can create tar archives with four levels of compression
 granularity; per file, per directory, appendable solid, and solid.
 
-Tarlz is intended as a showcase project for the maintainers of real tar
-programs to evaluate the format and perhaps implement it in their tools.
-
-The diagram below shows the correspondence between tar members (formed
-by a header plus optional data) in the tar archive and
-@uref{http://www.nongnu.org/lzip/manual/lzip_manual.html#File-format,,lzip members}
-in the resulting multimember tar.lz archive:
-@ifnothtml
-@xref{File format,,,lzip}.
-@end ifnothtml
-
-@verbatim
-tar
-+========+======+========+======+========+======+========+
-| header | data | header | data | header | data |   eof  |
-+========+======+========+======+========+======+========+
-
-tar.lz
-+===============+===============+===============+========+
-|     member    |     member    |     member    | member |
-+===============+===============+===============+========+
-@end verbatim
-
 @noindent
 Of course, compressing each file (or each directory) individually is
 less efficient than compressing the whole tar archive, but it has the
@@ -101,15 +79,16 @@ The resulting multimember tar.lz archive can be decompressed in
 parallel with plzip, multiplying the decompression speed.
 
 @item
-New members can be appended to the archive (by removing the eof
+New members can be appended to the archive (by removing the EOF
 member) just like to an uncompressed tar archive.
 
 @item
 It is a safe posix-style backup format. In case of corruption,
 tarlz can extract all the undamaged members from the tar.lz
 archive, skipping over the damaged members, just like the standard
-(uncompressed) tar. Moreover, lziprecover can be used to recover at
-least part of the contents of the damaged members.
+(uncompressed) tar. Moreover, the option @code{--keep-damaged} can be
+used to recover as much data as possible from each damaged member,
+and lziprecover can be used to recover some of the damaged members.
 
 @item
 A multimember tar.lz archive is usually smaller than the
@@ -117,6 +96,15 @@ corresponding solidly compressed tar.gz archive, except when
 individually compressing files smaller than about 32 KiB.
 @end itemize
 
+Tarlz protects the extended records with a CRC in a way compatible with
+standard tar tools. @xref{crc32}.
+
+Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu},
+@samp{star} or @samp{v7}.
+
+Tarlz is intended as a showcase project for the maintainers of real tar
+programs to evaluate the format and perhaps implement it in their tools.
+
 
 @node Invoking tarlz
 @chapter Invoking tarlz
@@ -133,9 +121,16 @@ tarlz [@var{options}] [@var{files}]
 
 @noindent
 On archive creation or appending, tarlz removes leading and trailing
-slashes from file names, as well as file name prefixes containing a
+slashes from filenames, as well as filename prefixes containing a
 @samp{..} component. On extraction, archive members containing a
-@samp{..} component are skipped.
+@samp{..} component are skipped. Tarlz detects when the archive being
+created or enlarged is among the files to be dumped, appended or
+concatenated, and skips it.
+
+On extraction and listing, tarlz removes leading @samp{./} strings from
+member names in the archive or given in the command line, so that
+@w{@code{tarlz -xf foo ./bar baz}} extracts members @samp{bar} and
+@samp{./baz} from archive @samp{foo}.
 
 tarlz supports the following options:
 
@@ -147,10 +142,21 @@ Print an informative help message describing the options and exit.
 @item -V
 @itemx --version
 Print the version number of tarlz on the standard output and exit.
+This version number should be included in all bug reports.
+
+@item -A
+@itemx --concatenate
+Append tar.lz archives to the end of a tar.lz archive. All the archives
+involved must be regular (seekable) files compressed as multimember lzip
+files, and the two end-of-file blocks plus any zero padding must be
+contained in the last lzip member of each archive. The intermediate
+end-of-file blocks are removed as each new archive is concatenated. Exit
+with status 0 without modifying the archive if no @var{files} have been
+specified. Tarlz can't concatenate uncompressed tar archives.
 
 @item -c
 @itemx --create
-Create a new archive.
+Create a new archive from @var{files}.
 
 @item -C @var{dir}
 @itemx --directory=@var{dir}
@@ -174,18 +180,19 @@ Quiet operation. Suppress all messages.
 
 @item -r
 @itemx --append
-Append files to the end of an archive. The archive must be a regular
-(seekable) file compressed as a multimember lzip file, and the two
-end-of-file blocks plus any zero padding must be contained in the last
-lzip member of the archive. First this last member is removed, then the
-new members are appended, and then a new end-of-file member is appended
-to the archive. Exit with status 0 without modifying the archive if no
-@var{files} have been specified. tarlz can't append files to an
-uncompressed tar archive.
+Append files to the end of a tar.lz archive. The archive must be a
+regular (seekable) file compressed as a multimember lzip file, and the
+two end-of-file blocks plus any zero padding must be contained in the
+last lzip member of the archive. First this last member is removed, then
+the new members are appended, and then a new end-of-file member is
+appended to the archive. Exit with status 0 without modifying the
+archive if no @var{files} have been specified. Tarlz can't append files
+to an uncompressed tar archive.
 
 @item -t
 @itemx --list
-List the contents of an archive.
+List the contents of an archive. If @var{files} are given, list only the
+given @var{files}.
 
 @item -v
 @itemx --verbose
@@ -193,10 +200,13 @@ Verbosely list files processed.
 
 @item -x
 @itemx --extract
-Extract files from an archive.
+Extract files from an archive. If @var{files} are given, extract only
+the given @var{files}. Otherwise, extract all the files in the archive.
 
 @item -0 .. -9
 Set the compression level. The default compression level is @samp{-6}.
+Like lzip, tarlz also minimizes the dictionary size of the lzip members
+it creates, reducing the amount of memory required for decompression.
 
 @item --asolid
 When creating or appending to a compressed archive, use appendable solid
@@ -212,23 +222,56 @@ end-of-file blocks are compressed into a separate lzip member. This
 creates a compressed appendable archive with a separate lzip member for
 each top-level directory.
 
+@item --no-solid
+When creating or appending to a compressed archive, compress each file
+separately. The end-of-file blocks are compressed into a separate lzip
+member. This creates a compressed appendable archive with a separate
+lzip member for each file. This option lets tarlz revert to its default
+behavior if, for example, tarlz is invoked through an alias like
+@code{tar='tarlz --solid'}.
+
 @item --solid
 When creating or appending to a compressed archive, use solid
 compression. The files being added to the archive, along with the
 end-of-file blocks, are compressed into a single lzip member. The
 resulting archive is not appendable. No more files can be later appended
-to the archive without decompressing it first.
+to the archive.
 
-@item --group=@var{group}
-When creating or appending, use @var{group} for files added to the
-archive. If @var{group} is not a valid group name, it is decoded as a
-decimal numeric group ID.
+@item --anonymous
+Equivalent to @code{--owner=root --group=root}.
 
 @item --owner=@var{owner}
 When creating or appending, use @var{owner} for files added to the
 archive. If @var{owner} is not a valid user name, it is decoded as a
 decimal numeric user ID.
 
+@item --group=@var{group}
+When creating or appending, use @var{group} for files added to the
+archive. If @var{group} is not a valid group name, it is decoded as a
+decimal numeric group ID.
+
+@item --keep-damaged
+Don't delete partially extracted files. If a decompression error happens
+while extracting a file, keep the partial data extracted. Use this
+option to recover as much data as possible from each damaged member.
+
+@item --missing-crc
+Exit with error status 2 if the CRC of the extended records is missing.
+When this option is used, tarlz detects any corruption in the extended
+records (only limited by CRC collisions). But note that a corrupt
+@samp{GNU.crc32} keyword, for example @samp{GNU.crc33}, is reported as a
+missing CRC instead of as a corrupt record. This misleading
+@samp{Missing CRC} message is the consequence of a flaw in the posix pax
+format; i.e., the lack of a mandatory check sequence in the extended
+records. @xref{crc32}.
+
+@ignore
+@item --permissive
+Allow some violations of the archive format, like consecutive extended
+headers preceding a ustar header, or several records with the same
+keyword appearing in the same block of extended records.
+@end ignore
+
 @item --uncompressed
 With @code{--create}, don't compress the created tar archive. Create an
 uncompressed tar archive instead.
@@ -241,6 +284,358 @@ invalid input file, 3 for an internal consistency error (eg, bug) which
 caused tarlz to panic.
 
 
+@node File format
+@chapter File format
+@cindex file format
+
+In the diagram below, a box like this:
+@verbatim
++---+
+|   | <-- the vertical bars might be missing
++---+
+@end verbatim
+
+represents one byte; a box like this:
+@verbatim
++==============+
+|              |
++==============+
+@end verbatim
+
+represents a variable number of bytes or a fixed but large number of
+bytes (for example 512).
+
+@sp 1
+A tar.lz file consists of a series of lzip members (compressed data sets).
+The members simply appear one after another in the file, with no
+additional information before, between, or after them.
+
+Each lzip member contains one or more tar members in a simplified posix
+pax interchange format; the only pax typeflag value supported by tarlz
+(in addition to the typeflag values defined by the ustar format) is
+@samp{x}. The pax format is an extension on top of the ustar format that
+removes the size limitations of the ustar format.
+
+Each tar member contains one file archived, and is represented by the
+following sequence:
+
+@itemize @bullet
+@item
+An optional extended header block with extended header records. This
+header block is of the form described in pax header block, with a
+typeflag value of @samp{x}. The extended header records are included as
+the data for this header block.
+
+@item
+A header block in ustar format that describes the file. Any fields
+defined in the preceding optional extended header records override the
+associated fields in this header block for this file.
+
+@item
+Zero or more blocks that contain the contents of the file.
+@end itemize
+
+At the end of the archive file there are two 512-byte blocks filled with
+binary zeros, interpreted as an end-of-archive indicator. These EOF
+blocks are either compressed in a separate lzip member or compressed
+along with the tar members contained in the last lzip member.
+
+The diagram below shows the correspondence between each tar member
+(formed by one or two headers plus optional data) in the tar archive and
+each
+@uref{http://www.nongnu.org/lzip/manual/lzip_manual.html#File-format,,lzip member}
+in the resulting multimember tar.lz archive:
+@ifnothtml
+@xref{File format,,,lzip}.
+@end ifnothtml
+
+@verbatim
+tar
++========+======+=================+===============+========+======+========+
+| header | data | extended header | extended data | header | data |   EOF  |
++========+======+=================+===============+========+======+========+
+
+tar.lz
++===============+=================================================+========+
+|     member    |                      member                     | member |
++===============+=================================================+========+
+@end verbatim
+
+@ignore
+When @code{--permissive} is used, the following violations of the
+archive format are allowed:@*
+If several extended headers precede a ustar header, only the last
+extended header takes effect. The other extended headers are ignored.
+Similarly, if several records with the same keyword appear in the same
+block of extended records, only the last record for the repeated keyword
+takes effect. The other records for the repeated keyword are ignored.
+@end ignore
+
+@sp 1
+@section Pax header block
+
+The pax header block is identical to the ustar header block described below
+except that the typeflag has the value @samp{x} (extended). The size field
+is the size of the extended header data in bytes. Most other fields in the
+pax header block are zeroed on archive creation to prevent trouble if the
+archive is read by a ustar tool, and are ignored by tarlz on archive
+extraction. @xref{flawed-compat}.
+
+The pax extended header data consists of one or more records, each of
+them constructed as follows:@*
+@code{"%d %s=%s\n", <length>, <keyword>, <value>}
+
+The <length>, <blank>, <keyword>, <equals-sign>, and <newline> in the
+record must be limited to the portable character set. The <length> field
+contains the decimal length of the record in bytes, including the
+trailing <newline>. The <value> field is stored as-is, without
+conversion to UTF-8 or any other transformation.
+
+These are the <keyword> fields currently supported by tarlz:
+
+@table @code
+@item linkpath
+The pathname of a link being created to another file, of any type,
+previously archived. This record overrides the linkname field in the
+following ustar header block. The following ustar header block
+determines the type of link created. If typeflag of the following header
+block is 1, it will be a hard link. If typeflag is 2, it will be a
+symbolic link and the linkpath value will be used as the contents of the
+symbolic link.
+
+@item path
+The pathname of the following file. This record overrides the name and
+prefix fields in the following ustar header block.
+
+@item size
+The size of the file in bytes, expressed as a decimal number using
+digits from the ISO/IEC 646:1991 (ASCII) standard. This record overrides
+the size field in the following ustar header block. The size record is
+used only for files with a size value greater than 8_589_934_591
+@w{(octal 77777777777)}; that is, for files of 2^33 bytes or larger.
+
+@anchor{key_crc32}
+@item GNU.crc32
+CRC32-C (Castagnoli) of the extended header data excluding the 8 bytes
+representing the CRC <value> itself. The <value> is represented as 8
+hexadecimal digits in big endian order, as in
+@w{@samp{22 GNU.crc32=00000000\n}}. The keyword of the CRC record is
+protected by the CRC to guarantee that corruption is always detected
+(except in case of CRC collision). A CRC was chosen because a checksum
+is too weak for a potentially large list of variable sized records. A
+checksum can't detect simple errors like the swapping of two bytes.
+@end table
+
+@sp 1
+@section Ustar header block
+
+The ustar header block has a length of 512 bytes and is structured as
+shown in the following table. All lengths and offsets are in decimal.
+
+@multitable {Field Name} {Offset} {Length (in bytes)}
+@item Field Name @tab Offset @tab Length (in bytes)
+@item name     @tab   0 @tab 100
+@item mode     @tab 100 @tab   8
+@item uid      @tab 108 @tab   8
+@item gid      @tab 116 @tab   8
+@item size     @tab 124 @tab  12
+@item mtime    @tab 136 @tab  12
+@item chksum   @tab 148 @tab   8
+@item typeflag @tab 156 @tab   1
+@item linkname @tab 157 @tab 100
+@item magic    @tab 257 @tab   6
+@item version  @tab 263 @tab   2
+@item uname    @tab 265 @tab  32
+@item gname    @tab 297 @tab  32
+@item devmajor @tab 329 @tab   8
+@item devminor @tab 337 @tab   8
+@item prefix   @tab 345 @tab 155
+@end multitable
+
+All characters in the header block are coded using the ISO/IEC 646:1991
+(ASCII) standard, except in fields storing names for files, users, and
+groups. For maximum portability between implementations, names should
+only contain characters from the portable filename character set. But if
+an implementation supports the use of characters outside of @samp{/} and
+the portable filename character set in names for files, users, and
+groups, tarlz will use the byte values in these names unmodified.
+
+The fields name, linkname, and prefix are null-terminated character
+strings, except when every character in the array, including the last
+one, is non-null.
+
+The name and the prefix fields produce the pathname of the file. A new
+pathname is formed, if prefix is not an empty string (its first
+character is not null), by concatenating prefix (up to the first null
+character), a <slash> character, and name; otherwise, name is used
+alone. In either case, name is terminated at the first null character.
+If prefix begins with a null character, it is ignored. In this manner,
+pathnames of at most 256 characters can be supported. If a pathname does
+not fit in the space provided, an extended record is used to store the
+pathname.
+
+The linkname field does not use the prefix to produce a pathname. If the
+linkname does not fit in the 100 characters provided, an extended record
+is used to store the linkname.
+
+The mode field provides 12 access permission bits. The following table
+shows the symbolic name of each bit and its octal value:
+
+@multitable {Bit Name} {Bit value}
+@item Bit Name @tab Bit value
+@item S_ISUID @tab 04000
+@item S_ISGID @tab 02000
+@item S_ISVTX @tab 01000
+@item S_IRUSR @tab 00400
+@item S_IWUSR @tab 00200
+@item S_IXUSR @tab 00100
+@item S_IRGRP @tab 00040
+@item S_IWGRP @tab 00020
+@item S_IXGRP @tab 00010
+@item S_IROTH @tab 00004
+@item S_IWOTH @tab 00002
+@item S_IXOTH @tab 00001
+@end multitable
+
+The uid and gid fields are the user and group ID of the owner and group
+of the file, respectively.
+
+The size field contains the octal representation of the size of the file
+in bytes. If the typeflag field specifies a file of type '0' (regular
+file) or '7' (high performance regular file), the number of logical
+records following the header is @w{(size / 512)} rounded up to the next
+integer. For all other values of typeflag, tarlz either sets the size
+field to 0 or ignores it, and does not store or expect any logical
+records following the header. If the file size is larger than
+8_589_934_591 bytes @w{(octal 77777777777)}, an extended record is used
+to store the file size.
+
+The mtime field contains the octal representation of the modification
+time of the file at the time it was archived, obtained from the stat()
+function.
+
+The chksum field contains the octal representation of the value of the
+simple sum of all bytes in the header logical record. Each byte in the
+header is treated as an unsigned value. When calculating the checksum,
+the chksum field is treated as if it were all <space> characters.
+
+The typeflag field contains a single character specifying the type of
+file archived:
+
+@table @code
+@item '0'
+Regular file.
+
+@item '1'
+Hard link to another file, of any type, previously archived.
+
+@item '2'
+Symbolic link.
+
+@item '3', '4'
+Character special file and block special file respectively. In this case
+the devmajor and devminor fields contain information defining the
+device in unspecified format.
+
+@item '5'
+Directory.
+
+@item '6'
+FIFO special file.
+
+@item '7'
+Reserved to represent a file to which an implementation has associated
+some high-performance attribute. Tarlz treats this type of file as a
+regular file (type 0).
+
+@end table
+
+The magic field contains the ASCII null-terminated string "ustar". The
+version field contains the characters "00" (0x30,0x30). The fields
+uname and gname are null-terminated character strings. Each numeric
+field contains a leading zero-filled, null-terminated octal number using
+digits from the ISO/IEC 646:1991 (ASCII) standard.
+
+
+@node Amendments to pax format
+@chapter The reasons for the differences with pax
+@cindex Amendments to pax format
+
+Tarlz is meant to reliably detect invalid or corrupt metadata during
+extraction and to not create safety risks in the archives it creates. In
+order to achieve these goals, tarlz makes some changes to the variant of the
+pax format that it uses. This chapter describes these changes and the
+concrete reasons to implement them.
+
+@sp 1
+@anchor{crc32}
+@section Add a CRC of the extended records
+
+The posix pax format has a serious flaw. The metadata stored in pax extended
+records are not protected by any kind of check sequence. Corruption in a
+long filename may cause the extraction of the file in the wrong place
+without warning. Corruption in a long file size may cause the truncation of
+the file or the appending of garbage to the file, both followed by a
+spurious warning about a corrupt header far from the place of the undetected
+corruption.
+
+Metadata like filename and file size must always be protected in an archive
+format because of the adverse effects of undetected corruption in them,
+potentially much worse than undetected corruption in the data. Even more so
+in the case of pax because the amount of metadata it stores is potentially
+large, making undetected corruption more probable.
+
+Because of the above, tarlz protects the extended records with a CRC in
+a way compatible with standard tar tools. @xref{key_crc32}.
+
+@sp 1
+@anchor{flawed-compat}
+@section Remove flawed backward compatibility
+
+In order to allow the extraction of pax archives by a tar utility conforming
+to the POSIX-2:1993 standard, POSIX.1-2008 recommends selecting extended
+header field values that allow such a tar to create a regular file containing
+the extended header records as data. This approach is broken because if the
+extended header is needed because of a long filename, the name and prefix
+fields will be unable to contain the full pathname of the file. Therefore
+the files corresponding to both the extended header and the overridden ustar
+header will be extracted using truncated filenames, perhaps overwriting
+existing files or directories. It may be a security risk to extract a file
+with a truncated filename.
+
+To avoid this problem, tarlz writes extended headers with all fields zeroed
+except size, chksum, typeflag, magic and version. This prevents old tar
+programs from extracting the extended records as a file in the wrong place.
+Tarlz also sets to zero those fields of the ustar header overridden by
+extended records.
+
+If the extended header is needed because of a file size larger than
+@w{8 GiB}, the size field will be unable to contain the full size of the
+file. Therefore the file may be partially extracted, and the tool will issue
+a spurious warning about a corrupt header at the point where it thinks the
+file ends. Setting to zero the overridden size in the ustar header at least
+prevents the partial extraction and makes it obvious that the file has been
+truncated.
+
+@sp 1
+@section As simple as possible (but not simpler)
+
+The tarlz format is mainly ustar. Extended pax headers are used only when
+needed because the length of a filename or link name, or the size of a file
+exceeds the limits of the ustar format. Adding extended headers to each
+member just to record subsecond timestamps seems wasteful for a backup
+format.
+
+@sp 1
+@section Avoid misconversions to/from UTF-8
+
+There is no portable way to tell which character set a text string is encoded
+in. Therefore, tarlz stores all fields representing text strings as-is, without
+conversion to UTF-8 or any other transformation. This prevents accidental
+double UTF-8 conversions. If the need arises, this behavior will be adjusted
+with a command line option in the future.
+
+
 @node Examples
 @chapter A small tutorial with examples
 @cindex examples
@@ -280,7 +675,7 @@ Example 4: Create a compressed appendable archive containing directories
 directory. Then append files @samp{a}, @samp{b}, @samp{c}, @samp{d} and
 @samp{e} to the archive, all of them contained in a single lzip member.
 The resulting archive @samp{archive.tar.lz} contains 5 lzip members
-(including the eof member).
+(including the EOF member).
 
 @example
 tarlz --dsolid -cf archive.tar.lz dir1 dir2 dir3
@@ -291,8 +686,7 @@ tarlz --asolid -rf archive.tar.lz a b c d e
 @noindent
 Example 5: Create a solidly compressed archive @samp{archive.tar.lz}
 containing files @samp{a}, @samp{b} and @samp{c}. Note that no more
-files can be later appended to the archive without decompressing it
-first.
+files can be later appended to the archive.
 
 @example
 tarlz --solid -cf archive.tar.lz a b c
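
Note on the "File format" chapter added by this patch: the arithmetic it describes for pax extended records and ustar header blocks can be sketched in a few lines of Python. This is only an illustration written from the text of the manual, not code taken from tarlz. All function names below are hypothetical, and the sketch assumes that the caller supplies already-encoded byte strings.

```python
# Illustrative helpers only; none of these names come from tarlz itself.

def pax_record(keyword: str, value: bytes) -> bytes:
    """Build one record of the form "<length> <keyword>=<value>\\n".

    <length> is the decimal size of the whole record, including its own
    digits and the trailing newline, so it is found by iterating until
    the value stops changing.
    """
    body = b" " + keyword.encode("ascii") + b"=" + value + b"\n"
    length = len(body)
    while True:
        total = len(str(length)) + len(body)
        if total == length:
            break
        length = total
    return str(length).encode("ascii") + body


def ustar_chksum(header: bytes) -> int:
    """Sum the 512 header bytes as unsigned values, treating the 8-byte
    chksum field at offset 148 as if it were all space characters."""
    if len(header) != 512:
        raise ValueError("a ustar header block is 512 bytes long")
    return sum(header[:148]) + 8 * ord(" ") + sum(header[156:])


def ustar_pathname(name: bytes, prefix: bytes) -> bytes:
    """Join prefix and name with a slash; an empty prefix is ignored.
    Both fields end at their first null byte, if there is one."""
    name = name.split(b"\0", 1)[0]
    prefix = prefix.split(b"\0", 1)[0]
    return prefix + b"/" + name if prefix else name


def data_blocks(size: int) -> int:
    """Number of 512-byte blocks that follow the header of a regular
    file: size / 512 rounded up."""
    return (size + 511) // 512
```

For example, `pax_record("GNU.crc32", b"00000000")` yields the 22-byte record `22 GNU.crc32=00000000\n` quoted in the manual, and `ustar_chksum(bytes(512))` is 256 because only the eight assumed spaces contribute to the sum. The CRC32-C computation itself is left out because the Python standard library does not provide that polynomial.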