Diffstat (limited to 'doc/tarlz.texi')
-rw-r--r-- | doc/tarlz.texi | 522 |
1 files changed, 458 insertions, 64 deletions
diff --git a/doc/tarlz.texi b/doc/tarlz.texi index db51dd6..4c6d16a 100644 --- a/doc/tarlz.texi +++ b/doc/tarlz.texi @@ -6,8 +6,8 @@ @finalout @c %**end of header -@set UPDATED 23 April 2018 -@set VERSION 0.4 +@set UPDATED 16 December 2018 +@set VERSION 0.8 @dircategory Data Compression @direntry @@ -35,11 +35,13 @@ This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}). @menu -* Introduction:: Purpose and features of tarlz -* Invoking tarlz:: Command line interface -* Examples:: A small tutorial with examples -* Problems:: Reporting bugs -* Concept index:: Index of concepts +* Introduction:: Purpose and features of tarlz +* Invoking tarlz:: Command line interface +* File format:: Detailed format of the compressed archive +* Amendments to pax format:: The reasons for the differences with pax +* Examples:: A small tutorial with examples +* Problems:: Reporting bugs +* Concept index:: Index of concepts @end menu @sp 1 @@ -53,43 +55,19 @@ to copy, distribute and modify it. @chapter Introduction @cindex introduction -Tarlz is a small and simple implementation of the tar archiver. By -default tarlz creates, lists and extracts archives in the 'ustar' format -compressed with lzip on a per file basis. Tarlz can append files to the -end of such compressed archives. - -Each tar member is compressed in its own lzip member, as well as the -end-of-file blocks. This same method works for any tar format (gnu, -ustar, posix) and is fully backward compatible with standard tar tools -like GNU tar, which treat the resulting multimember tar.lz archive like -any other tar.lz archive. +@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a small and simple +implementation of the tar archiver. By default tarlz creates, lists and +extracts archives in a simplified posix pax format compressed with +@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} on a per file basis. Each +tar member is compressed in its own lzip member, as well as the end-of-file +blocks. 
This method is fully backward compatible with standard tar tools +like GNU tar, which treat the resulting multimember tar.lz archive like any +other tar.lz archive. Tarlz can append files to the end of such compressed +archives. Tarlz can create tar archives with four levels of compression granularity; per file, per directory, appendable solid, and solid. -Tarlz is intended as a showcase project for the maintainers of real tar -programs to evaluate the format and perhaps implement it in their tools. - -The diagram below shows the correspondence between tar members (formed -by a header plus optional data) in the tar archive and -@uref{http://www.nongnu.org/lzip/manual/lzip_manual.html#File-format,,lzip members} -in the resulting multimember tar.lz archive: -@ifnothtml -@xref{File format,,,lzip}. -@end ifnothtml - -@verbatim -tar -+========+======+========+======+========+======+========+ -| header | data | header | data | header | data | eof | -+========+======+========+======+========+======+========+ - -tar.lz -+===============+===============+===============+========+ -| member | member | member | member | -+===============+===============+===============+========+ -@end verbatim - @noindent Of course, compressing each file (or each directory) individually is less efficient than compressing the whole tar archive, but it has the @@ -101,15 +79,16 @@ The resulting multimember tar.lz archive can be decompressed in parallel with plzip, multiplying the decompression speed. @item -New members can be appended to the archive (by removing the eof +New members can be appended to the archive (by removing the EOF member) just like to an uncompressed tar archive. @item It is a safe posix-style backup format. In case of corruption, tarlz can extract all the undamaged members from the tar.lz archive, skipping over the damaged members, just like the standard -(uncompressed) tar. Moreover, lziprecover can be used to recover at -least part of the contents of the damaged members. 
+(uncompressed) tar. Moreover, the option @code{--keep-damaged} can be +used to recover as much data as possible from each damaged member, +and lziprecover can be used to recover some of the damaged members. @item A multimember tar.lz archive is usually smaller than the @@ -117,6 +96,15 @@ corresponding solidly compressed tar.gz archive, except when individually compressing files smaller than about 32 KiB. @end itemize +Tarlz protects the extended records with a CRC in a way compatible with +standard tar tools. @xref{crc32}. + +Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu}, +@samp{star} or @samp{v7}. + +Tarlz is intended as a showcase project for the maintainers of real tar +programs to evaluate the format and perhaps implement it in their tools. + @node Invoking tarlz @chapter Invoking tarlz @@ -133,9 +121,16 @@ tarlz [@var{options}] [@var{files}] @noindent On archive creation or appending, tarlz removes leading and trailing -slashes from file names, as well as file name prefixes containing a +slashes from filenames, as well as filename prefixes containing a @samp{..} component. On extraction, archive members containing a -@samp{..} component are skipped. +@samp{..} component are skipped. Tarlz detects when the archive being +created or enlarged is among the files to be dumped, appended or +concatenated, and skips it. + +On extraction and listing, tarlz removes leading @samp{./} strings from +member names in the archive or given in the command line, so that +@w{@code{tarlz -xf foo ./bar baz}} extracts members @samp{bar} and +@samp{./baz} from archive @samp{foo}. tarlz supports the following options: @@ -147,10 +142,21 @@ Print an informative help message describing the options and exit. @item -V @itemx --version Print the version number of tarlz on the standard output and exit. +This version number should be included in all bug reports. + +@item -A +@itemx --concatenate +Append tar.lz archives to the end of a tar.lz archive. 
All the archives +involved must be regular (seekable) files compressed as multimember lzip +files, and the two end-of-file blocks plus any zero padding must be +contained in the last lzip member of each archive. The intermediate +end-of-file blocks are removed as each new archive is concatenated. Exit +with status 0 without modifying the archive if no @var{files} have been +specified. Tarlz can't concatenate uncompressed tar archives. @item -c @itemx --create -Create a new archive. +Create a new archive from @var{files}. @item -C @var{dir} @itemx --directory=@var{dir} @@ -174,18 +180,19 @@ Quiet operation. Suppress all messages. @item -r @itemx --append -Append files to the end of an archive. The archive must be a regular -(seekable) file compressed as a multimember lzip file, and the two -end-of-file blocks plus any zero padding must be contained in the last -lzip member of the archive. First this last member is removed, then the -new members are appended, and then a new end-of-file member is appended -to the archive. Exit with status 0 without modifying the archive if no -@var{files} have been specified. tarlz can't append files to an -uncompressed tar archive. +Append files to the end of a tar.lz archive. The archive must be a +regular (seekable) file compressed as a multimember lzip file, and the +two end-of-file blocks plus any zero padding must be contained in the +last lzip member of the archive. First this last member is removed, then +the new members are appended, and then a new end-of-file member is +appended to the archive. Exit with status 0 without modifying the +archive if no @var{files} have been specified. Tarlz can't append files +to an uncompressed tar archive. @item -t @itemx --list -List the contents of an archive. +List the contents of an archive. If @var{files} are given, list only the +given @var{files}. @item -v @itemx --verbose @@ -193,10 +200,13 @@ Verbosely list files processed. @item -x @itemx --extract -Extract files from an archive. 
+Extract files from an archive. If @var{files} are given, extract only +the given @var{files}. Else extract all the files in the archive. @item -0 .. -9 Set the compression level. The default compression level is @samp{-6}. +Like lzip, tarlz also minimizes the dictionary size of the lzip members +it creates, reducing the amount of memory required for decompression. @item --asolid When creating or appending to a compressed archive, use appendable solid @@ -212,23 +222,56 @@ end-of-file blocks are compressed into a separate lzip member. This creates a compressed appendable archive with a separate lzip member for each top-level directory. +@item --no-solid +When creating or appending to a compressed archive, compress each file +separately. The end-of-file blocks are compressed into a separate lzip +member. This creates a compressed appendable archive with a separate +lzip member for each file. This option allows tarlz to revert to default +behavior if, for example, tarlz is invoked through an alias like +@code{tar='tarlz --solid'}. + @item --solid When creating or appending to a compressed archive, use solid compression. The files being added to the archive, along with the end-of-file blocks, are compressed into a single lzip member. The resulting archive is not appendable. No more files can be later appended -to the archive without decompressing it first. +to the archive. -@item --group=@var{group} -When creating or appending, use @var{group} for files added to the -archive. If @var{group} is not a valid group name, it is decoded as a -decimal numeric group ID. +@item --anonymous +Equivalent to @code{--owner=root --group=root}. @item --owner=@var{owner} When creating or appending, use @var{owner} for files added to the archive. If @var{owner} is not a valid user name, it is decoded as a decimal numeric user ID. +@item --group=@var{group} +When creating or appending, use @var{group} for files added to the +archive. 
If @var{group} is not a valid group name, it is decoded as a +decimal numeric group ID. + +@item --keep-damaged +Don't delete partially extracted files. If a decompression error happens +while extracting a file, keep the partial data extracted. Use this +option to recover as much data as possible from each damaged member. + +@item --missing-crc +Exit with error status 2 if the CRC of the extended records is missing. +When this option is used, tarlz detects any corruption in the extended +records (only limited by CRC collisions). But note that a corrupt +@samp{GNU.crc32} keyword, for example @samp{GNU.crc33}, is reported as a +missing CRC instead of as a corrupt record. This misleading +@samp{Missing CRC} message is the consequence of a flaw in the posix pax +format; i.e., the lack of a mandatory check sequence in the extended +records. @xref{crc32}. + +@ignore +@item --permissive +Allow some violations of the archive format, like consecutive extended +headers preceding a ustar header, or several records with the same +keyword appearing in the same block of extended records. +@end ignore + @item --uncompressed With @code{--create}, don't compress the created tar archive. Create an uncompressed tar archive instead. @@ -241,6 +284,358 @@ invalid input file, 3 for an internal consistency error (eg, bug) which caused tarlz to panic. +@node File format +@chapter File format +@cindex file format + +In the diagram below, a box like this: +@verbatim ++---+ +| | <-- the vertical bars might be missing ++---+ +@end verbatim + +represents one byte; a box like this: +@verbatim ++==============+ +| | ++==============+ +@end verbatim + +represents a variable number of bytes or a fixed but large number of +bytes (for example 512). + +@sp 1 +A tar.lz file consists of a series of lzip members (compressed data sets). +The members simply appear one after another in the file, with no +additional information before, between, or after them. 
+ +Each lzip member contains one or more tar members in a simplified posix +pax interchange format; the only pax typeflag value supported by tarlz +(in addition to the typeflag values defined by the ustar format) is +@samp{x}. The pax format is an extension on top of the ustar format that +removes the size limitations of the ustar format. + +Each tar member contains one file archived, and is represented by the +following sequence: + +@itemize @bullet +@item +An optional extended header block with extended header records. This +header block is of the form described in pax header block, with a +typeflag value of @samp{x}. The extended header records are included as +the data for this header block. + +@item +A header block in ustar format that describes the file. Any fields +defined in the preceding optional extended header records override the +associated fields in this header block for this file. + +@item +Zero or more blocks that contain the contents of the file. +@end itemize + +At the end of the archive file there are two 512-byte blocks filled with +binary zeros, interpreted as an end-of-archive indicator. These EOF +blocks are either compressed in a separate lzip member or compressed +along with the tar members contained in the last lzip member. + +The diagram below shows the correspondence between each tar member +(formed by one or two headers plus optional data) in the tar archive and +each +@uref{http://www.nongnu.org/lzip/manual/lzip_manual.html#File-format,,lzip member} +in the resulting multimember tar.lz archive: +@ifnothtml +@xref{File format,,,lzip}. 
+@end ifnothtml + +@verbatim +tar ++========+======+=================+===============+========+======+========+ +| header | data | extended header | extended data | header | data | EOF | ++========+======+=================+===============+========+======+========+ + +tar.lz ++===============+=================================================+========+ +| member | member | member | ++===============+=================================================+========+ +@end verbatim + +@ignore +When @code{--permissive} is used, the following violations of the +archive format are allowed:@* +If several extended headers precede an ustar header, only the last +extended header takes effect. The other extended headers are ignored. +Similarly, if several records with the same keyword appear in the same +block of extended records, only the last record for the repeated keyword +takes effect. The other records for the repeated keyword are ignored. +@end ignore + +@sp 1 +@section Pax header block + +The pax header block is identical to the ustar header block described below +except that the typeflag has the value @samp{x} (extended). The size field +is the size of the extended header data in bytes. Most other fields in the +pax header block are zeroed on archive creation to prevent trouble if the +archive is read by an ustar tool, and are ignored by tarlz on archive +extraction. @xref{flawed-compat}. + +The pax extended header data consists of one or more records, each of +them constructed as follows:@* +@code{"%d %s=%s\n", <length>, <keyword>, <value>} + +The <length>, <blank>, <keyword>, <equals-sign>, and <newline> in the +record must be limited to the portable character set. The <length> field +contains the decimal length of the record in bytes, including the +trailing <newline>. The <value> field is stored as-is, without +conversion to UTF-8 nor any other transformation. 
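A minimal sketch of the record layout described above, assuming only what the text states: the record is "%d %s=%s\n", the decimal length counts the whole record including its own digits and the trailing newline, and the value is stored as raw bytes. The function name `pax_record` is hypothetical and is not part of tarlz.

```python
def pax_record(keyword: str, value: bytes) -> bytes:
    # Everything after the length field: blank, keyword, '=', value, newline.
    body = b" " + keyword.encode("ascii") + b"=" + value + b"\n"
    # The length field counts its own digits, so adding them can push the
    # total across a power of ten; repeat until the number is self-consistent.
    length = len(body)
    while len(str(length)) + len(body) != length:
        length = len(str(length)) + len(body)
    return str(length).encode("ascii") + body
```

With the keyword `GNU.crc32` and the 8-byte value `00000000`, this yields the 22-byte record quoted in the manual text.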
+These are the <keyword> fields currently supported by tarlz: + +@table @code +@item linkpath +The pathname of a link being created to another file, of any type, +previously archived. This record overrides the linkname field in the +following ustar header block. The following ustar header block +determines the type of link created. If typeflag of the following header +block is 1, it will be a hard link. If typeflag is 2, it will be a +symbolic link and the linkpath value will be used as the contents of the +symbolic link. + +@item path +The pathname of the following file. This record overrides the name and +prefix fields in the following ustar header block. + +@item size +The size of the file in bytes, expressed as a decimal number using +digits from the ISO/IEC 646:1991 (ASCII) standard. This record overrides +the size field in the following ustar header block. The size record is +used only for files with a size value greater than 8_589_934_591 +@w{(octal 77777777777)}. This is 2^33 bytes or larger. + +@anchor{key_crc32} +@item GNU.crc32 +CRC32-C (Castagnoli) of the extended header data excluding the 8 bytes +representing the CRC <value> itself. The <value> is represented as 8 +hexadecimal digits in big endian order, +@w{@samp{22 GNU.crc32=00000000\n}}. The keyword of the CRC record is +protected by the CRC to guarantee that corruption is always detected +(except in case of CRC collision). A CRC was chosen because a checksum +is too weak for a potentially large list of variable sized records. A +checksum can't detect simple errors like the swapping of two bytes. +@end table + +@sp 1 +@section Ustar header block + +The ustar header block has a length of 512 bytes and is structured as +shown in the following table. All lengths and offsets are in decimal. 
+ +@multitable {Field Name} {Offset} {Length (in bytes)} +@item Field Name @tab Offset @tab Length (in bytes) +@item name @tab 0 @tab 100 +@item mode @tab 100 @tab 8 +@item uid @tab 108 @tab 8 +@item gid @tab 116 @tab 8 +@item size @tab 124 @tab 12 +@item mtime @tab 136 @tab 12 +@item chksum @tab 148 @tab 8 +@item typeflag @tab 156 @tab 1 +@item linkname @tab 157 @tab 100 +@item magic @tab 257 @tab 6 +@item version @tab 263 @tab 2 +@item uname @tab 265 @tab 32 +@item gname @tab 297 @tab 32 +@item devmajor @tab 329 @tab 8 +@item devminor @tab 337 @tab 8 +@item prefix @tab 345 @tab 155 +@end multitable + +All characters in the header block are coded using the ISO/IEC 646:1991 +(ASCII) standard, except in fields storing names for files, users, and +groups. For maximum portability between implementations, names should +only contain characters from the portable filename character set. But if +an implementation supports the use of characters outside of @samp{/} and +the portable filename character set in names for files, users, and +groups, tarlz will use the byte values in these names unmodified. + +The fields name, linkname, and prefix are null-terminated character +strings except when all characters in the array contain non-null +characters including the last character. + +The name and the prefix fields produce the pathname of the file. A new +pathname is formed, if prefix is not an empty string (its first +character is not null), by concatenating prefix (up to the first null +character), a <slash> character, and name; otherwise, name is used +alone. In either case, name is terminated at the first null character. +If prefix begins with a null character, it is ignored. In this manner, +pathnames of at most 256 characters can be supported. If a pathname does +not fit in the space provided, an extended record is used to store the +pathname. + +The linkname field does not use the prefix to produce a pathname. 
If the +linkname does not fit in the 100 characters provided, an extended record +is used to store the linkname. + +The mode field provides 12 access permission bits. The following table +shows the symbolic name of each bit and its octal value: + +@multitable {Bit Name} {Bit value} +@item Bit Name @tab Bit value +@item S_ISUID @tab 04000 +@item S_ISGID @tab 02000 +@item S_ISVTX @tab 01000 +@item S_IRUSR @tab 00400 +@item S_IWUSR @tab 00200 +@item S_IXUSR @tab 00100 +@item S_IRGRP @tab 00040 +@item S_IWGRP @tab 00020 +@item S_IXGRP @tab 00010 +@item S_IROTH @tab 00004 +@item S_IWOTH @tab 00002 +@item S_IXOTH @tab 00001 +@end multitable + +The uid and gid fields are the user and group ID of the owner and group +of the file, respectively. + +The size field contains the octal representation of the size of the file +in bytes. If the typeflag field specifies a file of type '0' (regular +file) or '7' (high performance regular file), the number of logical +records following the header is @w{(size / 512)} rounded to the next +integer. For all other values of typeflag, tarlz either sets the size +field to 0 or ignores it, and does not store or expect any logical +records following the header. If the file size is larger than +8_589_934_591 bytes @w{(octal 77777777777)}, an extended record is used +to store the file size. + +The mtime field contains the octal representation of the modification +time of the file at the time it was archived, obtained from the stat() +function. + +The chksum field contains the octal representation of the value of the +simple sum of all bytes in the header logical record. Each byte in the +header is treated as an unsigned value. When calculating the checksum, +the chksum field is treated as if it were all <space> characters. + +The typeflag field contains a single character specifying the type of +file archived: + +@table @code +@item '0' +Regular file. + +@item '1' +Hard link to another file, of any type, previously archived. 
+ +@item '2' +Symbolic link. + +@item '3', '4' +Character special file and block special file respectively. In this case +the devmajor and devminor fields contain information defining the +device in unspecified format. + +@item '5' +Directory. + +@item '6' +FIFO special file. + +@item '7' +Reserved to represent a file to which an implementation has associated +some high-performance attribute. Tarlz treats this type of file as a +regular file (type 0). + +@end table + +The magic field contains the ASCII null-terminated string "ustar". The +version field contains the characters "00" (0x30,0x30). The fields +uname, and gname are null-terminated character strings. Each numeric +field contains a leading zero-filled, null-terminated octal number using +digits from the ISO/IEC 646:1991 (ASCII) standard. + + +@node Amendments to pax format +@chapter The reasons for the differences with pax +@cindex Amendments to pax format + +Tarlz is meant to reliably detect invalid or corrupt metadata during +extraction and to not create safety risks in the archives it creates. In +order to achieve these goals, tarlz makes some changes to the variant of the +pax format that it uses. This chapter describes these changes and the +concrete reasons to implement them. + +@sp 1 +@anchor{crc32} +@section Add a CRC of the extended records + +The posix pax format has a serious flaw. The metadata stored in pax extended +records are not protected by any kind of check sequence. Corruption in a +long filename may cause the extraction of the file in the wrong place +without warning. Corruption in a long file size may cause the truncation of +the file or the appending of garbage to the file, both followed by a +spurious warning about a corrupt header far from the place of the undetected +corruption. 
+Metadata like filename and file size must always be protected in an archive +format because of the adverse effects of undetected corruption in them, +potentially much worse than undetected corruption in the data. Even more so +in the case of pax because the amount of metadata it stores is potentially +large, making undetected corruption more probable. + +Because of the above, tarlz protects the extended records with a CRC in +a way compatible with standard tar tools. @xref{key_crc32}. + +@sp 1 +@anchor{flawed-compat} +@section Remove flawed backward compatibility + +In order to allow the extraction of pax archives by a tar utility conforming +to the POSIX-2:1993 standard, POSIX.1-2008 recommends selecting extended +header field values that allow such tar to create a regular file containing +the extended header records as data. This approach is broken because if the +extended header is needed because of a long filename, the name and prefix +fields will be unable to contain the full pathname of the file. Therefore +the files corresponding to both the extended header and the overridden ustar +header will be extracted using truncated filenames, perhaps overwriting +existing files or directories. It may be a security risk to extract a file +with a truncated filename. + +To avoid this problem, tarlz writes extended headers with all fields zeroed +except size, chksum, typeflag, magic and version. This prevents old tar +programs from extracting the extended records as a file in the wrong place. +Tarlz also sets to zero those fields of the ustar header overridden by +extended records. + +If the extended header is needed because of a file size larger than +@w{8 GiB}, the size field will be unable to contain the full size of the +file. Therefore the file may be partially extracted, and the tool will issue +a spurious warning about a corrupt header at the point where it thinks the +file ends. 
Setting to zero the overridden size in the ustar header at least +prevents the partial extraction and makes it obvious that the file has been +truncated. + +@sp 1 +@section As simple as possible (but not simpler) + +The tarlz format is mainly ustar. Extended pax headers are used only when +needed because the length of a filename or link name, or the size of a file +exceeds the limits of the ustar format. Adding extended headers to each +member just to record subsecond timestamps seems wasteful for a backup +format. + +@sp 1 +@section Avoid misconversions to/from UTF-8 + +There is no portable way to tell what charset a text string is coded into. +Therefore, tarlz stores all fields representing text strings as-is, without +conversion to UTF-8 nor any other transformation. This prevents accidental +double UTF-8 conversions. If the need arises this behavior will be adjusted +with a command line option in the future. + + @node Examples @chapter A small tutorial with examples @cindex examples @@ -280,7 +675,7 @@ Example 4: Create a compressed appendable archive containing directories directory. Then append files @samp{a}, @samp{b}, @samp{c}, @samp{d} and @samp{e} to the archive, all of them contained in a single lzip member. The resulting archive @samp{archive.tar.lz} contains 5 lzip members -(including the eof member). +(including the EOF member). @example tarlz --dsolid -cf archive.tar.lz dir1 dir2 dir3 @@ -291,8 +686,7 @@ tarlz --asolid -rf archive.tar.lz a b c d e @noindent Example 5: Create a solidly compressed archive @samp{archive.tar.lz} containing files @samp{a}, @samp{b} and @samp{c}. Note that no more -files can be later appended to the archive without decompressing it -first. +files can be later appended to the archive. @example tarlz --solid -cf archive.tar.lz a b c
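As a supplement to the "Ustar header block" section added by this patch, here is a rough sketch of two rules it states: the chksum value is the simple sum of all 512 header bytes taken as unsigned values, with the chksum field itself counted as if it held spaces, and a regular file of `size` bytes is followed by (size / 512) logical records rounded up. The names `ustar_checksum` and `data_blocks` are hypothetical; this is not tarlz code, and it does not show how the octal digits are laid out inside the field.

```python
BLOCK_SIZE = 512
CHKSUM_OFFSET = 148  # offset of the chksum field in the header table
CHKSUM_LENGTH = 8    # length of the chksum field in bytes


def ustar_checksum(header: bytes) -> int:
    """Sum all header bytes, counting the chksum field as 8 spaces."""
    if len(header) != BLOCK_SIZE:
        raise ValueError("a ustar header block is exactly 512 bytes")
    end = CHKSUM_OFFSET + CHKSUM_LENGTH
    return (sum(header[:CHKSUM_OFFSET])
            + CHKSUM_LENGTH * ord(" ")
            + sum(header[end:]))


def data_blocks(size: int) -> int:
    """Number of 512-byte records following the header of a regular file."""
    return (size + BLOCK_SIZE - 1) // BLOCK_SIZE
```

Because the field is counted as spaces, the result does not depend on whatever bytes are currently stored at offsets 148 to 155, so the same function can serve both when writing a header and when verifying one.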