diff options
Diffstat (limited to 'doc')
-rw-r--r-- | doc/tarlz.1 | 180 | ||||
-rw-r--r-- | doc/tarlz.info | 1287 | ||||
-rw-r--r-- | doc/tarlz.texi | 1356 |
3 files changed, 2823 insertions, 0 deletions
diff --git a/doc/tarlz.1 b/doc/tarlz.1 new file mode 100644 index 0000000..9d63da5 --- /dev/null +++ b/doc/tarlz.1 @@ -0,0 +1,180 @@ +.\" DO NOT MODIFY THIS FILE! It was generated by help2man 1.49.2. +.TH TARLZ "1" "January 2024" "tarlz 0.25" "User Commands" +.SH NAME +tarlz \- creates tar archives with multimember lzip compression +.SH SYNOPSIS +.B tarlz +\fI\,operation \/\fR[\fI\,options\/\fR] [\fI\,files\/\fR] +.SH DESCRIPTION +Tarlz is a massively parallel (multi\-threaded) combined implementation of +the tar archiver and the lzip compressor. Tarlz uses the compression library +lzlib. +.PP +Tarlz creates tar archives using a simplified and safer variant of the POSIX +pax format compressed in lzip format, keeping the alignment between tar +members and lzip members. The resulting multimember tar.lz archive is +backward compatible with standard tar tools like GNU tar, which treat it +like any other tar.lz archive. Tarlz can append files to the end of such +compressed archives. +.PP +Keeping the alignment between tar members and lzip members has two +advantages. It adds an indexed lzip layer on top of the tar archive, making +it possible to decode the archive safely in parallel. It also minimizes the +amount of data lost in case of corruption. +.PP +The tarlz file format is a safe POSIX\-style backup format. In case of +corruption, tarlz can extract all the undamaged members from the tar.lz +archive, skipping over the damaged members, just like the standard +(uncompressed) tar. Moreover, the option '\-\-keep\-damaged' can be used to +recover as much data as possible from each damaged member, and lziprecover +can be used to recover some of the damaged members. +.SS "Operations:" +.TP +\fB\-\-help\fR +display this help and exit +.TP +\fB\-V\fR, \fB\-\-version\fR +output version information and exit +.TP +\fB\-A\fR, \fB\-\-concatenate\fR +append archives to the end of an archive +.TP +\fB\-c\fR, \fB\-\-create\fR +create a new archive +.TP +\fB\-d\fR, \fB\-\-diff\fR +find differences between archive and file system +.TP +\fB\-\-delete\fR +delete files/directories from an archive +.TP +\fB\-r\fR, \fB\-\-append\fR +append files to the end of an archive +.TP +\fB\-t\fR, \fB\-\-list\fR +list the contents of an archive +.TP +\fB\-x\fR, \fB\-\-extract\fR +extract files/directories from an archive +.TP +\fB\-z\fR, \fB\-\-compress\fR +compress existing POSIX tar archives +.TP +\fB\-\-check\-lib\fR +check version of lzlib and exit +.SH OPTIONS +.TP +\fB\-B\fR, \fB\-\-data\-size=\fR<bytes> +set target size of input data blocks [2x8=16 MiB] +.TP +\fB\-C\fR, \fB\-\-directory=\fR<dir> +change to directory <dir> +.TP +\fB\-f\fR, \fB\-\-file=\fR<archive> +use archive file <archive> +.TP +\fB\-h\fR, \fB\-\-dereference\fR +follow symlinks; archive the files they point to +.TP +\fB\-n\fR, \fB\-\-threads=\fR<n> +set number of (de)compression threads [2] +.TP +\fB\-o\fR, \fB\-\-output=\fR<file> +compress to <file> ('\-' for stdout) +.TP +\fB\-p\fR, \fB\-\-preserve\-permissions\fR +don't subtract the umask on extraction +.TP +\fB\-q\fR, \fB\-\-quiet\fR +suppress all messages +.TP +\fB\-v\fR, \fB\-\-verbose\fR +verbosely list files processed +.TP +\fB\-0\fR .. \fB\-9\fR +set compression level [default 6] +.TP +\fB\-\-uncompressed\fR +don't compress the archive created +.TP +\fB\-\-asolid\fR +create solidly compressed appendable archive +.TP +\fB\-\-bsolid\fR +create per block compressed archive (default) +.TP +\fB\-\-dsolid\fR +create per directory compressed archive +.TP +\fB\-\-no\-solid\fR +create per file compressed archive +.TP +\fB\-\-solid\fR +create solidly compressed archive +.TP +\fB\-\-anonymous\fR +equivalent to '\-\-owner=root \fB\-\-group\fR=\fI\,root\/\fR' +.TP +\fB\-\-owner=\fR<owner> +use <owner> name/ID for files added to archive +.TP +\fB\-\-group=\fR<group> +use <group> name/ID for files added to archive +.TP +\fB\-\-exclude=\fR<pattern> +exclude files matching a shell pattern +.TP +\fB\-\-ignore\-ids\fR +ignore differences in owner and group IDs +.TP +\fB\-\-ignore\-metadata\fR +compare only file size and file content +.TP +\fB\-\-ignore\-overflow\fR +ignore mtime overflow differences on 32\-bit +.TP +\fB\-\-keep\-damaged\fR +don't delete partially extracted files +.TP +\fB\-\-missing\-crc\fR +exit with error status if missing extended CRC +.TP +\fB\-\-mtime=\fR<date> +use <date> as mtime for files added to archive +.TP +\fB\-\-out\-slots=\fR<n> +number of 1 MiB output packets buffered [64] +.TP +\fB\-\-warn\-newer\fR +warn if any file is newer than the archive +.PP +If no archive is specified, tarlz tries to read it from standard input or +write it to standard output. +.PP +Exit status: 0 for a normal exit, 1 for environmental problems +(file not found, files differ, invalid command\-line options, I/O errors, +etc), 2 to indicate a corrupt or invalid input file, 3 for an internal +consistency error (e.g., bug) which caused tarlz to panic. +.SH "REPORTING BUGS" +Report bugs to lzip\-bug@nongnu.org +.br +Tarlz home page: http://www.nongnu.org/lzip/tarlz.html +.SH COPYRIGHT +Copyright \(co 2024 Antonio Diaz Diaz. +Using lzlib 1.14\-rc1 +License GPLv2+: GNU GPL version 2 or later <http://gnu.org/licenses/gpl.html> +.br +This is free software: you are free to change and redistribute it. +There is NO WARRANTY, to the extent permitted by law. +.SH "SEE ALSO" +The full documentation for +.B tarlz +is maintained as a Texinfo manual. If the +.B info +and +.B tarlz +programs are properly installed at your site, the command +.IP +.B info tarlz +.PP +should give you access to the complete manual. diff --git a/doc/tarlz.info b/doc/tarlz.info new file mode 100644 index 0000000..25ba882 --- /dev/null +++ b/doc/tarlz.info @@ -0,0 +1,1287 @@ +This is tarlz.info, produced by makeinfo version 4.13+ from tarlz.texi. + +INFO-DIR-SECTION Archiving +START-INFO-DIR-ENTRY +* Tarlz: (tarlz). Archiver with multimember lzip compression +END-INFO-DIR-ENTRY + + +File: tarlz.info, Node: Top, Next: Introduction, Up: (dir) + +Tarlz Manual +************ + +This manual is for Tarlz (version 0.25, 3 January 2024). + +* Menu: + +* Introduction:: Purpose and features of tarlz +* Invoking tarlz:: Command-line interface +* Portable character set:: POSIX portable filename character set +* File format:: Detailed format of the compressed archive +* Amendments to pax format:: The reasons for the differences with pax +* Program design:: Internal structure of tarlz +* Multi-threaded decoding:: Limitations of parallel tar decoding +* Minimum archive sizes:: Sizes required for full multi-threaded speed +* Examples:: A small tutorial with examples +* Problems:: Reporting bugs +* Concept index:: Index of concepts + + + Copyright (C) 2013-2024 Antonio Diaz Diaz. + + This manual is free documentation: you have unlimited permission to copy, +distribute, and modify it. + + +File: tarlz.info, Node: Introduction, Next: Invoking tarlz, Prev: Top, Up: Top + +1 Introduction +************** + +Tarlz is a massively parallel (multi-threaded) combined implementation of +the tar archiver and the lzip compressor. Tarlz uses the compression +library lzlib. + + Tarlz creates tar archives using a simplified and safer variant of the +POSIX pax format compressed in lzip format, keeping the alignment between +tar members and lzip members. The resulting multimember tar.lz archive is +backward compatible with standard tar tools like GNU tar, which treat it +like any other tar.lz archive. Tarlz can append files to the end of such +compressed archives. + + Keeping the alignment between tar members and lzip members has two +advantages. It adds an indexed lzip layer on top of the tar archive, making +it possible to decode the archive safely in parallel. It also minimizes the +amount of data lost in case of corruption. Compressing a tar archive with +plzip may even double the amount of files lost for each lzip member damaged +because it does not keep the members aligned. + + Tarlz can create tar archives with five levels of compression +granularity: per file ('--no-solid'), per block ('--bsolid', default), per +directory ('--dsolid'), appendable solid ('--asolid'), and solid +('--solid'). It can also create uncompressed tar archives. + +Of course, compressing each file (or each directory) individually can't +achieve a compression ratio as high as compressing solidly the whole tar +archive, but it has the following advantages: + + * The resulting multimember tar.lz archive can be decompressed in + parallel, multiplying the decompression speed. + + * New members can be appended to the archive (by removing the + end-of-archive member), and unwanted members can be deleted from the + archive. Just like an uncompressed tar archive. + + * It is a safe POSIX-style backup format. In case of corruption, tarlz + can extract all the undamaged members from the tar.lz archive, + skipping over the damaged members, just like the standard + (uncompressed) tar. Moreover, the option '--keep-damaged' can be used + to recover as much data as possible from each damaged member, and + lziprecover can be used to recover some of the damaged members. + + * A multimember tar.lz archive is usually smaller than the corresponding + solidly compressed tar.gz archive, except when individually + compressing files smaller than about 32 KiB. + + Tarlz protects the extended records with a Cyclic Redundancy Check (CRC) +in a way compatible with standard tar tools. *Note crc32::. + + Tarlz does not understand other tar formats like 'gnu', 'oldgnu', +'star', or 'v7'. The command 'tarlz -t -f archive.tar.lz > /dev/null' can +be used to check that the format of the archive is compatible with tarlz. + + +File: tarlz.info, Node: Invoking tarlz, Next: Portable character set, Prev: Introduction, Up: Top + +2 Invoking tarlz +**************** + +The format for running tarlz is: + + tarlz OPERATION [OPTIONS] [FILES] + +All operations except '--concatenate' and '--compress' operate on whole +trees if any FILE is a directory. All operations except '--compress' +overwrite output files without warning. If no archive is specified, tarlz +tries to read it from standard input or write it to standard output. Tarlz +refuses to read archive data from a terminal or write archive data to a +terminal. Tarlz detects when the archive being created or enlarged is among +the files to be archived, appended, or concatenated, and skips it. + + Tarlz does not use absolute file names nor file names above the current +working directory (perhaps changed by option '-C'). On archive creation or +appending tarlz archives the files specified, but removes from member names +any leading and trailing slashes and any file name prefixes containing a +'..' component. On extraction, leading and trailing slashes are also +removed from member names, and archive members containing a '..' component +in the file name are skipped. Tarlz does not follow symbolic links during +extraction; not even symbolic links replacing intermediate directories. + + On extraction and listing, tarlz removes leading './' strings from +member names in the archive or given in the command line, so that +'tarlz -xf foo ./bar baz' extracts members 'bar' and './baz' from archive +'foo'. + + If several compression levels or '--*solid' options are given, the last +setting is used. For example '-9 --solid --uncompressed -1' is equivalent +to '-1 --solid'. + + tarlz supports the following operations: + +'--help' + Print an informative help message describing the options and exit. + +'-V' +'--version' + Print the version number of tarlz on the standard output and exit. + This version number should be included in all bug reports. + +'-A' +'--concatenate' + Append one or more archives to the end of an archive. If no archive is + specified with the option '-f', concatenate the input archives to + standard output. All the archives involved must be regular (seekable) + files, and must be either all compressed or all uncompressed. + Compressed and uncompressed archives can't be mixed. Compressed + archives must be multimember lzip files with the two end-of-archive + blocks plus any zero padding contained in the last lzip member of each + archive. The intermediate end-of-archive blocks are removed as each + new archive is concatenated. If the archive is uncompressed, tarlz + parses tar headers until it finds the end-of-archive blocks. Exit with + status 0 without modifying the archive if no FILES have been specified. + + Concatenating archives containing files in common results in two or + more tar members with the same name in the resulting archive, which + may produce nondeterministic behavior during multi-threaded extraction. + *Note mt-extraction::. + +'-c' +'--create' + Create a new archive from FILES. + +'-d' +'--diff' + Compare and report differences between archive and file system. For + each tar member in the archive, check that the corresponding file in + the file system exists and is of the same type (regular file, + directory, etc). Report on standard output the differences found in + type, mode (permissions), owner and group IDs, modification time, file + size, file contents (of regular files), target (of symlinks) and + device number (of block/character special files). + + As tarlz removes leading slashes from member names, the option '-C' may + be used in combination with '--diff' when absolute file names were used + on archive creation: 'tarlz -C / -d'. Alternatively, tarlz may be run + from the root directory to perform the comparison. + +'--delete' + Delete files and directories from an archive in place. It currently can + delete only from uncompressed archives and from archives with files + compressed individually ('--no-solid' archives). Note that files of + about '--data-size' or larger are compressed individually even if + '--bsolid' is used, and can therefore be deleted. Tarlz takes care to + not delete a tar member unless it is possible to do so. For example it + won't try to delete a tar member that is not compressed individually. + Even in the case of finding a corrupt member after having deleted some + member(s), tarlz stops and copies the rest of the file as soon as + corruption is found, leaving it just as corrupt as it was, but not + worse. + + To delete a directory without deleting the files under it, use + 'tarlz --delete -f foo --exclude='dir/*' dir'. Deleting in place may + be dangerous. A corrupt archive, a power cut, or an I/O error may cause + data loss. + +'-r' +'--append' + Append files to the end of an archive. The archive must be a regular + (seekable) file either compressed or uncompressed. Compressed members + can't be appended to an uncompressed archive, nor vice versa. If the + archive is compressed, it must be a multimember lzip file with the two + end-of-archive blocks plus any zero padding contained in the last lzip + member of the archive. It is possible to append files to an archive + with a different compression granularity. Appending works as follows; + first the end-of-archive blocks are removed, then the new members are + appended, and finally two new end-of-archive blocks are appended to + the archive. If the archive is uncompressed, tarlz parses and skips + tar headers until it finds the end-of-archive blocks. Exit with status + 0 without modifying the archive if no FILES have been specified. + + Appending files already present in the archive results in two or more + tar members with the same name, which may produce nondeterministic + behavior during multi-threaded extraction. *Note mt-extraction::. + +'-t' +'--list' + List the contents of an archive. If FILES are given, list only the + FILES given. + +'-x' +'--extract' + Extract files from an archive. If FILES are given, extract only the + FILES given. Else extract all the files in the archive. To extract a + directory without extracting the files under it, use + 'tarlz -xf foo --exclude='dir/*' dir'. Tarlz removes files and empty + directories unconditionally before extracting over them. Other than + that, it does not make any special effort to extract a file over an + incompatible type of file. For example, extracting a file over a + non-empty directory usually fails. + +'-z' +'--compress' + Compress existing POSIX tar archives aligning the lzip members to the + tar members with choice of granularity ('--bsolid' by default, + '--dsolid' works like '--asolid'). Exit with error status 2 if any + input archive is an empty file. The input archives are kept unchanged. + Existing compressed archives are not overwritten. A hyphen '-' used as + the name of an input archive reads from standard input and writes to + standard output (unless the option '--output' is used). Tarlz can be + used as compressor for GNU tar by using a command like + 'tar -c -Hustar foo | tarlz -z -o foo.tar.lz'. Tarlz can be used as + compressor for zupdate (zutils) by using a command like + 'zupdate --lz="tarlz -z" foo.tar.gz'. Note that tarlz only works + reliably on archives without global headers, or with global headers + whose content can be ignored. + + The compression is reversible, including any garbage present after the + end-of-archive blocks. Tarlz stops parsing after the first + end-of-archive block is found, and then compresses the rest of the + archive. Unless solid compression is requested, the end-of-archive + blocks are compressed in a lzip member separated from the preceding + members and from any non-zero garbage following the end-of-archive + blocks. '--compress' implies plzip argument style, not tar style. Each + input archive is compressed to a file with the extension '.lz' added + unless the option '--output' is used. When '--output' is used, only + one input archive can be specified. '-f' can't be used with + '--compress'. + +'--check-lib' + Compare the version of lzlib used to compile tarlz with the version + actually being used at run time and exit. Report any differences + found. Exit with error status 1 if differences are found. A mismatch + may indicate that lzlib is not correctly installed or that a different + version of lzlib has been installed after compiling tarlz. Exit with + error status 2 if LZ_API_VERSION and LZ_version_string don't match. + 'tarlz -v --check-lib' shows the version of lzlib being used and the + value of LZ_API_VERSION (if defined). *Note Library version: + (lzlib)Library version. + + + tarlz supports the following options: *Note Argument syntax: +(arg_parser)Argument syntax. + +'-B BYTES' +'--data-size=BYTES' + Set target size of input data blocks for the option '--bsolid'. *Note + --bsolid::. Valid values range from 8 KiB to 1 GiB. Default value is + two times the dictionary size, except for option '-0' where it + defaults to 1 MiB. *Note Minimum archive sizes::. + +'-C DIR' +'--directory=DIR' + Change to directory DIR. When creating, appending, comparing, or + extracting, the position of each '-C' option in the command line is + significant; it changes the current working directory for the following + FILES until a new '-C' option appears in the command line. '--list' + and '--delete' ignore any '-C' options specified. DIR is relative to + the then current working directory, perhaps changed by a previous '-C' + option. + + Note that a process can only have one current working directory (CWD). + Therefore multi-threading can't be used to create or decode an archive + if a '-C' option appears after a (relative) file name in the command + line. (All file names are made relative when decoding). + +'-f ARCHIVE' +'--file=ARCHIVE' + Use archive file ARCHIVE. A hyphen '-' used as an ARCHIVE argument + reads from standard input or writes to standard output. + +'-h' +'--dereference' + Follow symbolic links during archive creation, appending or comparison. + Archive or compare the files they point to instead of the links + themselves. + +'-n N' +'--threads=N' + Set the number of (de)compression threads, overriding the system's + default. Valid values range from 0 to "as many as your system can + support". A value of 0 disables threads entirely. If this option is + not used, tarlz tries to detect the number of processors in the system + and use it as default value. 'tarlz --help' shows the system's default + value. See the note about multi-threading in the option '-C' above. + + Note that the number of usable threads is limited during compression to + ceil( uncompressed_size / data_size ) (*note Minimum archive sizes::), + and during decompression to the number of lzip members in the tar.lz + archive, which you can find by running 'lzip -lv archive.tar.lz'. + +'-o FILE' +'--output=FILE' + Write the compressed output to FILE. '-o -' writes the compressed + output to standard output. Currently '--output' only works with + '--compress'. + +'-p' +'--preserve-permissions' + On extraction, set file permissions as they appear in the archive. + This is the default behavior when tarlz is run by the superuser. The + default for other users is to subtract the umask of the user running + tarlz from the permissions specified in the archive. + +'-q' +'--quiet' + Quiet operation. Suppress all messages. + +'-v' +'--verbose' + Verbosely list files processed. Further -v's (up to 4) increase the + verbosity level. + +'-0 .. -9' + Set the compression level for '--create', '--append', and + '--compress'. The default compression level is '-6'. Like lzip, tarlz + also minimizes the dictionary size of the lzip members it creates, + reducing the amount of memory required for decompression. + + Level Dictionary size Match length limit + -0 64 KiB 16 bytes + -1 1 MiB 5 bytes + -2 1.5 MiB 6 bytes + -3 2 MiB 8 bytes + -4 3 MiB 12 bytes + -5 4 MiB 20 bytes + -6 8 MiB 36 bytes + -7 16 MiB 68 bytes + -8 24 MiB 132 bytes + -9 32 MiB 273 bytes + +'--uncompressed' + With '--create', don't compress the tar archive created. Create an + uncompressed tar archive instead. With '--append', don't compress the + new members appended to the tar archive. Compressed members can't be + appended to an uncompressed archive, nor vice versa. '--uncompressed' + can be omitted if it can be deduced from the archive name. (An + uncompressed archive name lacks a '.lz' or '.tlz' extension). + +'--asolid' + When creating or appending to a compressed archive, use appendable + solid compression. All the files being added to the archive are + compressed into a single lzip member, but the end-of-archive blocks + are compressed into a separate lzip member. This creates a solidly + compressed appendable archive. Solid archives can't be created nor + decoded in parallel. + +'--bsolid' + When creating or appending to a compressed archive, use block + compression. Tar members are compressed together in a lzip member + until they approximate a target uncompressed size. The size can't be + exact because each solidly compressed data block must contain an + integer number of tar members. Block compression is the default + because it improves compression ratio for archives with many files + smaller than the block size. This option allows tarlz revert to + default behavior if, for example, it is invoked through an alias like + 'tar='tarlz --solid''. *Note --data-size::, to set the target block + size. + +'--dsolid' + When creating or appending to a compressed archive, compress each file + specified in the command line separately in its own lzip member, and + use solid compression for each directory specified in the command + line. The end-of-archive blocks are compressed into a separate lzip + member. This creates a compressed appendable archive with a separate + lzip member for each file or top-level directory specified. + +'--no-solid' + When creating or appending to a compressed archive, compress each file + separately in its own lzip member. The end-of-archive blocks are + compressed into a separate lzip member. This creates a compressed + appendable archive with a lzip member for each file. + +'--solid' + When creating or appending to a compressed archive, use solid + compression. The files being added to the archive, along with the + end-of-archive blocks, are compressed into a single lzip member. The + resulting archive is not appendable. No more files can be later + appended to the archive. Solid archives can't be created nor decoded + in parallel. + +'--anonymous' + Equivalent to '--owner=root --group=root'. + +'--owner=OWNER' + When creating or appending, use OWNER for files added to the archive. + If OWNER is not a valid user name, it is decoded as a decimal numeric + user ID. + +'--group=GROUP' + When creating or appending, use GROUP for files added to the archive. + If GROUP is not a valid group name, it is decoded as a decimal numeric + group ID. + +'--exclude=PATTERN' + Exclude files matching a shell pattern like '*.o'. A file is considered + to match if any component of the file name matches. For example, '*.o' + matches 'foo.o', 'foo.o/bar' and 'foo/bar.o'. If PATTERN contains a + '/', it matches a corresponding '/' in the file name. For example, + 'foo/*.o' matches 'foo/bar.o'. Multiple '--exclude' options can be + specified. + +'--ignore-ids' + Make '--diff' ignore differences in owner and group IDs. This option is + useful when comparing an '--anonymous' archive. + +'--ignore-metadata' + Make '--diff' ignore any differences in metadata (file permissions, + owner and group IDs, modification time). Compare only file type, file + size, and file content. This option is useful when file permissions + have not been fully restored because uid/gid changed on extraction. + +'--ignore-overflow' + Make '--diff' ignore differences in mtime caused by overflow on 32-bit + systems with a 32-bit time_t. + +'--keep-damaged' + Don't delete partially extracted files. If a decompression error + happens while extracting a file, keep the partial data extracted. Use + this option to recover as much data as possible from each damaged + member. It is recommended to run tarlz in single-threaded mode + ('--threads=0') when using this option. + +'--missing-crc' + Exit with error status 2 if the CRC of the extended records is + missing. When this option is used, tarlz detects any corruption in the + extended records (only limited by CRC collisions). But note that a + corrupt 'GNU.crc32' keyword, for example 'GNU.crc30', is reported as a + missing CRC instead of as a corrupt record. This misleading + 'Missing CRC' message is the consequence of a flaw in the POSIX pax + format; i.e., the lack of a mandatory check sequence of the extended + records. *Note crc32::. + +'--mtime=DATE' + When creating or appending, use DATE as the modification time for + files added to the archive instead of their actual modification times. + The value of DATE may be either '@' followed by the number of seconds + since (or before) the epoch, or a date in format + '[-]YYYY-MM-DD HH:MM:SS' or '[-]YYYY-MM-DDTHH:MM:SS', or the name of + an existing reference file starting with '.' or '/' whose modification + time is used. The time of day 'HH:MM:SS' in the date format is + optional and defaults to '00:00:00'. The epoch is + '1970-01-01 00:00:00 UTC'. Negative seconds or years define a + modification time before the epoch. + +'--out-slots=N' + Number of 1 MiB output packets buffered per worker thread during + multi-threaded creation or appending to compressed archives. + Increasing the number of packets may increase compression speed if the + files being archived are larger than 64 MiB compressed, but requires + more memory. Valid values range from 1 to 1024. The default value is + 64. + +'--warn-newer' + During archive creation, warn if any file being archived has a + modification time newer than the archive creation time. This option + may slow archive creation somewhat because it makes an extra call to + 'stat' after archiving each file, but it guarantees that file contents + were not modified during the creation of the archive. Note that the + file must be at least one second newer than the archive for it to be + detected as newer. + + + Exit status: 0 for a normal exit, 1 for environmental problems (file not +found, files differ, invalid command-line options, I/O errors, etc), 2 to +indicate a corrupt or invalid input file, 3 for an internal consistency +error (e.g., bug) which caused tarlz to panic. + + +File: tarlz.info, Node: Portable character set, Next: File format, Prev: Invoking tarlz, Up: Top + +3 POSIX portable filename character set +*************************************** + +The set of characters from which portable file names are constructed. + + A B C D E F G H I J K L M N O P Q R S T U V W X Y Z + a b c d e f g h i j k l m n o p q r s t u v w x y z + 0 1 2 3 4 5 6 7 8 9 . _ - + + The last three characters are the period, underscore, and hyphen-minus +characters, respectively. + + File names are identifiers. Therefore, archiving works better when file +names use only the portable character set without spaces added. + + +File: tarlz.info, Node: File format, Next: Amendments to pax format, Prev: Portable character set, Up: Top + +4 File format +************* + +In the diagram below, a box like this: + ++---+ +| | <-- the vertical bars might be missing ++---+ + + represents one byte; a box like this: + ++==============+ +| | ++==============+ + + represents a variable number of bytes or a fixed but large number of +bytes (for example 512). + + + A tar.lz file consists of one or more lzip members (compressed data +sets). The members simply appear one after another in the file, with no +additional information before, between, or after them. + + Each lzip member contains one or more tar members in a simplified POSIX +pax interchange format. The only pax typeflag value supported by tarlz (in +addition to the typeflag values defined by the ustar format) is 'x'. The +pax format is an extension on top of the ustar format that removes the size +limitations of the ustar format. + + Each tar member contains one file archived, and is represented by the +following sequence: + + * An optional extended header block followed by one or more blocks that + contain the extended header records as if they were the contents of a + file; i.e., the extended header records are included as the data for + this header block. This header block is of the form described in pax + header block, with a typeflag value of 'x'. + + * A header block in ustar format that describes the file. Any fields + defined in the preceding optional extended header records override the + associated fields in this header block for this file. + + * Zero or more blocks that contain the contents of the file. + + Each tar member must be contiguously stored in a lzip member for the +parallel decoding operations like '--list' to work. If any tar member is +split over two or more lzip members, the archive must be decoded +sequentially. *Note Multi-threaded decoding::. + + At the end of the archive file there are two 512-byte blocks filled with +binary zeros, interpreted as an end-of-archive indicator. These EOA blocks +are either compressed in a separate lzip member or compressed along with the +tar members contained in the last lzip member. For a compressed archive to +be recognized by tarlz as appendable, the last lzip member must contain +between 512 and 32256 zeros alone (without any non-zero bytes). + + The diagram below shows the correspondence between each tar member +(formed by one or two headers plus optional data) in the tar archive and +each lzip member in the resulting multimember tar.lz archive, when per file +compression is used: *Note File format: (lzip)File format. + +tar ++========+======+=================+===============+========+======+========+ +| header | data | extended header | extended data | header | data | EOA | ++========+======+=================+===============+========+======+========+ + +tar.lz ++===============+=================================================+========+ +| member | member | member | ++===============+=================================================+========+ + + +4.1 Pax header block +==================== + +The pax header block is identical to the ustar header block described below +except that the typeflag has the value 'x' (extended). The field 'size' is +the size of the extended header data in bytes. Most other fields in the pax +header block are zeroed on archive creation to prevent trouble if the +archive is read by an ustar tool, and are ignored by tarlz on archive +extraction. *Note flawed-compat::. + + Tarlz limits the size of the pax extended header data so that the whole +header set (extended header + extended data + ustar header) can be read and +decoded in a buffer of size INT_MAX. + + The pax extended header data consists of one or more records, each of +them constructed as follows: +'"%d %s=%s\n", <length>, <keyword>, <value>' + + The fields <length> and <keyword> in the record must be limited to the +portable character set (*note Portable character set::). The field <length> +contains the decimal length of the record in bytes, including the trailing +newline. The field <value> is stored as-is, without conversion to UTF-8 nor +any other transformation. The fields are separated by the ASCII characters +space, equal-sign, and newline. + + These are the <keyword> values currently supported by tarlz: + +'atime' + The signed decimal representation of the access time of the following + file in seconds since (or before) the epoch, obtained from the function + 'stat'. The atime record is created only for files with a modification + time outside of the ustar range. *Note ustar-mtime::. + +'gid' + The unsigned decimal representation of the group ID of the group that + owns the following file. The gid record is created only for files with + a group ID greater than 2_097_151 (octal 7_777_777). *Note + ustar-uid-gid::. + +'linkpath' + The file name of a link being created to another file, of any type, + previously archived. This record overrides the field 'linkname' in the + following ustar header block. The following ustar header block + determines the type of link created. If typeflag of the following + header block is 1, a hard link is created. If typeflag is 2, a + symbolic link is created and the linkpath value is used as the + contents of the symbolic link. The linkpath record is created only for + links with a link name that does not fit in the space provided by the + ustar header. + +'mtime' + The signed decimal representation of the modification time of the + following file in seconds since (or before) the epoch, obtained from + the function 'stat'. This record overrides the field 'mtime' in the + following ustar header block. The mtime record is created only for + files with a modification time outside of the ustar range. *Note + ustar-mtime::. + +'path' + The file name of the following file. This record overrides the fields + 'name' and 'prefix' in the following ustar header block. The path + record is created for files with a name that does not fit in the space + provided by the ustar header, but is also created for files that + require any other extended record so that the fields 'name' and + 'prefix' in the following ustar header block can be zeroed. + +'size' + The size of the file in bytes, expressed as a decimal number using + digits from the ISO/IEC 646:1991 (ASCII) standard. This record + overrides the field 'size' in the following ustar header block. The + size record is created only for files with a size value greater than + 8_589_934_591 (octal 77_777_777_777); that is, 8 GiB (2^33 bytes) or + larger. + +'uid' + The unsigned decimal representation of the user ID of the file owner + of the following file. The uid record is created only for files with a + user ID greater than 2_097_151 (octal 7_777_777). *Note + ustar-uid-gid::. + +'GNU.crc32' + CRC32-C (Castagnoli) of the extended header data excluding the 8 bytes + representing the CRC <value> itself. The <value> is represented as 8 + hexadecimal digits in big endian order, '22 GNU.crc32=00000000\n'. The + keyword of the CRC record is protected by the CRC to guarantee that + corruption is always detected when using '--missing-crc' (except in + case of CRC collision). A CRC was chosen because a checksum is too + weak for a potentially large list of variable sized records. A + checksum can't detect simple errors like the swapping of two bytes. + + + At verbosity level 1 or higher tarlz prints a diagnostic for each unknown +extended header keyword found in an archive, once per keyword. + + +4.2 Ustar header block +====================== + +The ustar header block has a length of 512 bytes and is structured as shown +in the following table. All lengths and offsets are in decimal. + +Field Name Offset Length (in bytes) +name 0 100 +mode 100 8 +uid 108 8 +gid 116 8 +size 124 12 +mtime 136 12 +chksum 148 8 +typeflag 156 1 +linkname 157 100 +magic 257 6 +version 263 2 +uname 265 32 +gname 297 32 +devmajor 329 8 +devminor 337 8 +prefix 345 155 + + All characters in the header block are coded using the ISO/IEC 646:1991 +(ASCII) standard, except in fields storing names for files, users, and +groups. For maximum portability between implementations, names should only +contain characters from the portable character set (*note Portable +character set::), but if an implementation supports the use of characters +outside of '/' and the portable character set in names for files, users, +and groups, tarlz will use the byte values in these names unmodified. + + The fields 'name', 'linkname', and 'prefix' are null-terminated +character strings except when all characters in the array contain non-null +characters including the last character. + + The fields 'name' and 'prefix' produce the file name. A new file name is +formed, if prefix is not an empty string (its first character is not null), +by concatenating prefix (up to the first null character), a slash +character, and name; otherwise, name is used alone. In either case, name is +terminated at the first null character. If prefix begins with a null +character, it is ignored. In this manner, file names of at most 256 +characters can be supported. If a file name does not fit in the space +provided, an extended record is used to store the file name. + + The field 'linkname' does not use the prefix to produce a file name. If +the link name does not fit in the 100 characters provided, an extended +record is used to store the link name. + + The field 'mode' provides 12 access permission bits. The following table +shows the symbolic name of each bit and its octal value: + +Bit Name Value Bit Name Value Bit Name Value +--------------------------------------------------- +S_ISUID 04000 S_ISGID 02000 S_ISVTX 01000 +S_IRUSR 00400 S_IWUSR 00200 S_IXUSR 00100 +S_IRGRP 00040 S_IWGRP 00020 S_IXGRP 00010 +S_IROTH 00004 S_IWOTH 00002 S_IXOTH 00001 + + The fields 'uid' and 'gid' are the user and group IDs of the owner and +group of the file, respectively. If the file uid or gid are greater than +2_097_151 (octal 7_777_777), an extended record is used to store the uid or +gid. + + The field 'size' contains the octal representation of the size of the +file in bytes. If the field 'typeflag' specifies a file of type '0' +(regular file) or '7' (high performance regular file), the number of logical +records following the header is (size / 512) rounded to the next integer. +For all other values of typeflag, tarlz either sets the size field to 0 or +ignores it, and does not store or expect any logical records following the +header. If the file size is larger than 8_589_934_591 bytes +(octal 77_777_777_777), an extended record is used to store the file size. + + The field 'mtime' contains the octal representation of the modification +time of the file at the time it was archived, obtained from the function +'stat'. If the modification time is negative or larger than 8_589_934_591 +(octal 77_777_777_777) seconds since the epoch, an extended record is used +to store the modification time. The ustar range of mtime goes from +'1970-01-01 00:00:00 UTC' to '2242-03-16 12:56:31 UTC'. + + The field 'chksum' contains the octal representation of the value of the +simple sum of all bytes in the header logical record. Each byte in the +header is treated as an unsigned value. When calculating the checksum, the +chksum field is treated as if it were all space characters. + + The field 'typeflag' contains a single character specifying the type of +file archived: + +''0'' + Regular file. + +''1'' + Hard link to another file, of any type, previously archived. Hard + links must not contain file data. + +''2'' + Symbolic link. + +''3', '4'' + Character special file and block special file respectively. In this + case the fields 'devmajor' and 'devminor' contain information defining + the device in unspecified format. + +''5'' + Directory. + +''6'' + FIFO special file. + +''7'' + Reserved to represent a file to which an implementation has associated + some high-performance attribute (contiguous file). Tarlz treats this + type of file as a regular file (type 0). + + + The field 'magic' contains the ASCII null-terminated string "ustar". The +field 'version' contains the characters "00" (0x30,0x30). The fields +'uname' and 'gname' are null-terminated character strings except when all +characters in the array contain non-null characters including the last +character. Each numeric field contains a leading space- or zero-filled, +optionally null-terminated octal number using digits from the ISO/IEC +646:1991 (ASCII) standard. Tarlz is able to decode numeric fields 1 byte +longer than standard ustar by not requiring a terminating null character. + + +File: tarlz.info, Node: Amendments to pax format, Next: Program design, Prev: File format, Up: Top + +5 The reasons for the differences with pax +****************************************** + +Tarlz creates safe archives that allow the reliable detection of invalid or +corrupt metadata during decoding even when the integrity checking of lzip +can't be used because the lzip members are only decompressed partially, as +it happens in parallel '--diff', '--list', and '--extract'. In order to +achieve this goal and avoid some other flaws in the pax format, tarlz makes +some changes to the variant of the pax format that it uses. This chapter +describes these changes and the concrete reasons to implement them. + + +5.1 Add a CRC of the extended records +===================================== + +The POSIX pax format has a serious flaw. The metadata stored in pax extended +records are not protected by any kind of check sequence. Corruption in a +long file name may cause the extraction of the file in the wrong place +without warning. Corruption in a large file size may cause the truncation of +the file or the appending of garbage to the file, both followed by a +spurious warning about a corrupt header far from the place of the undetected +corruption. + + Metadata like file name and file size must be always protected in an +archive format because of the adverse effects of undetected corruption in +them, potentially much worse that undetected corruption in the data. Even +more so in the case of pax because the amount of metadata it stores is +potentially large, making undetected corruption and archiver misbehavior +more probable. + + Headers and metadata must be protected separately from data because the +integrity checking of lzip may not be able to detect the corruption before +the metadata have been used, for example, to create a new file in the wrong +place. + + Because of the above, tarlz protects the extended records with a Cyclic +Redundancy Check (CRC) in a way compatible with standard tar tools. *Note +key_crc32::. + + +5.2 Remove flawed backward compatibility +======================================== + +In order to allow the extraction of pax archives by a tar utility conforming +to the POSIX-2:1993 standard, POSIX.1-2008 recommends selecting extended +header field values that allow such tar to create a regular file containing +the extended header records as data. This approach is broken because if the +extended header is needed because of a long file name, the fields 'name' +and 'prefix' are unable to contain the full file name. (Some tar +implementations store the truncated name in the field 'name' alone, +truncating the name to only 100 bytes instead of 256). Therefore the files +corresponding to both the extended header and the overridden ustar header +are extracted using truncated file names, perhaps overwriting existing +files or directories. It may be a security risk to extract a file with a +truncated file name. + + To avoid this problem, tarlz writes extended headers with all fields +zeroed except 'size' (which contains the size of the extended records), +'chksum', 'typeflag', 'magic', and 'version'. In particular, tarlz sets the +fields 'name' and 'prefix' to zero. This prevents old tar programs from +extracting the extended records as a file in the wrong place. Tarlz also +sets to zero those fields of the ustar header overridden by extended +records. Finally, tarlz skips members with zeroed 'name' and 'prefix' when +decoding, except when listing. This is needed to detect certain format +violations during parallel extraction. + + If an extended header is required for any reason (for example a file +size of 8 GiB or larger, or a link name longer than 100 bytes), tarlz also +moves the file name to the extended records to prevent an ustar tool from +trying to extract the file or link. This also makes easier during parallel +decoding the detection of a tar member split between two lzip members at +the boundary between the extended header and the ustar header. + + +5.3 As simple as possible (but not simpler) +=========================================== + +The tarlz format is mainly ustar. Extended pax headers are used only when +needed because the length of a file name or link name, or the size or other +attribute of a file exceed the limits of the ustar format. Adding 1 KiB of +extended header and records to each member just to save subsecond +timestamps seems wasteful for a backup format. Moreover, minimizing the +overhead may help recovering the archive with lziprecover in case of +corruption. + + Global pax headers are tolerated, but not supported; they are parsed and +ignored. Some operations may not behave as expected if the archive contains +global headers. + + +5.4 Improve reproducibility +=========================== + +Pax includes by default the process ID of the pax process in the ustar name +of the extended headers, making the archive not reproducible. Tarlz stores +the true name of the file just once, either in the ustar header or in the +extended records, making it easier to produce reproducible archives. + + Pax allows an extended record to have length x-1 or x if x is a power of +ten; '99<97_bytes>' or '100<97_bytes>'. Tarlz minimizes the length of the +record and always produces a length of x-1 in these cases. + + +5.5 No data in hard links +========================= + +Tarlz does not allow data in hard link members. The data (if any) must be in +the member determining the type of the file (which can't be a link). If all +the names of a file are stored as hard links, the type of the file is lost. +Not allowing data in hard links also prevents invalid actions like +extracting file data for a hard link to a symbolic link or to a directory. + + +5.6 Avoid misconversions to/from UTF-8 +====================================== + +There is no portable way to tell what charset a text string is coded into. +Therefore, tarlz stores all fields representing text strings unmodified, +without conversion to UTF-8 nor any other transformation. This prevents +accidental double UTF-8 conversions. If the need arises this behavior will +be adjusted with a command-line option in the future. + + +File: tarlz.info, Node: Program design, Next: Multi-threaded decoding, Prev: Amendments to pax format, Up: Top + +6 Internal structure of tarlz +***************************** + +The parts of tarlz related to sequential processing of the archive are more +or less similar to any other tar and won't be described here. The +interesting parts described here are those related to Multi-threaded +processing. + + The structure of the part of tarlz performing Multi-threaded archive +creation is somewhat similar to that of plzip with the added complication +of the solidity levels. *Note Program design: (plzip)Program design. A +grouper thread and several worker threads are created, acting the main +thread as muxer (multiplexer) thread. A "packet courier" takes care of data +transfers among threads and limits the maximum number of data blocks +(packets) being processed simultaneously. + + The grouper traverses the directory tree, groups together the metadata of +the files to be archived in each lzip member, and distributes them to the +workers. The workers compress the metadata received from the grouper along +with the file data read from the file system. The muxer collects processed +packets from the workers, and writes them to the archive. + +.--------. +| data|---> to each worker below +| | .------------. +| file | ,-->| worker 0 |--, +| system | | `------------' | +| | .---------. | .------------. | .-------. .---------. +|metadata|--->| grouper |-+-->| worker 1 |--+-->| muxer |-->| archive | +`--------' `---------' | `------------' | `-------' `---------' + | ... | + | .------------. | + `-->| worker N-1 |--' + `------------' + + Decoding an archive is somewhat similar to how plzip decompresses a +regular file to standard output, with the differences that it is not the +data but only messages what is written to stdout/stderr, and that each +worker may access files in the file system either to read them (diff) or +write them (extract). As in plzip, each worker reads members directly from +the archive. + +.--------. +| file |<---> data to/from each worker below +| system | +`--------' .------------. + ,-->| worker 0 |--, + | `------------' | +.---------. | .------------. | .-------. .--------. +| archive |-+-->| worker 1 |--+-->| muxer |-->| stdout | +`---------' | `------------' | `-------' | stderr | + | ... | `--------' + | .------------. | + `-->| worker N-1 |--' + `------------' + + As misaligned tar.lz archives can't be decoded in parallel, and the +misalignment can't be detected until after decoding has started, a +"mastership request" mechanism has been designed that allows the decoding to +continue instead of signalling an error. + + During parallel decoding, if a worker finds a misalignment, it requests +mastership to decode the rest of the archive. When mastership is requested, +an error_member_id is set, and all subsequently received packets with +member_id > error_member_id are rejected. All workers requesting mastership +are blocked at the request_mastership call until mastership is granted. +Mastership is granted to the delivering worker when its queue is empty to +make sure that all preceding packets have been processed. When mastership is +granted, all packets are deleted and all subsequently received packets not +coming from the master are rejected. + + If a worker can't continue decoding for any cause (for example lack of +memory or finding a split tar member at the beginning of a lzip member), it +requests mastership to print an error and terminate the program. Only if +some other worker requests mastership in a previous lzip member can this +error be avoided. + + +File: tarlz.info, Node: Multi-threaded decoding, Next: Minimum archive sizes, Prev: Program design, Up: Top + +7 Limitations of parallel tar decoding +************************************** + +Safely decoding an arbitrary tar archive in parallel is only possible if one +decodes the headers sequentially first. For example, if a tar archive +containing another tar archive is decoded starting from some position other +than the beginning, there is no way to know if the first header found there +belongs to the outer tar archive or to the inner tar archive. Tar is a +format inherently serial; it was designed for tapes. + + The pax format is even more serial than the ustar format. Two headers +need to be decoded sequentially for each file. The extended header may even +need parsing to reveal something as basic as file size. If a thread decodes +the ustar header skipping the preceding extended header, it may extract a +file of incorrect size at the wrong place. Moreover, a pax archive with +global headers can't be decoded in parallel because each thread can't know +about the global headers decoded by other threads. + + In the case of compressed tar archives, the start of each compressed +block determines one point through which the tar archive can be decoded in +parallel. Therefore, in tar.lz archives the decoding operations can't be +parallelized if the tar members are not aligned with the lzip members. Tar +archives compressed with plzip can't be decoded in parallel because tar and +plzip do not have a way to align both sets of members. Certainly one can +decompress one such archive with a multi-threaded tool like plzip, but the +increase in speed is not as large as it could be because plzip must +serialize the decompressed data and pass them to tar, which decodes them +sequentially, one tar member at a time. + + On the other hand, if the tar.lz archive is created with a tool like +tarlz, which can guarantee the alignment between tar members and lzip +members because it controls both archiving and compression, then the lzip +format becomes an indexed layer on top of the tar archive which makes +possible decoding it safely in parallel. + + Tarlz is able to automatically decode aligned and unaligned multimember +tar.lz archives, keeping backwards compatibility. If tarlz finds a member +misalignment during multi-threaded decoding, it switches to single-threaded +mode and continues decoding the archive. + + If the files in the archive are large, multi-threaded '--list' on a +regular (seekable) tar.lz archive can be hundreds of times faster than +sequential '--list' because, in addition to using several processors, it +only needs to decompress part of each lzip member. See the following +example listing the Silesia corpus on a dual core machine: + + tarlz -9 --no-solid -cf silesia.tar.lz silesia + time lzip -cd silesia.tar.lz | tar -tf - (5.032s) + time plzip -cd silesia.tar.lz | tar -tf - (3.256s) + time tarlz -tf silesia.tar.lz (0.020s) + + On the other hand, multi-threaded '--list' won't detect corruption in +the tar member data because it only decodes the part of each lzip member +corresponding to the tar member header. This is another reason why the tar +headers must provide their own integrity checking. + + +7.1 Limitations of multi-threaded extraction +============================================ + +Multi-threaded extraction may produce different output than single-threaded +extraction in some cases: + + During multi-threaded extraction, several independent threads are +simultaneously reading the archive and creating files in the file system. +The archive is not read sequentially. As a consequence, any error or +weirdness in the archive (like a corrupt member or an end-of-archive block +in the middle of the archive) won't be usually detected until part of the +archive beyond that point has been processed. + + If the archive contains two or more tar members with the same name, +single-threaded extraction extracts the members in the order they appear in +the archive and leaves in the file system the last version of the file. But +multi-threaded extraction may extract the members in any order and leave in +the file system any version of the file nondeterministically. It is +unspecified which of the tar members is extracted. + + If the same file is extracted through several paths (different member +names resolve to the same file in the file system), the result is undefined. +(Probably the resulting file will be mangled). + + Extraction of a hard link may fail if it is extracted before the file it +links to. + + +File: tarlz.info, Node: Minimum archive sizes, Next: Examples, Prev: Multi-threaded decoding, Up: Top + +8 Minimum archive sizes required for multi-threaded block compression +********************************************************************* + +When creating or appending to a compressed archive using multi-threaded +block compression, tarlz puts tar members together in blocks and compresses +as many blocks simultaneously as worker threads are chosen, creating a +multimember compressed archive. + + For this to work as expected (and roughly multiply the compression speed +by the number of available processors), the uncompressed archive must be at +least as large as the number of worker threads times the block size (*note +--data-size::). Else some processors do not get any data to compress, and +compression is proportionally slower. The maximum speed increase achievable +on a given archive is limited by the ratio (uncompressed_size / data_size). +For example, a tarball the size of gcc or linux scales up to 10 or 14 +processors at level -9. + + The following table shows the minimum uncompressed archive size needed +for full use of N processors at a given compression level, using the default +data size for each level: + +Processors 2 4 8 16 64 256 +------------------------------------------------------------------ +Level +-0 2 MiB 4 MiB 8 MiB 16 MiB 64 MiB 256 MiB +-1 4 MiB 8 MiB 16 MiB 32 MiB 128 MiB 512 MiB +-2 6 MiB 12 MiB 24 MiB 48 MiB 192 MiB 768 MiB +-3 8 MiB 16 MiB 32 MiB 64 MiB 256 MiB 1 GiB +-4 12 MiB 24 MiB 48 MiB 96 MiB 384 MiB 1.5 GiB +-5 16 MiB 32 MiB 64 MiB 128 MiB 512 MiB 2 GiB +-6 32 MiB 64 MiB 128 MiB 256 MiB 1 GiB 4 GiB +-7 64 MiB 128 MiB 256 MiB 512 MiB 2 GiB 8 GiB +-8 96 MiB 192 MiB 384 MiB 768 MiB 3 GiB 12 GiB +-9 128 MiB 256 MiB 512 MiB 1 GiB 4 GiB 16 GiB + + +File: tarlz.info, Node: Examples, Next: Problems, Prev: Minimum archive sizes, Up: Top + +9 A small tutorial with examples +******************************** + +Example 1: Create a multimember compressed archive 'archive.tar.lz' +containing files 'a', 'b' and 'c'. + + tarlz -cf archive.tar.lz a b c + + +Example 2: Append files 'd' and 'e' to the multimember compressed archive +'archive.tar.lz'. + + tarlz -rf archive.tar.lz d e + + +Example 3: Create a solidly compressed appendable archive 'archive.tar.lz' +containing files 'a', 'b' and 'c'. Then append files 'd' and 'e' to the +archive. + + tarlz --asolid -cf archive.tar.lz a b c + tarlz --asolid -rf archive.tar.lz d e + + +Example 4: Create a compressed appendable archive containing directories +'dir1', 'dir2' and 'dir3' with a separate lzip member per directory. Then +append files 'a', 'b', 'c', 'd' and 'e' to the archive, all of them +contained in a single lzip member. The resulting archive 'archive.tar.lz' +contains 5 lzip members (including the end-of-archive member). + + tarlz --dsolid -cf archive.tar.lz dir1 dir2 dir3 + tarlz --asolid -rf archive.tar.lz a b c d e + + +Example 5: Create a solidly compressed archive 'archive.tar.lz' containing +files 'a', 'b' and 'c'. Note that no more files can be later appended to +the archive. + + tarlz --solid -cf archive.tar.lz a b c + + +Example 6: Extract all files from archive 'archive.tar.lz'. + + tarlz -xf archive.tar.lz + + +Example 7: Extract files 'a' and 'c', and the whole tree under directory +'dir1' from archive 'archive.tar.lz'. + + tarlz -xf archive.tar.lz a c dir1 + + +Example 8: Copy the contents of directory 'sourcedir' to the directory +'destdir'. + + tarlz -C sourcedir --uncompressed -cf - . | tarlz -C destdir -xf - + + +Example 9: Compress the existing POSIX archive 'archive.tar' and write the +output to 'archive.tar.lz'. Compress each member individually for maximum +availability. (If one member in the compressed archive gets damaged, the +other members can still be extracted). + + tarlz -z --no-solid archive.tar + + +Example 10: Compress the archive 'archive.tar' and write the output to +'foo.tar.lz'. + + tarlz -z -o foo.tar.lz archive.tar + + +Example 11: Concatenate and compress two archives 'archive1.tar' and +'archive2.tar', and write the output to 'foo.tar.lz'. + + tarlz -A archive1.tar archive2.tar | tarlz -z -o foo.tar.lz + + +File: tarlz.info, Node: Problems, Next: Concept index, Prev: Examples, Up: Top + +10 Reporting bugs +***************** + +There are probably bugs in tarlz. There are certainly errors and omissions +in this manual. If you report them, they will get fixed. If you don't, no +one will ever know about them and they will remain unfixed for all +eternity, if not longer. + + If you find a bug in tarlz, please send electronic mail to +<lzip-bug@nongnu.org>. Include the version number, which you can find by +running 'tarlz --version' and 'tarlz -v --check-lib'. + + +File: tarlz.info, Node: Concept index, Prev: Problems, Up: Top + +Concept index +************* + + +* Menu: + +* Amendments to pax format: Amendments to pax format. (line 6) +* bugs: Problems. (line 6) +* examples: Examples. (line 6) +* file format: File format. (line 6) +* getting help: Problems. (line 6) +* introduction: Introduction. (line 6) +* invoking: Invoking tarlz. (line 6) +* minimum archive sizes: Minimum archive sizes. (line 6) +* options: Invoking tarlz. (line 6) +* parallel tar decoding: Multi-threaded decoding. (line 6) +* portable character set: Portable character set. (line 6) +* program design: Program design. (line 6) +* usage: Invoking tarlz. (line 6) +* version: Invoking tarlz. (line 6) + + + +Tag Table: +Node: Top216 +Node: Introduction1207 +Node: Invoking tarlz4032 +Ref: --data-size13076 +Ref: --bsolid17512 +Node: Portable character set23425 +Node: File format24068 +Ref: key_crc3231050 +Ref: ustar-uid-gid34315 +Ref: ustar-mtime35122 +Node: Amendments to pax format37125 +Ref: crc3237834 +Ref: flawed-compat39146 +Node: Program design43228 +Node: Multi-threaded decoding47153 +Ref: mt-extraction50434 +Node: Minimum archive sizes51740 +Node: Examples53867 +Node: Problems56234 +Node: Concept index56789 + +End Tag Table + + +Local Variables: +coding: iso-8859-15 +End: diff --git a/doc/tarlz.texi b/doc/tarlz.texi new file mode 100644 index 0000000..f37164f --- /dev/null +++ b/doc/tarlz.texi @@ -0,0 +1,1356 @@ +\input texinfo @c -*-texinfo-*- +@c %**start of header +@setfilename tarlz.info +@documentencoding ISO-8859-15 +@settitle Tarlz Manual +@finalout +@c %**end of header + +@set UPDATED 3 January 2024 +@set VERSION 0.25 + +@dircategory Archiving +@direntry +* Tarlz: (tarlz). Archiver with multimember lzip compression +@end direntry + + +@ifnothtml +@titlepage +@title Tarlz +@subtitle Archiver with multimember lzip compression +@subtitle for Tarlz version @value{VERSION}, @value{UPDATED} +@author by Antonio Diaz Diaz + +@page +@vskip 0pt plus 1filll +@end titlepage + +@contents +@end ifnothtml + +@ifnottex +@node Top +@top + +This manual is for Tarlz (version @value{VERSION}, @value{UPDATED}). + +@menu +* Introduction:: Purpose and features of tarlz +* Invoking tarlz:: Command-line interface +* Portable character set:: POSIX portable filename character set +* File format:: Detailed format of the compressed archive +* Amendments to pax format:: The reasons for the differences with pax +* Program design:: Internal structure of tarlz +* Multi-threaded decoding:: Limitations of parallel tar decoding +* Minimum archive sizes:: Sizes required for full multi-threaded speed +* Examples:: A small tutorial with examples +* Problems:: Reporting bugs +* Concept index:: Index of concepts +@end menu + +@sp 1 +Copyright @copyright{} 2013-2024 Antonio Diaz Diaz. + +This manual is free documentation: you have unlimited permission to copy, +distribute, and modify it. +@end ifnottex + + +@node Introduction +@chapter Introduction +@cindex introduction + +@uref{http://www.nongnu.org/lzip/tarlz.html,,Tarlz} is a massively parallel +(multi-threaded) combined implementation of the tar archiver and the +@uref{http://www.nongnu.org/lzip/lzip.html,,lzip} compressor. Tarlz uses the +compression library @uref{http://www.nongnu.org/lzip/lzlib.html,,lzlib}. + +Tarlz creates tar archives using a simplified and safer variant of the POSIX +pax format compressed in lzip format, keeping the alignment between tar +members and lzip members. The resulting multimember tar.lz archive is +backward compatible with standard tar tools like GNU tar, which treat it +like any other tar.lz archive. Tarlz can append files to the end of such +compressed archives. + +Keeping the alignment between tar members and lzip members has two +advantages. It adds an indexed lzip layer on top of the tar archive, making +it possible to decode the archive safely in parallel. It also minimizes the +amount of data lost in case of corruption. Compressing a tar archive with +plzip may even double the amount of files lost for each lzip member damaged +because it does not keep the members aligned. + +Tarlz can create tar archives with five levels of compression granularity: +per file (@option{--no-solid}), per block (@option{--bsolid}, default), per +directory (@option{--dsolid}), appendable solid (@option{--asolid}), and +solid (@option{--solid}). It can also create uncompressed tar archives. + +@noindent +Of course, compressing each file (or each directory) individually can't +achieve a compression ratio as high as compressing solidly the whole tar +archive, but it has the following advantages: + +@itemize @bullet +@item +The resulting multimember tar.lz archive can be decompressed in +parallel, multiplying the decompression speed. + +@item +New members can be appended to the archive (by removing the +end-of-archive member), and unwanted members can be deleted from the +archive. Just like an uncompressed tar archive. + +@item +It is a safe POSIX-style backup format. In case of corruption, tarlz +can extract all the undamaged members from the tar.lz archive, +skipping over the damaged members, just like the standard +(uncompressed) tar. Moreover, the option @option{--keep-damaged} can be used +to recover as much data as possible from each damaged member, and +lziprecover can be used to recover some of the damaged members. + +@item +A multimember tar.lz archive is usually smaller than the corresponding +solidly compressed tar.gz archive, except when individually +compressing files smaller than about @w{32 KiB}. +@end itemize + +Tarlz protects the extended records with a Cyclic Redundancy Check (CRC) in +a way compatible with standard tar tools. @xref{crc32}. + +Tarlz does not understand other tar formats like @samp{gnu}, @samp{oldgnu}, +@samp{star}, or @samp{v7}. The command +@w{@samp{tarlz -t -f archive.tar.lz > /dev/null}} can be used to check that +the format of the archive is compatible with tarlz. + + +@node Invoking tarlz +@chapter Invoking tarlz +@cindex invoking +@cindex options +@cindex usage +@cindex version + +The format for running tarlz is: + +@example +tarlz @var{operation} [@var{options}] [@var{files}] +@end example + +@noindent +All operations except @option{--concatenate} and @option{--compress} operate +on whole trees if any @var{file} is a directory. All operations except +@option{--compress} overwrite output files without warning. If no archive is +specified, tarlz tries to read it from standard input or write it to +standard output. Tarlz refuses to read archive data from a terminal or write +archive data to a terminal. Tarlz detects when the archive being created or +enlarged is among the files to be archived, appended, or concatenated, and +skips it. + +Tarlz does not use absolute file names nor file names above the current +working directory (perhaps changed by option @option{-C}). On archive creation +or appending tarlz archives the files specified, but removes from member +names any leading and trailing slashes and any file name prefixes containing +a @samp{..} component. On extraction, leading and trailing slashes are also +removed from member names, and archive members containing a @samp{..} +component in the file name are skipped. Tarlz does not follow symbolic links +during extraction; not even symbolic links replacing intermediate +directories. + +On extraction and listing, tarlz removes leading @samp{./} strings from +member names in the archive or given in the command line, so that +@w{@samp{tarlz -xf foo ./bar baz}} extracts members @samp{bar} and +@samp{./baz} from archive @samp{foo}. + +If several compression levels or @option{--*solid} options are given, the last +setting is used. For example @w{@option{-9 --solid --uncompressed -1}} is +equivalent to @w{@option{-1 --solid}}. + +tarlz supports the following operations: + +@table @code +@item --help +Print an informative help message describing the options and exit. + +@item -V +@itemx --version +Print the version number of tarlz on the standard output and exit. +This version number should be included in all bug reports. + +@item -A +@itemx --concatenate +Append one or more archives to the end of an archive. If no archive is +specified with the option @option{-f}, concatenate the input archives to +standard output. All the archives involved must be regular (seekable) files, +and must be either all compressed or all uncompressed. Compressed and +uncompressed archives can't be mixed. Compressed archives must be +multimember lzip files with the two end-of-archive blocks plus any zero +padding contained in the last lzip member of each archive. The intermediate +end-of-archive blocks are removed as each new archive is concatenated. If +the archive is uncompressed, tarlz parses tar headers until it finds the +end-of-archive blocks. Exit with status 0 without modifying the archive if +no @var{files} have been specified. + +Concatenating archives containing files in common results in two or more tar +members with the same name in the resulting archive, which may produce +nondeterministic behavior during multi-threaded extraction. +@xref{mt-extraction}. + +@item -c +@itemx --create +Create a new archive from @var{files}. + +@item -d +@itemx --diff +Compare and report differences between archive and file system. For each tar +member in the archive, check that the corresponding file in the file system +exists and is of the same type (regular file, directory, etc). Report on +standard output the differences found in type, mode (permissions), owner and +group IDs, modification time, file size, file contents (of regular files), +target (of symlinks) and device number (of block/character special files). + +As tarlz removes leading slashes from member names, the option @option{-C} may +be used in combination with @option{--diff} when absolute file names were used +on archive creation: @w{@samp{tarlz -C / -d}}. Alternatively, tarlz may be +run from the root directory to perform the comparison. + +@item --delete +Delete files and directories from an archive in place. It currently can +delete only from uncompressed archives and from archives with files +compressed individually (@option{--no-solid} archives). Note that files of +about @option{--data-size} or larger are compressed individually even if +@option{--bsolid} is used, and can therefore be deleted. Tarlz takes care to +not delete a tar member unless it is possible to do so. For example it won't +try to delete a tar member that is not compressed individually. Even in the +case of finding a corrupt member after having deleted some member(s), tarlz +stops and copies the rest of the file as soon as corruption is found, +leaving it just as corrupt as it was, but not worse. + +To delete a directory without deleting the files under it, use +@w{@samp{tarlz --delete -f foo --exclude='dir/*' dir}}. Deleting in place +may be dangerous. A corrupt archive, a power cut, or an I/O error may cause +data loss. + +@item -r +@itemx --append +Append files to the end of an archive. The archive must be a regular +(seekable) file either compressed or uncompressed. Compressed members can't +be appended to an uncompressed archive, nor vice versa. If the archive is +compressed, it must be a multimember lzip file with the two end-of-archive +blocks plus any zero padding contained in the last lzip member of the +archive. It is possible to append files to an archive with a different +compression granularity. Appending works as follows; first the +end-of-archive blocks are removed, then the new members are appended, and +finally two new end-of-archive blocks are appended to the archive. If the +archive is uncompressed, tarlz parses and skips tar headers until it finds +the end-of-archive blocks. Exit with status 0 without modifying the archive +if no @var{files} have been specified. + +Appending files already present in the archive results in two or more tar +members with the same name, which may produce nondeterministic behavior +during multi-threaded extraction. @xref{mt-extraction}. + +@item -t +@itemx --list +List the contents of an archive. If @var{files} are given, list only the +@var{files} given. + +@item -x +@itemx --extract +Extract files from an archive. If @var{files} are given, extract only the +@var{files} given. Else extract all the files in the archive. To extract a +directory without extracting the files under it, use +@w{@samp{tarlz -xf foo --exclude='dir/*' dir}}. Tarlz removes files and +empty directories unconditionally before extracting over them. Other than +that, it does not make any special effort to extract a file over an +incompatible type of file. For example, extracting a file over a non-empty +directory usually fails. + +@item -z +@itemx --compress +Compress existing POSIX tar archives aligning the lzip members to the tar +members with choice of granularity (@option{--bsolid} by default, +@option{--dsolid} works like @option{--asolid}). Exit with error status 2 if +any input archive is an empty file. The input archives are kept unchanged. +Existing compressed archives are not overwritten. A hyphen @samp{-} used as +the name of an input archive reads from standard input and writes to +standard output (unless the option @option{--output} is used). Tarlz can be +used as compressor for GNU tar by using a command like +@w{@samp{tar -c -Hustar foo | tarlz -z -o foo.tar.lz}}. Tarlz can be used as +compressor for zupdate (zutils) by using a command like +@w{@samp{zupdate --lz="tarlz -z" foo.tar.gz}}. Note that tarlz only works +reliably on archives without global headers, or with global headers whose +content can be ignored. + +The compression is reversible, including any garbage present after the +end-of-archive blocks. Tarlz stops parsing after the first end-of-archive +block is found, and then compresses the rest of the archive. Unless solid +compression is requested, the end-of-archive blocks are compressed in a lzip +member separated from the preceding members and from any non-zero garbage +following the end-of-archive blocks. @option{--compress} implies plzip +argument style, not tar style. Each input archive is compressed to a file +with the extension @samp{.lz} added unless the option @option{--output} is +used. When @option{--output} is used, only one input archive can be specified. +@option{-f} can't be used with @option{--compress}. + +@item --check-lib +Compare the +@uref{http://www.nongnu.org/lzip/manual/lzlib_manual.html#Library-version,,version of lzlib} +used to compile tarlz with the version actually being used at run time and +exit. Report any differences found. Exit with error status 1 if differences +are found. A mismatch may indicate that lzlib is not correctly installed or +that a different version of lzlib has been installed after compiling tarlz. +Exit with error status 2 if LZ_API_VERSION and LZ_version_string don't +match. @w{@samp{tarlz -v --check-lib}} shows the version of lzlib being used +and the value of LZ_API_VERSION (if defined). +@ifnothtml +@xref{Library version,,,lzlib}. +@end ifnothtml + +@end table + +tarlz supports the following +@uref{http://www.nongnu.org/arg-parser/manual/arg_parser_manual.html#Argument-syntax,,options}: +@ifnothtml +@xref{Argument syntax,,,arg_parser}. +@end ifnothtml + +@table @code +@anchor{--data-size} +@item -B @var{bytes} +@itemx --data-size=@var{bytes} +Set target size of input data blocks for the option @option{--bsolid}. +@xref{--bsolid}. Valid values range from @w{8 KiB} to @w{1 GiB}. Default +value is two times the dictionary size, except for option @option{-0} where it +defaults to @w{1 MiB}. @xref{Minimum archive sizes}. + +@item -C @var{dir} +@itemx --directory=@var{dir} +Change to directory @var{dir}. When creating, appending, comparing, or +extracting, the position of each @option{-C} option in the command line is +significant; it changes the current working directory for the following +@var{files} until a new @option{-C} option appears in the command line. +@option{--list} and @option{--delete} ignore any @option{-C} options +specified. @var{dir} is relative to the then current working directory, +perhaps changed by a previous @option{-C} option. + +Note that a process can only have one current working directory (CWD). +Therefore multi-threading can't be used to create or decode an archive if a +@option{-C} option appears after a (relative) file name in the command line. +(All file names are made relative when decoding). + +@item -f @var{archive} +@itemx --file=@var{archive} +Use archive file @var{archive}. A hyphen @samp{-} used as an @var{archive} +argument reads from standard input or writes to standard output. + +@item -h +@itemx --dereference +Follow symbolic links during archive creation, appending or comparison. +Archive or compare the files they point to instead of the links themselves. + +@item -n @var{n} +@itemx --threads=@var{n} +Set the number of (de)compression threads, overriding the system's default. +Valid values range from 0 to "as many as your system can support". A value +of 0 disables threads entirely. If this option is not used, tarlz tries to +detect the number of processors in the system and use it as default value. +@w{@samp{tarlz --help}} shows the system's default value. See the note about +multi-threading in the option @option{-C} above. + +Note that the number of usable threads is limited during compression to +@w{ceil( uncompressed_size / data_size )} (@pxref{Minimum archive sizes}), +and during decompression to the number of lzip members in the tar.lz +archive, which you can find by running @w{@samp{lzip -lv archive.tar.lz}}. + +@item -o @var{file} +@itemx --output=@var{file} +Write the compressed output to @var{file}. @w{@option{-o -}} writes the +compressed output to standard output. Currently @option{--output} only works +with @option{--compress}. + +@item -p +@itemx --preserve-permissions +On extraction, set file permissions as they appear in the archive. This is +the default behavior when tarlz is run by the superuser. The default for +other users is to subtract the umask of the user running tarlz from the +permissions specified in the archive. + +@item -q +@itemx --quiet +Quiet operation. Suppress all messages. + +@item -v +@itemx --verbose +Verbosely list files processed. Further -v's (up to 4) increase the +verbosity level. + +@item -0 .. -9 +Set the compression level for @option{--create}, @option{--append}, and +@option{--compress}. The default compression level is @option{-6}. Like lzip, +tarlz also minimizes the dictionary size of the lzip members it creates, +reducing the amount of memory required for decompression. + +@multitable {Level} {Dictionary size} {Match length limit} +@item Level @tab Dictionary size @tab Match length limit +@item -0 @tab 64 KiB @tab 16 bytes +@item -1 @tab 1 MiB @tab 5 bytes +@item -2 @tab 1.5 MiB @tab 6 bytes +@item -3 @tab 2 MiB @tab 8 bytes +@item -4 @tab 3 MiB @tab 12 bytes +@item -5 @tab 4 MiB @tab 20 bytes +@item -6 @tab 8 MiB @tab 36 bytes +@item -7 @tab 16 MiB @tab 68 bytes +@item -8 @tab 24 MiB @tab 132 bytes +@item -9 @tab 32 MiB @tab 273 bytes +@end multitable + +@item --uncompressed +With @option{--create}, don't compress the tar archive created. Create an +uncompressed tar archive instead. With @option{--append}, don't compress the +new members appended to the tar archive. Compressed members can't be +appended to an uncompressed archive, nor vice versa. @option{--uncompressed} +can be omitted if it can be deduced from the archive name. (An uncompressed +archive name lacks a @samp{.lz} or @samp{.tlz} extension). + +@item --asolid +When creating or appending to a compressed archive, use appendable solid +compression. All the files being added to the archive are compressed into a +single lzip member, but the end-of-archive blocks are compressed into a +separate lzip member. This creates a solidly compressed appendable archive. +Solid archives can't be created nor decoded in parallel. + +@anchor{--bsolid} +@item --bsolid +When creating or appending to a compressed archive, use block compression. +Tar members are compressed together in a lzip member until they approximate +a target uncompressed size. The size can't be exact because each solidly +compressed data block must contain an integer number of tar members. Block +compression is the default because it improves compression ratio for +archives with many files smaller than the block size. This option allows +tarlz revert to default behavior if, for example, it is invoked through an +alias like @w{@samp{tar='tarlz --solid'}}. @xref{--data-size}, to set the +target block size. + +@item --dsolid +When creating or appending to a compressed archive, compress each file +specified in the command line separately in its own lzip member, and use +solid compression for each directory specified in the command line. The +end-of-archive blocks are compressed into a separate lzip member. This +creates a compressed appendable archive with a separate lzip member for each +file or top-level directory specified. + +@item --no-solid +When creating or appending to a compressed archive, compress each file +separately in its own lzip member. The end-of-archive blocks are compressed +into a separate lzip member. This creates a compressed appendable archive +with a lzip member for each file. + +@item --solid +When creating or appending to a compressed archive, use solid compression. +The files being added to the archive, along with the end-of-archive blocks, +are compressed into a single lzip member. The resulting archive is not +appendable. No more files can be later appended to the archive. Solid +archives can't be created nor decoded in parallel. + +@item --anonymous +Equivalent to @w{@option{--owner=root --group=root}}. + +@item --owner=@var{owner} +When creating or appending, use @var{owner} for files added to the archive. +If @var{owner} is not a valid user name, it is decoded as a decimal numeric +user ID. + +@item --group=@var{group} +When creating or appending, use @var{group} for files added to the archive. +If @var{group} is not a valid group name, it is decoded as a decimal numeric +group ID. + +@item --exclude=@var{pattern} +Exclude files matching a shell pattern like @samp{*.o}. A file is considered +to match if any component of the file name matches. For example, @samp{*.o} +matches @samp{foo.o}, @samp{foo.o/bar} and @samp{foo/bar.o}. If +@var{pattern} contains a @samp{/}, it matches a corresponding @samp{/} in +the file name. For example, @samp{foo/*.o} matches @samp{foo/bar.o}. +Multiple @option{--exclude} options can be specified. + +@item --ignore-ids +Make @option{--diff} ignore differences in owner and group IDs. This option is +useful when comparing an @option{--anonymous} archive. + +@item --ignore-metadata +Make @option{--diff} ignore any differences in metadata (file permissions, +owner and group IDs, modification time). Compare only file type, file size, +and file content. This option is useful when file permissions have not been +fully restored because uid/gid changed on extraction. + +@item --ignore-overflow +Make @option{--diff} ignore differences in mtime caused by overflow on 32-bit +systems with a 32-bit time_t. + +@item --keep-damaged +Don't delete partially extracted files. If a decompression error happens +while extracting a file, keep the partial data extracted. Use this option to +recover as much data as possible from each damaged member. It is recommended +to run tarlz in single-threaded mode (@option{--threads=0}) when using this +option. + +@item --missing-crc +Exit with error status 2 if the CRC of the extended records is missing. When +this option is used, tarlz detects any corruption in the extended records +(only limited by CRC collisions). But note that a corrupt @samp{GNU.crc32} +keyword, for example @samp{GNU.crc30}, is reported as a missing CRC instead +of as a corrupt record. This misleading @w{@samp{Missing CRC}} message is +the consequence of a flaw in the POSIX pax format; i.e., the lack of a +mandatory check sequence of the extended records. @xref{crc32}. + +@item --mtime=@var{date} +When creating or appending, use @var{date} as the modification time for +files added to the archive instead of their actual modification times. The +value of @var{date} may be either @samp{@@} followed by the number of +seconds since (or before) the epoch, or a date in format +@w{@samp{[-]YYYY-MM-DD HH:MM:SS}} or @samp{[-]YYYY-MM-DDTHH:MM:SS}, or the +name of an existing reference file starting with @samp{.} or @samp{/} whose +modification time is used. The time of day @samp{HH:MM:SS} in the date +format is optional and defaults to @samp{00:00:00}. The epoch is +@w{@samp{1970-01-01 00:00:00 UTC}}. Negative seconds or years define a +modification time before the epoch. + +@item --out-slots=@var{n} +Number of @w{1 MiB} output packets buffered per worker thread during +multi-threaded creation or appending to compressed archives. Increasing the +number of packets may increase compression speed if the files being archived +are larger than @w{64 MiB} compressed, but requires more memory. Valid +values range from 1 to 1024. The default value is 64. + +@item --warn-newer +During archive creation, warn if any file being archived has a modification +time newer than the archive creation time. This option may slow archive +creation somewhat because it makes an extra call to @samp{stat} after +archiving each file, but it guarantees that file contents were not modified +during the creation of the archive. Note that the file must be at least one +second newer than the archive for it to be detected as newer. + +@ignore +@item --permissive +Allow some violations of the archive format, like consecutive extended +headers preceding a ustar header, or several records with the same +keyword appearing in the same block of extended records. +@end ignore + +@end table + +Exit status: 0 for a normal exit, 1 for environmental problems +(file not found, files differ, invalid command-line options, I/O errors, +etc), 2 to indicate a corrupt or invalid input file, 3 for an internal +consistency error (e.g., bug) which caused tarlz to panic. + + +@node Portable character set +@chapter POSIX portable filename character set +@cindex portable character set + +The set of characters from which portable file names are constructed. + +@example +A B C D E F G H I J K L M N O P Q R S T U V W X Y Z +a b c d e f g h i j k l m n o p q r s t u v w x y z +0 1 2 3 4 5 6 7 8 9 . _ - +@end example + +The last three characters are the period, underscore, and hyphen-minus +characters, respectively. + +File names are identifiers. Therefore, archiving works better when file +names use only the portable character set without spaces added. + + +@node File format +@chapter File format +@cindex file format + +In the diagram below, a box like this: + +@verbatim ++---+ +| | <-- the vertical bars might be missing ++---+ +@end verbatim + +represents one byte; a box like this: + +@verbatim ++==============+ +| | ++==============+ +@end verbatim + +represents a variable number of bytes or a fixed but large number of +bytes (for example 512). + +@sp 1 +A tar.lz file consists of one or more lzip members (compressed data sets). +The members simply appear one after another in the file, with no additional +information before, between, or after them. + +Each lzip member contains one or more tar members in a simplified POSIX pax +interchange format. The only pax typeflag value supported by tarlz (in +addition to the typeflag values defined by the ustar format) is @samp{x}. +The pax format is an extension on top of the ustar format that removes the +size limitations of the ustar format. + +Each tar member contains one file archived, and is represented by the +following sequence: + +@itemize @bullet +@item +An optional extended header block followed by one or more blocks that +contain the extended header records as if they were the contents of a file; +i.e., the extended header records are included as the data for this header +block. This header block is of the form described in pax header block, with +a typeflag value of @samp{x}. + +@item +A header block in ustar format that describes the file. Any fields defined +in the preceding optional extended header records override the associated +fields in this header block for this file. + +@item +Zero or more blocks that contain the contents of the file. +@end itemize + +Each tar member must be contiguously stored in a lzip member for the +parallel decoding operations like @option{--list} to work. If any tar member +is split over two or more lzip members, the archive must be decoded +sequentially. @xref{Multi-threaded decoding}. + +At the end of the archive file there are two 512-byte blocks filled with +binary zeros, interpreted as an end-of-archive indicator. These EOA blocks +are either compressed in a separate lzip member or compressed along with the +tar members contained in the last lzip member. For a compressed archive to +be recognized by tarlz as appendable, the last lzip member must contain +between 512 and 32256 zeros alone (without any non-zero bytes). + +The diagram below shows the correspondence between each tar member (formed +by one or two headers plus optional data) in the tar archive and each +@uref{http://www.nongnu.org/lzip/manual/lzip_manual.html#File-format,,lzip member} +in the resulting multimember tar.lz archive, when per file compression is +used: +@ifnothtml +@xref{File format,,,lzip}. +@end ifnothtml + +@verbatim +tar ++========+======+=================+===============+========+======+========+ +| header | data | extended header | extended data | header | data | EOA | ++========+======+=================+===============+========+======+========+ + +tar.lz ++===============+=================================================+========+ +| member | member | member | ++===============+=================================================+========+ +@end verbatim + +@ignore +When @option{--permissive} is used, the following violations of the +archive format are allowed:@* +If several extended headers precede an ustar header, only the last +extended header takes effect. The other extended headers are ignored. +Similarly, if several records with the same keyword appear in the same +block of extended records, only the last record for the repeated keyword +takes effect. The other records for the repeated keyword are ignored.@* +A global header inserted between an extended header and an ustar header.@* +An extended header just before the end-of-archive blocks. +@end ignore + +@sp 1 +@section Pax header block + +The pax header block is identical to the ustar header block described below +except that the typeflag has the value @samp{x} (extended). The field +@samp{size} is the size of the extended header data in bytes. Most other +fields in the pax header block are zeroed on archive creation to prevent +trouble if the archive is read by an ustar tool, and are ignored by tarlz on +archive extraction. @xref{flawed-compat}. + +Tarlz limits the size of the pax extended header data so that the whole +header set (extended header + extended data + ustar header) can be read and +decoded in a buffer of size INT_MAX. + +The pax extended header data consists of one or more records, each of +them constructed as follows:@* +@w{@samp{"%d %s=%s\n", <length>, <keyword>, <value>}} + +The fields <length> and <keyword> in the record must be limited to the +portable character set (@pxref{Portable character set}). The field <length> +contains the decimal length of the record in bytes, including the trailing +newline. The field <value> is stored as-is, without conversion to UTF-8 nor +any other transformation. The fields are separated by the ASCII characters +space, equal-sign, and newline. + +These are the <keyword> values currently supported by tarlz: + +@table @code +@item atime +The signed decimal representation of the access time of the following file +in seconds since (or before) the epoch, obtained from the function +@samp{stat}. The atime record is created only for files with a modification +time outside of the ustar range. @xref{ustar-mtime}. + +@item gid +The unsigned decimal representation of the group ID of the group that owns +the following file. The gid record is created only for files with a group ID +greater than 2_097_151 @w{(octal 7_777_777)}. @xref{ustar-uid-gid}. + +@item linkpath +The file name of a link being created to another file, of any type, +previously archived. This record overrides the field @samp{linkname} in the +following ustar header block. The following ustar header block determines +the type of link created. If typeflag of the following header block is 1, a +hard link is created. If typeflag is 2, a symbolic link is created and the +linkpath value is used as the contents of the symbolic link. The linkpath +record is created only for links with a link name that does not fit in the +space provided by the ustar header. + +@item mtime +The signed decimal representation of the modification time of the following +file in seconds since (or before) the epoch, obtained from the function +@samp{stat}. This record overrides the field @samp{mtime} in the following +ustar header block. The mtime record is created only for files with a +modification time outside of the ustar range. @xref{ustar-mtime}. + +@item path +The file name of the following file. This record overrides the fields +@samp{name} and @samp{prefix} in the following ustar header block. The path +record is created for files with a name that does not fit in the space +provided by the ustar header, but is also created for files that require any +other extended record so that the fields @samp{name} and @samp{prefix} in +the following ustar header block can be zeroed. + +@item size +The size of the file in bytes, expressed as a decimal number using digits +from the ISO/IEC 646:1991 (ASCII) standard. This record overrides the field +@samp{size} in the following ustar header block. The size record is created +only for files with a size value greater than 8_589_934_591 +@w{(octal 77_777_777_777)}; that is, @w{8 GiB} (2^33 bytes) or larger. + +@item uid +The unsigned decimal representation of the user ID of the file owner of the +following file. The uid record is created only for files with a user ID +greater than 2_097_151 @w{(octal 7_777_777)}. @xref{ustar-uid-gid}. + +@anchor{key_crc32} +@item GNU.crc32 +CRC32-C (Castagnoli) of the extended header data excluding the 8 bytes +representing the CRC <value> itself. The <value> is represented as 8 +hexadecimal digits in big endian order, +@w{@samp{22 GNU.crc32=00000000\n}}. The keyword of the CRC record is +protected by the CRC to guarantee that corruption is always detected when +using @option{--missing-crc} (except in case of CRC collision). A CRC was +chosen because a checksum is too weak for a potentially large list of +variable sized records. A checksum can't detect simple errors like the +swapping of two bytes. + +@end table + +At verbosity level 1 or higher tarlz prints a diagnostic for each unknown +extended header keyword found in an archive, once per keyword. + +@sp 1 +@section Ustar header block + +The ustar header block has a length of 512 bytes and is structured as +shown in the following table. All lengths and offsets are in decimal. + +@multitable {Field Name} {Offset} {Length (in bytes)} +@item Field Name @tab Offset @tab Length (in bytes) +@item name @tab 0 @tab 100 +@item mode @tab 100 @tab 8 +@item uid @tab 108 @tab 8 +@item gid @tab 116 @tab 8 +@item size @tab 124 @tab 12 +@item mtime @tab 136 @tab 12 +@item chksum @tab 148 @tab 8 +@item typeflag @tab 156 @tab 1 +@item linkname @tab 157 @tab 100 +@item magic @tab 257 @tab 6 +@item version @tab 263 @tab 2 +@item uname @tab 265 @tab 32 +@item gname @tab 297 @tab 32 +@item devmajor @tab 329 @tab 8 +@item devminor @tab 337 @tab 8 +@item prefix @tab 345 @tab 155 +@end multitable + +All characters in the header block are coded using the ISO/IEC 646:1991 +(ASCII) standard, except in fields storing names for files, users, and +groups. For maximum portability between implementations, names should only +contain characters from the portable character set (@pxref{Portable +character set}), but if an implementation supports the use of characters +outside of @samp{/} and the portable character set in names for files, +users, and groups, tarlz will use the byte values in these names unmodified. + +The fields @samp{name}, @samp{linkname}, and @samp{prefix} are +null-terminated character strings except when all characters in the array +contain non-null characters including the last character. + +The fields @samp{name} and @samp{prefix} produce the file name. A new file +name is formed, if prefix is not an empty string (its first character is not +null), by concatenating prefix (up to the first null character), a slash +character, and name; otherwise, name is used alone. In either case, name is +terminated at the first null character. If prefix begins with a null +character, it is ignored. In this manner, file names of at most 256 +characters can be supported. If a file name does not fit in the space +provided, an extended record is used to store the file name. + +The field @samp{linkname} does not use the prefix to produce a file name. If +the link name does not fit in the 100 characters provided, an extended +record is used to store the link name. + +The field @samp{mode} provides 12 access permission bits. The following +table shows the symbolic name of each bit and its octal value: + +@multitable {Bit Name} {Value} {Bit Name} {Value} {Bit Name} {Value} +@headitem Bit Name @tab Value @tab Bit Name @tab Value @tab Bit Name @tab Value +@item S_ISUID @tab 04000 @tab S_ISGID @tab 02000 @tab S_ISVTX @tab 01000 +@item S_IRUSR @tab 00400 @tab S_IWUSR @tab 00200 @tab S_IXUSR @tab 00100 +@item S_IRGRP @tab 00040 @tab S_IWGRP @tab 00020 @tab S_IXGRP @tab 00010 +@item S_IROTH @tab 00004 @tab S_IWOTH @tab 00002 @tab S_IXOTH @tab 00001 +@end multitable + +@anchor{ustar-uid-gid} +The fields @samp{uid} and @samp{gid} are the user and group IDs of the owner +and group of the file, respectively. If the file uid or gid are greater than +2_097_151 @w{(octal 7_777_777)}, an extended record is used to store the uid +or gid. + +The field @samp{size} contains the octal representation of the size of the +file in bytes. If the field @samp{typeflag} specifies a file of type '0' +(regular file) or '7' (high performance regular file), the number of logical +records following the header is @w{(size / 512)} rounded to the next +integer. For all other values of typeflag, tarlz either sets the size field +to 0 or ignores it, and does not store or expect any logical records +following the header. If the file size is larger than 8_589_934_591 bytes +@w{(octal 77_777_777_777)}, an extended record is used to store the file size. + +@anchor{ustar-mtime} +The field @samp{mtime} contains the octal representation of the modification +time of the file at the time it was archived, obtained from the function +@samp{stat}. If the modification time is negative or larger than +8_589_934_591 @w{(octal 77_777_777_777)} seconds since the epoch, an extended +record is used to store the modification time. The ustar range of mtime goes +from @w{@samp{1970-01-01 00:00:00 UTC}} to @w{@samp{2242-03-16 12:56:31 UTC}}. + +The field @samp{chksum} contains the octal representation of the value of +the simple sum of all bytes in the header logical record. Each byte in the +header is treated as an unsigned value. When calculating the checksum, the +chksum field is treated as if it were all space characters. + +The field @samp{typeflag} contains a single character specifying the type of +file archived: + +@table @code +@item '0' +Regular file. + +@item '1' +Hard link to another file, of any type, previously archived. Hard links must +not contain file data. + +@item '2' +Symbolic link. + +@item '3', '4' +Character special file and block special file respectively. In this case the +fields @samp{devmajor} and @samp{devminor} contain information defining the +device in unspecified format. + +@item '5' +Directory. + +@item '6' +FIFO special file. + +@item '7' +Reserved to represent a file to which an implementation has associated some +high-performance attribute (contiguous file). Tarlz treats this type of file +as a regular file (type 0). + +@end table + +The field @samp{magic} contains the ASCII null-terminated string "ustar". +The field @samp{version} contains the characters "00" (0x30,0x30). The +fields @samp{uname} and @samp{gname} are null-terminated character strings +except when all characters in the array contain non-null characters +including the last character. Each numeric field contains a leading space- +or zero-filled, optionally null-terminated octal number using digits from +the ISO/IEC 646:1991 (ASCII) standard. Tarlz is able to decode numeric +fields 1 byte longer than standard ustar by not requiring a terminating null +character. + + +@node Amendments to pax format +@chapter The reasons for the differences with pax +@cindex Amendments to pax format + +Tarlz creates safe archives that allow the reliable detection of invalid or +corrupt metadata during decoding even when the integrity checking of lzip +can't be used because the lzip members are only decompressed partially, as +it happens in parallel @option{--diff}, @option{--list}, and @option{--extract}. +In order to achieve this goal and avoid some other flaws in the pax format, +tarlz makes some changes to the variant of the pax format that it uses. This +chapter describes these changes and the concrete reasons to implement them. + +@sp 1 +@anchor{crc32} +@section Add a CRC of the extended records + +The POSIX pax format has a serious flaw. The metadata stored in pax extended +records are not protected by any kind of check sequence. Corruption in a +long file name may cause the extraction of the file in the wrong place +without warning. Corruption in a large file size may cause the truncation of +the file or the appending of garbage to the file, both followed by a +spurious warning about a corrupt header far from the place of the undetected +corruption. + +Metadata like file name and file size must be always protected in an archive +format because of the adverse effects of undetected corruption in them, +potentially much worse that undetected corruption in the data. Even more so +in the case of pax because the amount of metadata it stores is potentially +large, making undetected corruption and archiver misbehavior more probable. + +Headers and metadata must be protected separately from data because the +integrity checking of lzip may not be able to detect the corruption before +the metadata have been used, for example, to create a new file in the wrong +place. + +Because of the above, tarlz protects the extended records with a Cyclic +Redundancy Check (CRC) in a way compatible with standard tar tools. +@xref{key_crc32}. + +@sp 1 +@anchor{flawed-compat} +@section Remove flawed backward compatibility + +In order to allow the extraction of pax archives by a tar utility conforming +to the POSIX-2:1993 standard, POSIX.1-2008 recommends selecting extended +header field values that allow such tar to create a regular file containing +the extended header records as data. This approach is broken because if the +extended header is needed because of a long file name, the fields +@samp{name} and @samp{prefix} are unable to contain the full file name. +(Some tar implementations store the truncated name in the field @samp{name} +alone, truncating the name to only 100 bytes instead of 256). Therefore the +files corresponding to both the extended header and the overridden ustar +header are extracted using truncated file names, perhaps overwriting +existing files or directories. It may be a security risk to extract a file +with a truncated file name. + +To avoid this problem, tarlz writes extended headers with all fields zeroed +except @samp{size} (which contains the size of the extended records), +@samp{chksum}, @samp{typeflag}, @samp{magic}, and @samp{version}. In +particular, tarlz sets the fields @samp{name} and @samp{prefix} to zero. +This prevents old tar programs from extracting the extended records as a +file in the wrong place. Tarlz also sets to zero those fields of the ustar +header overridden by extended records. Finally, tarlz skips members with +zeroed @samp{name} and @samp{prefix} when decoding, except when listing. +This is needed to detect certain format violations during parallel +extraction. + +If an extended header is required for any reason (for example a file size of +@w{8 GiB} or larger, or a link name longer than 100 bytes), tarlz also moves +the file name to the extended records to prevent an ustar tool from trying +to extract the file or link. This also makes easier during parallel decoding +the detection of a tar member split between two lzip members at the boundary +between the extended header and the ustar header. + +@sp 1 +@section As simple as possible (but not simpler) + +The tarlz format is mainly ustar. Extended pax headers are used only when +needed because the length of a file name or link name, or the size or other +attribute of a file exceed the limits of the ustar format. Adding @w{1 KiB} +of extended header and records to each member just to save subsecond +timestamps seems wasteful for a backup format. Moreover, minimizing the +overhead may help recovering the archive with lziprecover in case of +corruption. + +Global pax headers are tolerated, but not supported; they are parsed and +ignored. Some operations may not behave as expected if the archive contains +global headers. + +@sp 1 +@section Improve reproducibility + +Pax includes by default the process ID of the pax process in the ustar name +of the extended headers, making the archive not reproducible. Tarlz stores +the true name of the file just once, either in the ustar header or in the +extended records, making it easier to produce reproducible archives. + +Pax allows an extended record to have length x-1 or x if x is a power of +ten; @samp{99<97_bytes>} or @samp{100<97_bytes>}. Tarlz minimizes the length +of the record and always produces a length of x-1 in these cases. + +@sp 1 +@section No data in hard links + +Tarlz does not allow data in hard link members. The data (if any) must be in +the member determining the type of the file (which can't be a link). If all +the names of a file are stored as hard links, the type of the file is lost. +Not allowing data in hard links also prevents invalid actions like +extracting file data for a hard link to a symbolic link or to a directory. + +@sp 1 +@section Avoid misconversions to/from UTF-8 + +There is no portable way to tell what charset a text string is coded into. +Therefore, tarlz stores all fields representing text strings unmodified, +without conversion to UTF-8 nor any other transformation. This prevents +accidental double UTF-8 conversions. If the need arises this behavior will +be adjusted with a command-line option in the future. + + +@node Program design +@chapter Internal structure of tarlz +@cindex program design + +The parts of tarlz related to sequential processing of the archive are more +or less similar to any other tar and won't be described here. The interesting +parts described here are those related to Multi-threaded processing. + +The structure of the part of tarlz performing Multi-threaded archive +creation is somewhat similar to that of +@uref{http://www.nongnu.org/lzip/plzip.html#Program-design,,plzip} with the +added complication of the solidity levels. +@ifnothtml +@xref{Program design,,,plzip}. +@end ifnothtml +A grouper thread and several worker threads are created, acting the main +thread as muxer (multiplexer) thread. A "packet courier" takes care of data +transfers among threads and limits the maximum number of data blocks +(packets) being processed simultaneously. + +The grouper traverses the directory tree, groups together the metadata of +the files to be archived in each lzip member, and distributes them to the +workers. The workers compress the metadata received from the grouper along +with the file data read from the file system. The muxer collects processed +packets from the workers, and writes them to the archive. + +@verbatim +.--------. +| data|---> to each worker below +| | .------------. +| file | ,-->| worker 0 |--, +| system | | `------------' | +| | .---------. | .------------. | .-------. .---------. +|metadata|--->| grouper |-+-->| worker 1 |--+-->| muxer |-->| archive | +`--------' `---------' | `------------' | `-------' `---------' + | ... | + | .------------. | + `-->| worker N-1 |--' + `------------' +@end verbatim + +Decoding an archive is somewhat similar to how plzip decompresses a regular +file to standard output, with the differences that it is not the data but +only messages what is written to stdout/stderr, and that each worker may +access files in the file system either to read them (diff) or write them +(extract). As in plzip, each worker reads members directly from the archive. + +@verbatim +.--------. +| file |<---> data to/from each worker below +| system | +`--------' .------------. + ,-->| worker 0 |--, + | `------------' | +.---------. | .------------. | .-------. .--------. +| archive |-+-->| worker 1 |--+-->| muxer |-->| stdout | +`---------' | `------------' | `-------' | stderr | + | ... | `--------' + | .------------. | + `-->| worker N-1 |--' + `------------' +@end verbatim + +As misaligned tar.lz archives can't be decoded in parallel, and the +misalignment can't be detected until after decoding has started, a +"mastership request" mechanism has been designed that allows the decoding to +continue instead of signalling an error. + +During parallel decoding, if a worker finds a misalignment, it requests +mastership to decode the rest of the archive. When mastership is requested, +an error_member_id is set, and all subsequently received packets with +member_id > error_member_id are rejected. All workers requesting mastership +are blocked at the request_mastership call until mastership is granted. +Mastership is granted to the delivering worker when its queue is empty to +make sure that all preceding packets have been processed. When mastership is +granted, all packets are deleted and all subsequently received packets not +coming from the master are rejected. + +If a worker can't continue decoding for any cause (for example lack of +memory or finding a split tar member at the beginning of a lzip member), it +requests mastership to print an error and terminate the program. Only if +some other worker requests mastership in a previous lzip member can this +error be avoided. + + +@node Multi-threaded decoding +@chapter Limitations of parallel tar decoding +@cindex parallel tar decoding + +Safely decoding an arbitrary tar archive in parallel is only possible if one +decodes the headers sequentially first. For example, if a tar archive +containing another tar archive is decoded starting from some position other +than the beginning, there is no way to know if the first header found there +belongs to the outer tar archive or to the inner tar archive. Tar is a +format inherently serial; it was designed for tapes. + +The pax format is even more serial than the ustar format. Two headers need +to be decoded sequentially for each file. The extended header may even need +parsing to reveal something as basic as file size. If a thread decodes the +ustar header skipping the preceding extended header, it may extract a file +of incorrect size at the wrong place. Moreover, a pax archive with global +headers can't be decoded in parallel because each thread can't know about +the global headers decoded by other threads. + +In the case of compressed tar archives, the start of each compressed block +determines one point through which the tar archive can be decoded in +parallel. Therefore, in tar.lz archives the decoding operations can't be +parallelized if the tar members are not aligned with the lzip members. Tar +archives compressed with plzip can't be decoded in parallel because tar and +plzip do not have a way to align both sets of members. Certainly one can +decompress one such archive with a multi-threaded tool like plzip, but the +increase in speed is not as large as it could be because plzip must +serialize the decompressed data and pass them to tar, which decodes them +sequentially, one tar member at a time. + +On the other hand, if the tar.lz archive is created with a tool like tarlz, +which can guarantee the alignment between tar members and lzip members +because it controls both archiving and compression, then the lzip format +becomes an indexed layer on top of the tar archive which makes possible +decoding it safely in parallel. + +Tarlz is able to automatically decode aligned and unaligned multimember +tar.lz archives, keeping backwards compatibility. If tarlz finds a member +misalignment during multi-threaded decoding, it switches to single-threaded +mode and continues decoding the archive. + +If the files in the archive are large, multi-threaded @option{--list} on a +regular (seekable) tar.lz archive can be hundreds of times faster than +sequential @option{--list} because, in addition to using several processors, +it only needs to decompress part of each lzip member. See the following +example listing the Silesia corpus on a dual core machine: + +@example +tarlz -9 --no-solid -cf silesia.tar.lz silesia +time lzip -cd silesia.tar.lz | tar -tf - (5.032s) +time plzip -cd silesia.tar.lz | tar -tf - (3.256s) +time tarlz -tf silesia.tar.lz (0.020s) +@end example + +On the other hand, multi-threaded @option{--list} won't detect corruption in +the tar member data because it only decodes the part of each lzip member +corresponding to the tar member header. This is another reason why the tar +headers must provide their own integrity checking. + +@sp 1 +@anchor{mt-extraction} +@section Limitations of multi-threaded extraction + +Multi-threaded extraction may produce different output than single-threaded +extraction in some cases: + +During multi-threaded extraction, several independent threads are +simultaneously reading the archive and creating files in the file system. +The archive is not read sequentially. As a consequence, any error or +weirdness in the archive (like a corrupt member or an end-of-archive block +in the middle of the archive) won't be usually detected until part of the +archive beyond that point has been processed. + +If the archive contains two or more tar members with the same name, +single-threaded extraction extracts the members in the order they appear in +the archive and leaves in the file system the last version of the file. But +multi-threaded extraction may extract the members in any order and leave in +the file system any version of the file nondeterministically. It is +unspecified which of the tar members is extracted. + +If the same file is extracted through several paths (different member names +resolve to the same file in the file system), the result is undefined. +(Probably the resulting file will be mangled). + +Extraction of a hard link may fail if it is extracted before the file it +links to. + + +@node Minimum archive sizes +@chapter Minimum archive sizes required for multi-threaded block compression +@cindex minimum archive sizes + +When creating or appending to a compressed archive using multi-threaded +block compression, tarlz puts tar members together in blocks and compresses +as many blocks simultaneously as worker threads are chosen, creating a +multimember compressed archive. + +For this to work as expected (and roughly multiply the compression speed by +the number of available processors), the uncompressed archive must be at +least as large as the number of worker threads times the block size +(@pxref{--data-size}). Else some processors do not get any data to compress, +and compression is proportionally slower. The maximum speed increase +achievable on a given archive is limited by the ratio +@w{(uncompressed_size / data_size)}. For example, a tarball the size of gcc +or linux scales up to 10 or 14 processors at level -9. + +The following table shows the minimum uncompressed archive size needed for +full use of N processors at a given compression level, using the default +data size for each level: + +@multitable {Processors} {512 MiB} {512 MiB} {512 MiB} {512 MiB} {512 MiB} {512 MiB} +@headitem Processors @tab 2 @tab 4 @tab 8 @tab 16 @tab 64 @tab 256 +@item Level +@item -0 @tab 2 MiB @tab 4 MiB @tab 8 MiB @tab 16 MiB @tab 64 MiB @tab 256 MiB +@item -1 @tab 4 MiB @tab 8 MiB @tab 16 MiB @tab 32 MiB @tab 128 MiB @tab 512 MiB +@item -2 @tab 6 MiB @tab 12 MiB @tab 24 MiB @tab 48 MiB @tab 192 MiB @tab 768 MiB +@item -3 @tab 8 MiB @tab 16 MiB @tab 32 MiB @tab 64 MiB @tab 256 MiB @tab 1 GiB +@item -4 @tab 12 MiB @tab 24 MiB @tab 48 MiB @tab 96 MiB @tab 384 MiB @tab 1.5 GiB +@item -5 @tab 16 MiB @tab 32 MiB @tab 64 MiB @tab 128 MiB @tab 512 MiB @tab 2 GiB +@item -6 @tab 32 MiB @tab 64 MiB @tab 128 MiB @tab 256 MiB @tab 1 GiB @tab 4 GiB +@item -7 @tab 64 MiB @tab 128 MiB @tab 256 MiB @tab 512 MiB @tab 2 GiB @tab 8 GiB +@item -8 @tab 96 MiB @tab 192 MiB @tab 384 MiB @tab 768 MiB @tab 3 GiB @tab 12 GiB +@item -9 @tab 128 MiB @tab 256 MiB @tab 512 MiB @tab 1 GiB @tab 4 GiB @tab 16 GiB +@end multitable + + +@node Examples +@chapter A small tutorial with examples +@cindex examples + +@noindent +Example 1: Create a multimember compressed archive @samp{archive.tar.lz} +containing files @samp{a}, @samp{b} and @samp{c}. + +@example +tarlz -cf archive.tar.lz a b c +@end example + +@sp 1 +@noindent +Example 2: Append files @samp{d} and @samp{e} to the multimember compressed +archive @samp{archive.tar.lz}. + +@example +tarlz -rf archive.tar.lz d e +@end example + +@sp 1 +@noindent +Example 3: Create a solidly compressed appendable archive +@samp{archive.tar.lz} containing files @samp{a}, @samp{b} and @samp{c}. +Then append files @samp{d} and @samp{e} to the archive. + +@example +tarlz --asolid -cf archive.tar.lz a b c +tarlz --asolid -rf archive.tar.lz d e +@end example + +@sp 1 +@noindent +Example 4: Create a compressed appendable archive containing directories +@samp{dir1}, @samp{dir2} and @samp{dir3} with a separate lzip member per +directory. Then append files @samp{a}, @samp{b}, @samp{c}, @samp{d} and +@samp{e} to the archive, all of them contained in a single lzip member. +The resulting archive @samp{archive.tar.lz} contains 5 lzip members +(including the end-of-archive member). + +@example +tarlz --dsolid -cf archive.tar.lz dir1 dir2 dir3 +tarlz --asolid -rf archive.tar.lz a b c d e +@end example + +@sp 1 +@noindent +Example 5: Create a solidly compressed archive @samp{archive.tar.lz} +containing files @samp{a}, @samp{b} and @samp{c}. Note that no more +files can be later appended to the archive. + +@example +tarlz --solid -cf archive.tar.lz a b c +@end example + +@sp 1 +@noindent +Example 6: Extract all files from archive @samp{archive.tar.lz}. + +@example +tarlz -xf archive.tar.lz +@end example + +@sp 1 +@noindent +Example 7: Extract files @samp{a} and @samp{c}, and the whole tree under +directory @samp{dir1} from archive @samp{archive.tar.lz}. + +@example +tarlz -xf archive.tar.lz a c dir1 +@end example + +@sp 1 +@noindent +Example 8: Copy the contents of directory @samp{sourcedir} to the directory +@samp{destdir}. + +@example +tarlz -C sourcedir --uncompressed -cf - . | tarlz -C destdir -xf - +@end example + +@sp 1 +@noindent +Example 9: Compress the existing POSIX archive @samp{archive.tar} and write +the output to @samp{archive.tar.lz}. Compress each member individually for +maximum availability. (If one member in the compressed archive gets damaged, +the other members can still be extracted). + +@example +tarlz -z --no-solid archive.tar +@end example + +@sp 1 +@noindent +Example 10: Compress the archive @samp{archive.tar} and write the output to +@samp{foo.tar.lz}. + +@example +tarlz -z -o foo.tar.lz archive.tar +@end example + +@sp 1 +@noindent +Example 11: Concatenate and compress two archives @samp{archive1.tar} and +@samp{archive2.tar}, and write the output to @samp{foo.tar.lz}. + +@example +tarlz -A archive1.tar archive2.tar | tarlz -z -o foo.tar.lz +@end example + + +@node Problems +@chapter Reporting bugs +@cindex bugs +@cindex getting help + +There are probably bugs in tarlz. There are certainly errors and +omissions in this manual. If you report them, they will get fixed. If +you don't, no one will ever know about them and they will remain unfixed +for all eternity, if not longer. + +If you find a bug in tarlz, please send electronic mail to +@email{lzip-bug@@nongnu.org}. Include the version number, which you can +find by running @w{@samp{tarlz --version}} and +@w{@samp{tarlz -v --check-lib}}. + + +@node Concept index +@unnumbered Concept index + +@printindex cp + +@bye |