Diffstat (limited to 'doc/tarlz.info')
-rw-r--r-- doc/tarlz.info 202
1 files changed, 131 insertions, 71 deletions
diff --git a/doc/tarlz.info b/doc/tarlz.info
index bf1e1f5..288c441 100644
--- a/doc/tarlz.info
+++ b/doc/tarlz.info
@@ -11,7 +11,7 @@ File: tarlz.info, Node: Top, Next: Introduction, Up: (dir)
Tarlz Manual
************
-This manual is for Tarlz (version 0.10, 31 January 2019).
+This manual is for Tarlz (version 0.11, 13 February 2019).
* Menu:
@@ -20,6 +20,7 @@ This manual is for Tarlz (version 0.10, 31 January 2019).
* File format:: Detailed format of the compressed archive
* Amendments to pax format:: The reasons for the differences with pax
* Multi-threaded tar:: Limitations of parallel tar decoding
+* Minimum archive sizes:: Sizes required for full multi-threaded speed
* Examples:: A small tutorial with examples
* Problems:: Reporting bugs
* Concept index:: Index of concepts
@@ -36,23 +37,23 @@ File: tarlz.info, Node: Introduction, Next: Invoking tarlz, Prev: Top, Up: T
1 Introduction
**************
-Tarlz is a combined implementation of the tar archiver and the lzip
-compressor. By default tarlz creates, lists and extracts archives in a
-simplified posix pax format compressed with lzip on a per file basis.
-Each tar member is compressed in its own lzip member, as well as the
-end-of-file blocks. This method adds an indexed lzip layer on top of
-the tar archive, making it possible to decode the archive safely in
-parallel. The resulting multimember tar.lz archive is fully backward
-compatible with standard tar tools like GNU tar, which treat it like
-any other tar.lz archive. Tarlz can append files to the end of such
-compressed archives.
+Tarlz is a massively parallel (multi-threaded) combined implementation
+of the tar archiver and the lzip compressor. Tarlz creates, lists and
+extracts archives in a simplified posix pax format compressed with
+lzip, keeping the alignment between tar members and lzip members. This
+method adds an indexed lzip layer on top of the tar archive, making it
+possible to decode the archive safely in parallel. The resulting
+multimember tar.lz archive is fully backward compatible with standard
+tar tools like GNU tar, which treat it like any other tar.lz archive.
+Tarlz can append files to the end of such compressed archives.
- Tarlz can create tar archives with four levels of compression
-granularity; per file, per directory, appendable solid, and solid.
+ Tarlz can create tar archives with five levels of compression
+granularity: per file, per block, per directory, appendable solid, and
+solid.
-Of course, compressing each file (or each directory) individually is
-less efficient than compressing the whole tar archive, but it has the
-following advantages:
+Of course, compressing each file (or each directory) individually can't
+achieve a compression ratio as high as compressing the whole tar archive
+solidly, but it has the following advantages:
* The resulting multimember tar.lz archive can be decompressed in
parallel, multiplying the decompression speed.
@@ -87,17 +88,23 @@ The format for running tarlz is:
tarlz [OPTIONS] [FILES]
-On archive creation or appending, tarlz removes leading and trailing
-slashes from filenames, as well as filename prefixes containing a '..'
-component. On extraction, archive members containing a '..' component
-are skipped. Tarlz detects when the archive being created or enlarged
-is among the files to be dumped, appended or concatenated, and skips it.
+On archive creation or appending, tarlz archives the files specified, but
+removes from member names any leading and trailing slashes and any
+filename prefixes containing a '..' component. On extraction, leading
+and trailing slashes are also removed from member names, and archive
+members containing a '..' component in the filename are skipped. Tarlz
+detects when the archive being created or enlarged is among the files
+to be dumped, appended or concatenated, and skips it.
On extraction and listing, tarlz removes leading './' strings from
member names in the archive or given in the command line, so that
'tarlz -xf foo ./bar baz' extracts members 'bar' and './baz' from
archive 'foo'.
+ If several compression levels or '--*solid' options are given, the
+last setting is used. For example, '-9 --solid --uncompressed -1' is
+equivalent to '-1 --solid'.
+
tarlz supports the following options:
'-h'
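The "last setting is used" rule added in this hunk can be illustrated with a minimal Python sketch. It is not tarlz's option parser. The function name `effective_settings` is hypothetical, and grouping '--uncompressed' with the '-0 .. -9' levels is an assumption inferred from the example in the hunk, where a later '-1' overrides '--uncompressed'.

```python
import re


def effective_settings(args):
    """Return the options that remain in effect under 'last setting wins'.

    Assumption: '-0'..'-9' and '--uncompressed' form one group, and every
    option matching '--*solid' forms a second, independent group.
    """
    level = None
    solidity = None
    for arg in args:
        if re.fullmatch(r"-[0-9]", arg) or arg == "--uncompressed":
            level = arg  # a later level replaces any earlier one
        elif arg.startswith("--") and arg.endswith("solid"):
            solidity = arg  # a later '--*solid' replaces any earlier one
    return [opt for opt in (level, solidity) if opt is not None]


if __name__ == "__main__":
    # The example from the manual text: prints ['-1', '--solid']
    print(effective_settings("-9 --solid --uncompressed -1".split()))
```

The sketch ignores every other option and only models how the two groups are resolved.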
@@ -125,7 +132,7 @@ archive 'foo'.
Set target size of input data blocks for the '--bsolid' option.
Valid values range from 8 KiB to 1 GiB. Default value is two times
the dictionary size, except for option '-0' where it defaults to
- 1 MiB.
+ 1 MiB. *Note Minimum archive sizes::.
'-c'
'--create'
@@ -142,6 +149,11 @@ archive 'foo'.
relative to the then current working directory, perhaps changed by
a previous '-C' option.
+ Note that a process can only have one current working directory
+ (CWD). Therefore multi-threading can't be used to create an
+ archive if a '-C' option appears after a relative filename in the
+ command line.
+
'-f ARCHIVE'
'--file=ARCHIVE'
Use archive file ARCHIVE. '-' used as an ARCHIVE argument reads
@@ -149,18 +161,21 @@ archive 'foo'.
'-n N'
'--threads=N'
- Set the number of decompression threads, overriding the system's
+ Set the number of (de)compression threads, overriding the system's
default. Valid values range from 0 to "as many as your system can
support". A value of 0 disables threads entirely. If this option
is not used, tarlz tries to detect the number of processors in the
system and use it as default value. 'tarlz --help' shows the
- system's default value. This option currently only has effect when
- listing the contents of a multimember compressed archive. *Note
+ system's default value. See the note about multi-threaded archive
+ creation in the '-C' option above. Multi-threaded extraction of
+ files from an archive is not yet implemented. *Note
Multi-threaded tar::.
Note that the number of usable threads is limited during
- decompression to the number of lzip members in the tar.lz archive,
- which you can find by running 'lzip -lv archive.tar.lz'.
+ compression to ceil( uncompressed_size / data_size ) (*note
+ Minimum archive sizes::), and during decompression to the number
+ of lzip members in the tar.lz archive, which you can find by
+ running 'lzip -lv archive.tar.lz'.
'-q'
'--quiet'
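The compression-side limit stated in this hunk, ceil( uncompressed_size / data_size ), can be checked with a short Python sketch. The function name `usable_compression_threads` is hypothetical and the code only restates the arithmetic from the manual text. It does not model how tarlz actually schedules its workers.

```python
MIB = 1024 * 1024


def usable_compression_threads(requested_threads, uncompressed_size, data_size):
    """Upper bound on worker threads that can receive a block to compress.

    Sizes are in bytes. The number of blocks is
    ceil(uncompressed_size / data_size), computed with integer arithmetic.
    """
    if requested_threads < 0 or uncompressed_size < 0 or data_size <= 0:
        raise ValueError("sizes must be non-negative and data_size positive")
    blocks = -(-uncompressed_size // data_size)  # integer ceiling division
    return min(requested_threads, blocks)


if __name__ == "__main__":
    # A 100 MiB archive with 64 MiB blocks yields 2 blocks, so only
    # 2 of 8 requested threads can be kept busy: prints 2
    print(usable_compression_threads(8, 100 * MIB, 64 * MIB))
```

In practice this means that adding threads beyond the number of blocks cannot speed up compression; a smaller data size raises the bound, though with smaller blocks.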
@@ -180,7 +195,7 @@ archive 'foo'.
'-t'
'--list'
List the contents of an archive. If FILES are given, list only the
- given FILES.
+ FILES given.
'-v'
'--verbose'
@@ -189,7 +204,7 @@ archive 'foo'.
'-x'
'--extract'
Extract files from an archive. If FILES are given, extract only
- the given FILES. Else extract all the files in the archive.
+ the FILES given. Else extract all the files in the archive.
'-0 .. -9'
Set the compression level. The default compression level is '-6'.
@@ -214,38 +229,43 @@ archive 'foo'.
solid compression. All the files being added to the archive are
compressed into a single lzip member, but the end-of-file blocks
are compressed into a separate lzip member. This creates a solidly
- compressed appendable archive.
+ compressed appendable archive. Solid archives can be neither created
+ nor decoded in parallel.
'--bsolid'
- When creating or appending to a compressed archive, compress tar
- members together in a lzip member until they approximate a target
- uncompressed size. The size can't be exact because each solidly
- compressed data block must contain an integer number of tar
- members. This option improves compression efficiency for archives
- with lots of small files. *Note --data-size::, to set the target
+ When creating or appending to a compressed archive, use block
+ compression. Tar members are compressed together in a lzip member
+ until they approximate a target uncompressed size. The size can't
+ be exact because each solidly compressed data block must contain
+ an integer number of tar members. Block compression is the default
+ because it improves compression ratio for archives with many files
+ smaller than the block size. This option allows tarlz to revert to
+ default behavior if, for example, it is invoked through an alias
+ like 'tar='tarlz --solid''. *Note --data-size::, to set the target
block size.
'--dsolid'
- When creating or appending to a compressed archive, use solid
- compression for each directory especified in the command line. The
- end-of-file blocks are compressed into a separate lzip member. This
- creates a compressed appendable archive with a separate lzip
- member for each top-level directory.
+ When creating or appending to a compressed archive, compress each
+ file specified in the command line separately in its own lzip
+ member, and use solid compression for each directory specified in
+ the command line. The end-of-file blocks are compressed into a
+ separate lzip member. This creates a compressed appendable archive
+ with a separate lzip member for each file or top-level directory
+ specified.
'--no-solid'
When creating or appending to a compressed archive, compress each
- file separately. The end-of-file blocks are compressed into a
- separate lzip member. This creates a compressed appendable archive
- with a separate lzip member for each file. This option allows
- tarlz revert to default behavior if, for example, tarlz is invoked
- through an alias like 'tar='tarlz --solid''.
+ file separately in its own lzip member. The end-of-file blocks are
+ compressed into a separate lzip member. This creates a compressed
+ appendable archive with a lzip member for each file.
'--solid'
When creating or appending to a compressed archive, use solid
- compression. The files being added to the archive, along with the
+ compression. The files being added to the archive, along with the
end-of-file blocks, are compressed into a single lzip member. The
resulting archive is not appendable. No more files can be later
- appended to the archive.
+ appended to the archive. Solid archives can be neither created nor
+ decoded in parallel.
'--anonymous'
Equivalent to '--owner=root --group=root'.
@@ -341,9 +361,9 @@ blocks are either compressed in a separate lzip member or compressed
along with the tar members contained in the last lzip member.
The diagram below shows the correspondence between each tar member
-(formed by one or two headers plus optional data) in the tar archive and
-each lzip member in the resulting multimember tar.lz archive: *Note
-File format: (lzip)File format.
+(formed by one or two headers plus optional data) in the tar archive
+and each lzip member in the resulting multimember tar.lz archive, when
+per file compression is used: *Note File format: (lzip)File format.
tar
+========+======+=================+===============+========+======+========+
@@ -612,12 +632,12 @@ wasteful for a backup format.
There is no portable way to tell what charset a text string is coded
into. Therefore, tarlz stores all fields representing text strings
-as-is, without conversion to UTF-8 nor any other transformation. This
-prevents accidental double UTF-8 conversions. If the need arises this
-behavior will be adjusted with a command line option in the future.
+unmodified, without conversion to UTF-8 nor any other transformation.
+This prevents accidental double UTF-8 conversions. If the need arises
+this behavior will be adjusted with a command line option in the future.

-File: tarlz.info, Node: Multi-threaded tar, Next: Examples, Prev: Amendments to pax format, Up: Top
+File: tarlz.info, Node: Multi-threaded tar, Next: Minimum archive sizes, Prev: Amendments to pax format, Up: Top
5 Limitations of parallel tar decoding
**************************************
@@ -659,15 +679,53 @@ sequential '--list' because, in addition to using several processors,
it only needs to decompress part of each lzip member. See the following
example listing the Silesia corpus on a dual core machine:
- tarlz -9 -cf silesia.tar.lz silesia
+ tarlz -9 --no-solid -cf silesia.tar.lz silesia
time lzip -cd silesia.tar.lz | tar -tf - (5.032s)
time plzip -cd silesia.tar.lz | tar -tf - (3.256s)
time tarlz -tf silesia.tar.lz (0.020s)

-File: tarlz.info, Node: Examples, Next: Problems, Prev: Multi-threaded tar, Up: Top
+File: tarlz.info, Node: Minimum archive sizes, Next: Examples, Prev: Multi-threaded tar, Up: Top
+
+6 Minimum archive sizes required for multi-threaded block compression
+*********************************************************************
+
+When creating or appending to a compressed archive using multi-threaded
+block compression, tarlz puts tar members together in blocks and
+compresses as many blocks simultaneously as there are worker threads,
+creating a multimember compressed archive.
+
+ For this to work as expected (and roughly multiply the compression
+speed by the number of available processors), the uncompressed archive
+must be at least as large as the number of worker threads times the
+block size (*note --data-size::). Else some processors will not get any
+data to compress, and compression will be proportionally slower. The
+maximum speed increase achievable on a given file is limited by the
+ratio (uncompressed_size / data_size). For example, a tarball the size
+of gcc or linux will scale up to 10 or 12 processors at level -9.
+
+ The following table shows the minimum uncompressed archive size
+needed for full use of N processors at a given compression level, using
+the default data size for each level:
+
+Processors 2 4 8 16 64 256
+------------------------------------------------------------------
+Level
+-0 2 MiB 4 MiB 8 MiB 16 MiB 64 MiB 256 MiB
+-1 4 MiB 8 MiB 16 MiB 32 MiB 128 MiB 512 MiB
+-2 6 MiB 12 MiB 24 MiB 48 MiB 192 MiB 768 MiB
+-3 8 MiB 16 MiB 32 MiB 64 MiB 256 MiB 1 GiB
+-4 12 MiB 24 MiB 48 MiB 96 MiB 384 MiB 1.5 GiB
+-5 16 MiB 32 MiB 64 MiB 128 MiB 512 MiB 2 GiB
+-6 32 MiB 64 MiB 128 MiB 256 MiB 1 GiB 4 GiB
+-7 64 MiB 128 MiB 256 MiB 512 MiB 2 GiB 8 GiB
+-8 96 MiB 192 MiB 384 MiB 768 MiB 3 GiB 12 GiB
+-9 128 MiB 256 MiB 512 MiB 1 GiB 4 GiB 16 GiB
+
+
+File: tarlz.info, Node: Examples, Next: Problems, Prev: Minimum archive sizes, Up: Top
-6 A small tutorial with examples
+7 A small tutorial with examples
********************************
Example 1: Create a multimember compressed archive 'archive.tar.lz'
@@ -725,7 +783,7 @@ Example 8: Copy the contents of directory 'sourcedir' to the directory

File: tarlz.info, Node: Problems, Next: Concept index, Prev: Examples, Up: Top
-7 Reporting bugs
+8 Reporting bugs
****************
There are probably bugs in tarlz. There are certainly errors and
@@ -754,6 +812,7 @@ Concept index
* getting help: Problems. (line 6)
* introduction: Introduction. (line 6)
* invoking: Invoking tarlz. (line 6)
+* minimum archive sizes: Minimum archive sizes. (line 6)
* options: Invoking tarlz. (line 6)
* usage: Invoking tarlz. (line 6)
* version: Invoking tarlz. (line 6)
@@ -762,18 +821,19 @@ Concept index

Tag Table:
Node: Top223
-Node: Introduction1013
-Node: Invoking tarlz3125
-Ref: --data-size4717
-Node: File format11536
-Ref: key_crc3216321
-Node: Amendments to pax format21738
-Ref: crc3222262
-Ref: flawed-compat23287
-Node: Multi-threaded tar25649
-Node: Examples28164
-Node: Problems29830
-Node: Concept index30356
+Node: Introduction1089
+Node: Invoking tarlz3218
+Ref: --data-size5097
+Node: File format12673
+Ref: key_crc3217493
+Node: Amendments to pax format22910
+Ref: crc3223434
+Ref: flawed-compat24459
+Node: Multi-threaded tar26826
+Node: Minimum archive sizes29365
+Node: Examples31495
+Node: Problems33164
+Node: Concept index33690

End Tag Table
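The table in the new "Minimum archive sizes" node follows from a single rule: minimum size = number of processors times the default data size. The Python sketch below reproduces it. The per-level data sizes are not quoted from tarlz; they are derived by halving the table's 2-processor column, and the names `DEFAULT_DATA_SIZE_MIB`, `min_archive_size_mib` and `format_size` are hypothetical.

```python
# Default data size per compression level, in MiB.
# Assumption: derived from the manual's table (2-processor column / 2).
DEFAULT_DATA_SIZE_MIB = {
    0: 1, 1: 2, 2: 3, 3: 4, 4: 6,
    5: 8, 6: 16, 7: 32, 8: 48, 9: 64,
}


def min_archive_size_mib(level, processors):
    """Minimum uncompressed size (MiB) that gives every processor a block."""
    return processors * DEFAULT_DATA_SIZE_MIB[level]


def format_size(mib):
    """Format a MiB count the way the table does (MiB below 1 GiB)."""
    if mib >= 1024:
        return f"{mib / 1024:g} GiB"
    return f"{mib} MiB"


if __name__ == "__main__":
    processors = (2, 4, 8, 16, 64, 256)
    print("Level " + "".join(f"{n:>10}" for n in processors))
    for level in sorted(DEFAULT_DATA_SIZE_MIB):
        cells = (format_size(min_archive_size_mib(level, n)) for n in processors)
        print(f"-{level}    " + "".join(f"{c:>10}" for c in cells))
```

Passing a non-default '--data-size' changes the result in the same way: multiply the chosen block size by the number of worker threads.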