summaryrefslogtreecommitdiffstats
path: root/README
blob: 5f1dedb3ebb0504bf3627b710a45bbd15754a59d (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
Description

Tarlz is a massively parallel (multi-threaded) combined implementation of
the tar archiver and the lzip compressor. Tarlz uses the compression library
lzlib.

Tarlz creates tar archives using a simplified and safer variant of the POSIX
pax format compressed in lzip format, keeping the alignment between tar
members and lzip members. The resulting multimember tar.lz archive is
backward compatible with standard tar tools like GNU tar, which treat it
like any other tar.lz archive. Tarlz can append files to the end of such
compressed archives.

Keeping the alignment between tar members and lzip members has two
advantages. It adds an indexed lzip layer on top of the tar archive, making
it possible to decode the archive safely in parallel. It also minimizes the
amount of data lost in case of corruption. Compressing a tar archive with
plzip may even double the amount of files lost for each lzip member damaged
because it does not keep the members aligned.

Tarlz can create tar archives with five levels of compression granularity:
per file (--no-solid), per block (--bsolid, default), per directory
(--dsolid), appendable solid (--asolid), and solid (--solid). It can also
create uncompressed tar archives.

Of course, compressing each file (or each directory) individually can't
achieve a compression ratio as high as compressing solidly the whole tar
archive, but it has the following advantages:

   * The resulting multimember tar.lz archive can be decompressed in
     parallel, multiplying the decompression speed.

   * New members can be appended to the archive (by removing the
     end-of-archive member), and unwanted members can be deleted from the
     archive. Just like an uncompressed tar archive.

   * It is a safe POSIX-style backup format. In case of corruption, tarlz
     can extract all the undamaged members from the tar.lz archive,
     skipping over the damaged members, just like the standard
     (uncompressed) tar. Moreover, the option '--keep-damaged' can be used
     to recover as much data as possible from each damaged member, and
     lziprecover can be used to recover some of the damaged members.

   * A multimember tar.lz archive is usually smaller than the corresponding
     solidly compressed tar.gz archive, except when individually
     compressing files smaller than about 32 KiB.

Note that the POSIX pax format has a serious flaw. The metadata stored in
pax extended records are not protected by any kind of check sequence.
Corruption in a long file name may cause the extraction of the file in the
wrong place without warning. Corruption in a large file size may cause the
truncation of the file or the appending of garbage to the file, both
followed by a spurious warning about a corrupt header far from the place of
the undetected corruption.

Metadata like file name and file size must be always protected in an archive
format because of the adverse effects of undetected corruption in them,
potentially much worse that undetected corruption in the data. Even more so
in the case of pax because the amount of metadata it stores is potentially
large, making undetected corruption and archiver misbehavior more probable.

Headers and metadata must be protected separately from data because the
integrity checking of lzip may not be able to detect the corruption before
the metadata have been used, for example, to create a new file in the wrong
place.

Because of the above, tarlz protects the extended records with a Cyclic
Redundancy Check (CRC) in a way compatible with standard tar tools.

Tarlz does not understand other tar formats like gnu, oldgnu, star, or v7.
The command 'tarlz -t -f archive.tar.lz > /dev/null' can be used to check
that the format of the archive is compatible with tarlz.

The diagram below shows the correspondence between each tar member (formed
by one or two headers plus optional data) in the tar archive and each lzip
member in the resulting multimember tar.lz archive, when per file
compression is used:

tar
+========+======+=================+===============+========+======+========+
| header | data | extended header | extended data | header | data |   EOA  |
+========+======+=================+===============+========+======+========+

tar.lz
+===============+=================================================+========+
|     member    |                      member                     | member |
+===============+=================================================+========+


Copyright (C) 2013-2024 Antonio Diaz Diaz.

This file is free documentation: you have unlimited permission to copy,
distribute, and modify it.

The file Makefile.in is a data file used by configure to produce the Makefile.
It has the same copyright owner and permissions that configure itself.