LZMA specification (DRAFT version)
----------------------------------
Author: Igor Pavlov
Date: 2015-06-14
This specification defines the format of LZMA compressed data and lzma file format.
Notation
--------
We use the syntax of C++ programming language.
We use the following types in C++ code:
unsigned - unsigned integer, at least 16 bits in size
int - signed integer, at least 16 bits in size
UInt64 - 64-bit unsigned integer
UInt32 - 32-bit unsigned integer
UInt16 - 16-bit unsigned integer
Byte - 8-bit unsigned integer
bool - boolean type with two possible values: false, true
lzma file format
================
The lzma file contains the raw LZMA stream and the header with related properties.
The files in that format use ".lzma" extension.
The lzma file format layout:
Offset Size Description

  0     1   LZMA model properties (lc, lp, pb) in encoded form
  1     4   Dictionary size (32-bit unsigned integer, little-endian)
  5     8   Uncompressed size (64-bit unsigned integer, little-endian)
 13         Compressed data (LZMA stream)
LZMA properties:
    name  Range          Description

      lc  [0, 8]         the number of "literal context" bits
      lp  [0, 4]         the number of "literal pos" bits
      pb  [0, 4]         the number of "pos" bits
dictSize  [0, 2^32 - 1]  the dictionary size
The following code encodes LZMA properties:
void EncodeProperties(Byte *properties)
{
  properties[0] = (Byte)((pb * 5 + lp) * 9 + lc);
  Set_UInt32_LittleEndian(properties + 1, dictSize);
}
If the value of dictionary size in properties is smaller than (1 << 12),
the LZMA decoder must set the dictionary size variable to (1 << 12).
#define LZMA_DIC_MIN (1 << 12)

unsigned lc, pb, lp;
UInt32 dictSize;
UInt32 dictSizeInProperties;

void DecodeProperties(const Byte *properties)
{
  unsigned d = properties[0];
  if (d >= (9 * 5 * 5))
    throw "Incorrect LZMA properties";
  lc = d % 9;
  d /= 9;
  pb = d / 5;
  lp = d % 5;
  dictSizeInProperties = 0;
  for (int i = 0; i < 4; i++)
    dictSizeInProperties |= (UInt32)properties[i + 1] << (8 * i);
  dictSize = dictSizeInProperties;
  if (dictSize < LZMA_DIC_MIN)
    dictSize = LZMA_DIC_MIN;
}
If all 64 bits of the "Uncompressed size" field are set to one, the
uncompressed size is unknown and the stream contains the "end marker"
that indicates the point where decoding must stop.
Otherwise, if the value of the "Uncompressed size" field is not
equal to ((2^64) - 1), decoding of the LZMA stream must finish after the
specified number of bytes (Uncompressed size) has been decoded. If the
stream also contains the "end marker", the LZMA decoder must read that
marker as well.
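As an illustration of the header layout described above, the following minimal sketch parses the 13-byte header from a memory buffer. The names LzmaHeader and ParseLzmaHeader are hypothetical and are not part of the reference code, and error handling is reduced to a boolean result.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical container for the parsed header fields (not part of the reference code).
struct LzmaHeader
{
  unsigned lc, lp, pb;
  uint32_t dictSize;
  uint64_t unpackSize;
  bool unpackSizeDefined;
};

// Parses the 13-byte lzma file header stored at "h".
// Returns false if the properties byte is out of range.
inline bool ParseLzmaHeader(const uint8_t *h, LzmaHeader &out)
{
  unsigned d = h[0];
  if (d >= 9 * 5 * 5)
    return false;
  out.lc = d % 9;
  d /= 9;
  out.pb = d / 5;
  out.lp = d % 5;

  // Dictionary size: 32-bit little-endian, raised to the 4 KiB minimum.
  uint32_t dict = 0;
  for (int i = 0; i < 4; i++)
    dict |= (uint32_t)h[1 + i] << (8 * i);
  if (dict < ((uint32_t)1 << 12))
    dict = (uint32_t)1 << 12;
  out.dictSize = dict;

  // Uncompressed size: 64-bit little-endian, all ones means "unknown".
  uint64_t size = 0;
  for (int i = 0; i < 8; i++)
    size |= (uint64_t)h[5 + i] << (8 * i);
  out.unpackSize = size;
  out.unpackSizeDefined = (size != ~(uint64_t)0);
  return true;
}
```

For the header bytes 5D 00 00 80 00 followed by eight FF bytes, this sketch yields lc = 3, lp = 0, pb = 2, a dictionary size of 0x800000 and an undefined uncompressed size.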
The new scheme to encode LZMA properties
----------------------------------------
If LZMA compression is used in some other format, it's recommended to
use a new improved scheme to encode LZMA properties. That new scheme is
used in the xz format, which uses the LZMA2 compression algorithm.
LZMA2 is a newer compression algorithm that is based on the LZMA algorithm.
The dictionary size in LZMA2 is encoded with just one byte, and LZMA2 supports
only a reduced set of dictionary sizes:
(2 << 11), (3 << 11),
(2 << 12), (3 << 12),
...
(2 << 30), (3 << 30),
(2 << 31) - 1
The dictionary size can be extracted from encoded value with the following code:
dictSize = (p == 40) ? 0xFFFFFFFF : (((UInt32)2 | ((p) & 1)) << ((p) / 2 + 11));
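To make the one-byte encoding concrete, here is a small sketch that wraps the expression above in a function. The function name and the "ok" flag are hypothetical, and rejecting codes above 40 is an assumption of this sketch rather than something stated in this section.

```cpp
#include <cassert>
#include <cstdint>

// Expands the one-byte LZMA2 dictionary size code "p" into a size in bytes.
// Assumption: codes above 40 are treated as invalid and reported through "ok".
inline uint32_t Lzma2DictSizeFromCode(unsigned p, bool &ok)
{
  ok = (p <= 40);
  if (!ok)
    return 0;
  if (p == 40)
    return 0xFFFFFFFF;
  // Even codes give (2 << n), odd codes give (3 << n), with n = p / 2 + 11.
  return ((uint32_t)2 | (p & 1)) << (p / 2 + 11);
}
```

With this sketch, code 0 maps to 4 KiB, code 1 to 6 KiB, code 39 to (3 << 30) and code 40 to 0xFFFFFFFF.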
Also there is additional limitation (lc + lp <= 4) in LZMA2 for values of
"lc" and "lp" properties:
if (lc + lp > 4)
throw "Unsupported properties: (lc + lp) > 4";
This (lc + lp) limitation has some advantages for the LZMA decoder.
It reduces the maximum size of the tables allocated by the decoder.
It also reduces the complexity of the initialization procedure, which can be
important for fast decoding of a large number of small LZMA streams.
It's recommended to use that limitation (lc + lp <= 4) for any new format
that uses LZMA compression. Note that the combinations of "lc" and "lp"
parameters where (lc + lp > 4) provide a significant improvement in
compression ratio only in some rare cases.
The LZMA properties can be encoded into two bytes in new scheme:
Offset Size Description

  0     1   The dictionary size encoded with LZMA2 scheme
  1     1   LZMA model properties (lc, lp, pb) in encoded form
The RAM usage
=============
The RAM usage for LZMA decoder is determined by the following parts:
1) The Sliding Window (from 4 KiB to 4 GiB).
2) The probability model counter arrays (arrays of 16-bit variables).
3) Some additional state variables (about 10 variables of 32-bit integers).
The RAM usage for Sliding Window
--------------------------------
There are two main scenarios of decoding:
1) The decoding of full stream to one RAM buffer.
If we decode full LZMA stream to one output buffer in RAM, the decoder
can use that output buffer as sliding window. So the decoder doesn't
need additional buffer allocated for sliding window.
2) The decoding to some external storage.
If we decode LZMA stream to external storage, the decoder must allocate
the buffer for sliding window. The size of that buffer must be equal
or larger than the value of dictionary size from properties of LZMA stream.
In this specification we describe the code for decoding to some external
storage. The optimized version of code for decoding of full stream to one
output RAM buffer can require some minor changes in code.
The RAM usage for the probability model counters
------------------------------------------------
The size of the probability model counter arrays is calculated with the
following formula:
size_of_prob_arrays = 1846 + 768 * (1 << (lp + lc))
Each probability model counter is 11-bit unsigned integer.
If we use 16-bit integer variables (2-byte integers) for these probability
model counters, the RAM usage required by probability model counter arrays
can be estimated with the following formula:
RAM = 4 KiB + 1.5 KiB * (1 << (lp + lc))
For example, for default LZMA parameters (lp = 0 and lc = 3), the RAM usage is
RAM_lc3_lp0 = 4 KiB + 1.5 KiB * 8 = 16 KiB
The maximum RAM state usage is required for decoding the stream with lp = 4
and lc = 8:
RAM_lc8_lp4 = 4 KiB + 1.5 KiB * 4096 = 6148 KiB
If the decoder uses LZMA2's limited property condition
(lc + lp <= 4), the RAM usage will be no larger than
RAM_lc_lp_4 = 4 KiB + 1.5 KiB * 16 = 28 KiB
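The estimates above can be checked with a short calculation. The sketch below evaluates the counter formula exactly, assuming 2-byte counters; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of probability counters for the given lc and lp values.
inline uint32_t NumProbCounters(unsigned lc, unsigned lp)
{
  return 1846 + ((uint32_t)768 << (lc + lp));
}

// Size of the counter arrays in bytes, assuming 16-bit counters.
inline size_t ProbArrayBytes(unsigned lc, unsigned lp)
{
  return (size_t)NumProbCounters(lc, lp) * sizeof(uint16_t);
}
```

For lc = 3 and lp = 0 this gives 7990 counters (15980 bytes), which the formula above rounds up to 16 KiB.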
The RAM usage for encoder
-------------------------
There are many variants of LZMA encoding code.
These variants differ in memory consumption.
Note that the memory consumption of an LZMA Encoder can not be
smaller than the memory consumption of an LZMA Decoder for the same stream.
The RAM usage required by a modern efficient implementation of the
LZMA Encoder can be estimated with the following formula:
Encoder_RAM_Usage = 4 MiB + 11 * dictionarySize.
But there are some modes of the encoder that require less memory.
LZMA Decoding
=============
The LZMA compression algorithm uses LZ-based compression with Sliding Window
and Range Encoding as entropy coding method.
Sliding Window
--------------
LZMA uses Sliding Window compression similar to LZ77 algorithm.
LZMA stream must be decoded to the sequence that consists
of MATCHES and LITERALS:
- a LITERAL is an 8-bit character (one byte).
  The decoder just puts that LITERAL to the uncompressed stream.
- a MATCH is a pair of two numbers (DISTANCE-LENGTH pair).
  The decoder takes one byte exactly "DISTANCE" characters behind
  current position in the uncompressed stream and puts it to
  uncompressed stream. The decoder must repeat it "LENGTH" times.
The "DISTANCE" can not be larger than dictionary size.
And the "DISTANCE" can not be larger than the number of bytes in
the uncompressed stream that were decoded before that match.
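The byte-by-byte copy described above matters when LENGTH is larger than DISTANCE, because the match then overlaps the bytes it is producing. The following sketch shows this on a plain growing buffer. The function name is hypothetical, the distance here is the real (one-based) distance, and the dictionary size check is left out as a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Appends a MATCH to "out" one byte at a time.
// Returns false if the distance points before the start of the decoded data.
inline bool AppendMatch(std::vector<uint8_t> &out, size_t distance, size_t length)
{
  if (distance == 0 || distance > out.size())
    return false;
  for (; length > 0; length--)
  {
    // Read first, then append: a later iteration may read a byte
    // that was written by an earlier iteration of the same match.
    uint8_t b = out[out.size() - distance];
    out.push_back(b);
  }
  return true;
}
```

Starting from the two bytes "ab", a match with distance 2 and length 4 produces "ababab".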
In this specification we use cyclic buffer to implement Sliding Window
for LZMA decoder:
class COutWindow
{
  Byte *Buf;
  UInt32 Pos;
  UInt32 Size;
  bool IsFull;

public:
  unsigned TotalPos;
  COutStream OutStream;

  COutWindow(): Buf(NULL) {}
  ~COutWindow() { delete []Buf; }

  void Create(UInt32 dictSize)
  {
    Buf = new Byte[dictSize];
    Pos = 0;
    Size = dictSize;
    IsFull = false;
    TotalPos = 0;
  }

  void PutByte(Byte b)
  {
    TotalPos++;
    Buf[Pos++] = b;
    if (Pos == Size)
    {
      Pos = 0;
      IsFull = true;
    }
    OutStream.WriteByte(b);
  }

  Byte GetByte(UInt32 dist) const
  {
    return Buf[dist <= Pos ? Pos - dist : Size - dist + Pos];
  }

  void CopyMatch(UInt32 dist, unsigned len)
  {
    for (; len > 0; len--)
      PutByte(GetByte(dist));
  }

  bool CheckDistance(UInt32 dist) const
  {
    return dist <= Pos || IsFull;
  }

  bool IsEmpty() const
  {
    return Pos == 0 && !IsFull;
  }
};
In another implementation it's possible to use one buffer that contains
Sliding Window and the whole data stream after uncompressing.
Range Decoder
-------------
LZMA algorithm uses Range Encoding (1) as entropy coding method.
LZMA stream contains just one very big number in big-endian encoding.
LZMA decoder uses the Range Decoder to extract a sequence of binary
symbols from that big number.
The state of the Range Decoder:
struct CRangeDecoder
{
  UInt32 Range;
  UInt32 Code;
  InputStream *InStream;

  bool Corrupted;
};
The notes about UInt32 type for the "Range" and "Code" variables:
It's possible to use 64-bit (unsigned or signed) integer type
for the "Range" and the "Code" variables instead of 32-bit unsigned,
but some additional code must be used to truncate the values to
low 32-bits after some operations.
If the programming language does not support 32-bit unsigned integer type
(like in case of JAVA language), it's possible to use 32-bit signed integer,
but some code must be changed. For example, it's required to change the code
that uses comparison operations for UInt32 variables in this specification.
The Range Decoder can be in some states that can be treated as
"Corruption" in LZMA stream. The Range Decoder uses the variable "Corrupted":
(Corrupted == false), if the Range Decoder has not detected any corruption.
(Corrupted == true), if the Range Decoder has detected some corruption.
The reference LZMA Decoder ignores the value of the "Corrupted" variable.
So it continues to decode the stream, even if the corruption can be detected
in the Range Decoder. To provide full compatibility with the output of the
reference LZMA Decoder, other LZMA Decoder implementations must also
ignore the value of the "Corrupted" variable.
The LZMA Encoder is required to create only such LZMA streams, that will not
lead the Range Decoder to states, where the "Corrupted" variable is set to true.
The Range Decoder reads first 5 bytes from input stream to initialize
the state:
bool CRangeDecoder::Init()
{
  Corrupted = false;
  Range = 0xFFFFFFFF;
  Code = 0;

  Byte b = InStream->ReadByte();

  for (int i = 0; i < 4; i++)
    Code = (Code << 8) | InStream->ReadByte();

  if (b != 0 || Code == Range)
    Corrupted = true;
  return b == 0;
}
The LZMA Encoder always writes ZERO in the initial byte of the compressed
stream. That scheme simplifies the code of the Range Encoder in the
LZMA Encoder. If the initial byte is not equal to ZERO, the LZMA Decoder must
stop decoding and report an error.
After the last bit of data was decoded by Range Decoder, the value of the
"Code" variable must be equal to 0. The LZMA Decoder must check it by
calling the IsFinishedOK() function:
bool IsFinishedOK() const { return Code == 0; }
If there is corruption in the data stream, there is a high probability that
the "Code" value will not be equal to 0 when decoding finishes. So the
check in the IsFinishedOK() function is an effective way to detect
corruption.
The value of the "Range" variable before each bit decoding can not be smaller
than ((UInt32)1 << 24). The Normalize() function keeps the "Range" value in
described range.
#define kTopValue ((UInt32)1 << 24)

void CRangeDecoder::Normalize()
{
  if (Range < kTopValue)
  {
    Range <<= 8;
    Code = (Code << 8) | InStream->ReadByte();
  }
}
Notes: if the size of the "Code" variable is larger than 32 bits, it's
required to keep only low 32 bits of the "Code" variable after the change
in Normalize() function.
If the LZMA Stream is not corrupted, the value of the "Code" variable is
always smaller than value of the "Range" variable.
But the Range Decoder ignores some types of corruptions, so the value of
the "Code" variable can be equal or larger than value of the "Range" variable
for some "Corrupted" archives.
LZMA uses Range Encoding only with binary symbols of two types:
1) binary symbols with fixed and equal probabilities (direct bits)
2) binary symbols with predicted probabilities
The DecodeDirectBits() function decodes the sequence of direct bits:
UInt32 CRangeDecoder::DecodeDirectBits(unsigned numBits)
{
  UInt32 res = 0;
  do
  {
    Range >>= 1;
    Code -= Range;
    UInt32 t = 0 - ((UInt32)Code >> 31);
    Code += Range & t;

    if (Code == Range)
      Corrupted = true;

    Normalize();
    res <<= 1;
    res += t + 1;
  }
  while (--numBits);
  return res;
}
The Bit Decoding with Probability Model
---------------------------------------
The task of Bit Probability Model is to estimate probabilities of binary
symbols. And then it provides the Range Decoder with that information.
The better prediction provides better compression ratio.
The Bit Probability Model uses statistical data of previous decoded
symbols.
That estimated probability is presented as 11-bit unsigned integer value
that represents the probability of symbol "0".
#define kNumBitModelTotalBits 11
Mathematical probabilities can be presented with the following formulas:

  probability(symbol_0) = prob / 2048
  probability(symbol_1) = 1 - probability(symbol_0)
                        = 1 - prob / 2048
                        = (2048 - prob) / 2048
where the "prob" variable contains 11-bit integer probability counter.
It's recommended to use 16-bit unsigned integer type, to store these 11-bit
probability values:
typedef UInt16 CProb;
Each probability value must be initialized with value ((1 << 11) / 2),
that represents the state, where probabilities of symbols 0 and 1
are equal to 0.5:
#define PROB_INIT_VAL ((1 << kNumBitModelTotalBits) / 2)
The INIT_PROBS macro is used to initialize the array of CProb variables:
#define INIT_PROBS(p) \
{ for (unsigned i = 0; i < sizeof(p) / sizeof(p[0]); i++) p[i] = PROB_INIT_VAL; }
The DecodeBit() function decodes one bit.
The LZMA decoder provides the pointer to CProb variable that contains
information about estimated probability for symbol 0 and the Range Decoder
updates that CProb variable after decoding. The Range Decoder increases
estimated probability of the symbol that was decoded:
#define kNumMoveBits 5

unsigned CRangeDecoder::DecodeBit(CProb *prob)
{
  unsigned v = *prob;
  UInt32 bound = (Range >> kNumBitModelTotalBits) * v;
  unsigned symbol;
  if (Code < bound)
  {
    v += ((1 << kNumBitModelTotalBits) - v) >> kNumMoveBits;
    Range = bound;
    symbol = 0;
  }
  else
  {
    v -= v >> kNumMoveBits;
    Code -= bound;
    Range -= bound;
    symbol = 1;
  }
  *prob = (CProb)v;
  Normalize();
  return symbol;
}
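To show how the counter adapts, the sketch below isolates the update rule used in DecodeBit() from the range arithmetic. The function name UpdateProb is hypothetical; the constants 11 and 5 are kNumBitModelTotalBits and kNumMoveBits from the text.

```cpp
#include <cassert>

// Returns the updated 11-bit counter after the symbol "bit" (0 or 1) was decoded.
// The counter estimates the probability of symbol 0, scaled by 2048.
inline unsigned UpdateProb(unsigned prob, unsigned bit)
{
  const unsigned kBits = 11; // kNumBitModelTotalBits
  const unsigned kMove = 5;  // kNumMoveBits
  if (bit == 0)
    return prob + (((1u << kBits) - prob) >> kMove); // move towards 2048
  return prob - (prob >> kMove);                     // move towards 0
}
```

Starting from the initial value 1024, one decoded 0 moves the counter to 1056 and one decoded 1 moves it to 992. In this sketch a long run of identical symbols makes the counter settle at 2017 (for zeros) or 31 (for ones), so neither symbol is ever assigned a probability of zero.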
The Binary Tree of bit model counters
-------------------------------------
LZMA uses a tree of Bit model variables to decode symbol that needs
several bits for storing. There are two versions of such trees in LZMA:
1) the tree that decodes bits from high bit to low bit (the normal scheme).
2) the tree that decodes bits from low bit to high bit (the reverse scheme).
Each binary tree structure supports different size of decoded symbol
(the size of binary sequence that contains value of symbol).
If that size of decoded symbol is "NumBits" bits, the tree structure
uses the array of (1 << NumBits) counters of CProb type.
But only ((1 << NumBits) - 1) items are used by encoder and decoder.
The first item (the item with index equal to 0) in array is unused.
That scheme with an unused array item simplifies the code.
unsigned BitTreeReverseDecode(CProb *probs, unsigned numBits, CRangeDecoder *rc)
{
  unsigned m = 1;
  unsigned symbol = 0;
  for (unsigned i = 0; i < numBits; i++)
  {
    unsigned bit = rc->DecodeBit(&probs[m]);
    m <<= 1;
    m += bit;
    symbol |= (bit << i);
  }
  return symbol;
}

template <unsigned NumBits>
class CBitTreeDecoder
{
  CProb Probs[(unsigned)1 << NumBits];

public:

  void Init()
  {
    INIT_PROBS(Probs);
  }

  unsigned Decode(CRangeDecoder *rc)
  {
    unsigned m = 1;
    for (unsigned i = 0; i < NumBits; i++)
      m = (m << 1) + rc->DecodeBit(&Probs[m]);
    return m - ((unsigned)1 << NumBits);
  }

  unsigned ReverseDecode(CRangeDecoder *rc)
  {
    return BitTreeReverseDecode(Probs, NumBits, rc);
  }
};
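To see which counters a given symbol touches, the following sketch replays the index arithmetic of the two tree schemes without a range decoder: the bits come from a known symbol instead of DecodeBit(). Both function names are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Counter indices visited for "symbol" in the normal scheme (high bit first).
inline std::vector<unsigned> BitTreeIndices(unsigned symbol, unsigned numBits)
{
  std::vector<unsigned> indices;
  unsigned m = 1;
  for (unsigned i = numBits; i > 0; i--)
  {
    unsigned bit = (symbol >> (i - 1)) & 1;
    indices.push_back(m);
    m = (m << 1) + bit;
  }
  return indices;
}

// Counter indices visited for "symbol" in the reverse scheme (low bit first).
inline std::vector<unsigned> BitTreeReverseIndices(unsigned symbol, unsigned numBits)
{
  std::vector<unsigned> indices;
  unsigned m = 1;
  for (unsigned i = 0; i < numBits; i++)
  {
    unsigned bit = (symbol >> i) & 1;
    indices.push_back(m);
    m = (m << 1) + bit;
  }
  return indices;
}
```

For the 3-bit symbol 6 (binary 110), the normal scheme visits indices 1, 3, 7 and the reverse scheme visits indices 1, 2, 5. Index 0 is never visited, and no index reaches (1 << numBits).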
LZ part of LZMA
---------------
LZ part of LZMA describes details about the decoding of MATCHES and LITERALS.
The Literal Decoding
--------------------
The LZMA Decoder uses (1 << (lc + lp)) tables with CProb values, where
each table contains 0x300 CProb values:
CProb *LitProbs;
void CreateLiterals()
{
LitProbs = new CProb[(UInt32)0x300 << (lc + lp)];
}
void InitLiterals()
{
UInt32 num = (UInt32)0x300 << (lc + lp);
for (UInt32 i = 0; i < num; i++)
LitProbs[i] = PROB_INIT_VAL;
}
To select the table for decoding it uses the context that consists of
(lc) high bits from previous literal and (lp) low bits from value that
represents current position in outputStream.
If (state >= 7), the Literal Decoder also uses "matchByte", the byte in
the output stream at the position that is DISTANCE bytes before the
current position, where DISTANCE is the distance in the DISTANCE-LENGTH pair
of the latest decoded match.
The following code decodes one literal and puts it to Sliding Window buffer:
void DecodeLiteral(unsigned state, UInt32 rep0)
{
  unsigned prevByte = 0;
  if (!OutWindow.IsEmpty())
    prevByte = OutWindow.GetByte(1);

  unsigned symbol = 1;
  unsigned litState = ((OutWindow.TotalPos & ((1 << lp) - 1)) << lc) + (prevByte >> (8 - lc));
  CProb *probs = &LitProbs[(UInt32)0x300 * litState];

  if (state >= 7)
  {
    unsigned matchByte = OutWindow.GetByte(rep0 + 1);
    do
    {
      unsigned matchBit = (matchByte >> 7) & 1;
      matchByte <<= 1;
      unsigned bit = RangeDec.DecodeBit(&probs[((1 + matchBit) << 8) + symbol]);
      symbol = (symbol << 1) | bit;
      if (matchBit != bit)
        break;
    }
    while (symbol < 0x100);
  }
  while (symbol < 0x100)
    symbol = (symbol << 1) | RangeDec.DecodeBit(&probs[symbol]);
  OutWindow.PutByte((Byte)(symbol - 0x100));
}
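The table selection in DecodeLiteral() can be tried out in isolation. The sketch below computes the same "litState" value as a free function; the name LiteralState and the explicit parameters are hypothetical (the code above reads lc, lp and the position from decoder state).

```cpp
#include <cassert>
#include <cstdint>

// Index of the literal table: (lp) low bits of the position, followed by
// (lc) high bits of the previous byte.
inline unsigned LiteralState(uint32_t totalPos, unsigned prevByte, unsigned lc, unsigned lp)
{
  return ((totalPos & ((1u << lp) - 1)) << lc) + (prevByte >> (8 - lc));
}

// Offset of the selected table inside LitProbs (each table has 0x300 counters).
inline uint32_t LiteralTableOffset(uint32_t totalPos, unsigned prevByte, unsigned lc, unsigned lp)
{
  return (uint32_t)0x300 * LiteralState(totalPos, prevByte, lc, lp);
}
```

With lc = 3 and lp = 0, a previous byte of 0x41 selects table 2 regardless of the position. With lc = 8 and lp = 4, the largest possible table index is 4095.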
The match length decoding
-------------------------
The match length decoder returns normalized (zero-based value)
length of match. That value can be converted to real length of the match
with the following code:
#define kMatchMinLen 2
matchLen = len + kMatchMinLen;
The match length decoder can return the values from 0 to 271.
And the corresponded real match length values can be in the range
from 2 to 273.
The following scheme is used for the match length encoding:
Binary encoding    Binary Tree structure    Zero-based match length
sequence                                    (binary + decimal):

0 xxx              LowCoder[posState]       xxx
1 0 yyy            MidCoder[posState]       yyy + 8
1 1 zzzzzzzz       HighCoder                zzzzzzzz + 16
LZMA uses bit model variable "Choice" to decode the first selection bit.
If the first selection bit is equal to 0, the decoder uses binary tree
LowCoder[posState] to decode 3-bit zero-based match length (xxx).
If the first selection bit is equal to 1, the decoder uses bit model
variable "Choice2" to decode the second selection bit.
If the second selection bit is equal to 0, the decoder uses binary tree
MidCoder[posState] to decode 3-bit "yyy" value, and zero-based match
length is equal to (yyy + 8).
If the second selection bit is equal to 1, the decoder uses binary tree
HighCoder to decode 8-bit "zzzzzzzz" value, and zero-based
match length is equal to (zzzzzzzz + 16).
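The three cases can be summarized in a few lines. This sketch takes the two selection bits and the value returned by the selected binary tree, and returns the real match length. The function name is hypothetical, and the range decoder is left out.

```cpp
#include <cassert>

// Real match length from the selection bits and the decoded tree value.
// "choice2" is ignored when "choice" is 0.
inline unsigned RealMatchLen(unsigned choice, unsigned choice2, unsigned treeValue)
{
  const unsigned kMatchMinLen = 2;
  unsigned len;
  if (choice == 0)
    len = treeValue;        // LowCoder: 3-bit value, 0..7
  else if (choice2 == 0)
    len = 8 + treeValue;    // MidCoder: 3-bit value, 8..15
  else
    len = 16 + treeValue;   // HighCoder: 8-bit value, 16..271
  return len + kMatchMinLen;
}
```

The smallest result is 2 (choice = 0, value 0) and the largest is 273 (both selection bits set, value 255), which matches the range stated above.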
LZMA uses "posState" value as context to select the binary tree
from LowCoder and MidCoder binary tree arrays:
unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);
The full code of the length decoder:
class CLenDecoder
{
  CProb Choice;
  CProb Choice2;
  CBitTreeDecoder<3> LowCoder[1 << kNumPosBitsMax];
  CBitTreeDecoder<3> MidCoder[1 << kNumPosBitsMax];
  CBitTreeDecoder<8> HighCoder;

public:

  void Init()
  {
    Choice = PROB_INIT_VAL;
    Choice2 = PROB_INIT_VAL;
    HighCoder.Init();
    for (unsigned i = 0; i < (1 << kNumPosBitsMax); i++)
    {
      LowCoder[i].Init();
      MidCoder[i].Init();
    }
  }

  unsigned Decode(CRangeDecoder *rc, unsigned posState)
  {
    if (rc->DecodeBit(&Choice) == 0)
      return LowCoder[posState].Decode(rc);
    if (rc->DecodeBit(&Choice2) == 0)
      return 8 + MidCoder[posState].Decode(rc);
    return 16 + HighCoder.Decode(rc);
  }
};
The LZMA decoder uses two instances of CLenDecoder class.
The first instance is for the matches of "Simple Match" type,
and the second instance is for the matches of "Rep Match" type:
CLenDecoder LenDecoder;
CLenDecoder RepLenDecoder;
The match distance decoding
---------------------------
LZMA supports dictionary sizes up to 4 GiB minus 1.
The value of match distance (decoded by distance decoder) can be
from 1 to 2^32. But the distance value that is equal to 2^32 is used to
indicate the "End of stream" marker. So real largest match distance
that is used for LZ-window match is (2^32 - 1).
LZMA uses normalized match length (zero-based length)
to calculate the context state "lenState" to decode the distance value:
#define kNumLenToPosStates 4
unsigned lenState = len;
if (lenState > kNumLenToPosStates - 1)
lenState = kNumLenToPosStates - 1;
The distance decoder returns the "dist" value that is zero-based value
of match distance. The real match distance can be calculated with the
following code:
matchDistance = dist + 1;
The state of the distance decoder and the initialization code:
#define kEndPosModelIndex 14
#define kNumFullDistances (1 << (kEndPosModelIndex >> 1))
#define kNumAlignBits 4
CBitTreeDecoder<6> PosSlotDecoder[kNumLenToPosStates];
CProb PosDecoders[1 + kNumFullDistances - kEndPosModelIndex];
CBitTreeDecoder<kNumAlignBits> AlignDecoder;
void InitDist()
{
for (unsigned i = 0; i < kNumLenToPosStates; i++)
PosSlotDecoder[i].Init();
AlignDecoder.Init();
INIT_PROBS(PosDecoders);
}
At first stage the distance decoder decodes 6-bit "posSlot" value with bit
tree decoder from PosSlotDecoder array. It's possible to get 2^6=64 different
"posSlot" values.
unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);
The encoding scheme for distance value is shown in the following table:
posSlot (decimal) /
      zero-based distance (binary)
 0    0
 1    1
 2    10
 3    11

 4    10 x
 5    11 x
 6    10 xx
 7    11 xx
 8    10 xxx
 9    11 xxx
10    10 xxxx
11    11 xxxx
12    10 xxxxx
13    11 xxxxx

14    10 yy zzzz
15    11 yy zzzz
16    10 yyy zzzz
17    11 yyy zzzz
...
62    10 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
63    11 yyyyyyyyyyyyyyyyyyyyyyyyyy zzzz
where
"x ... x" means the sequence of binary symbols encoded with binary tree and
"Reverse" scheme. It uses separated binary tree for each posSlot from 4 to 13.
"y" means direct bit encoded with range coder.
"zzzz" means the sequence of four binary symbols encoded with binary
tree with "Reverse" scheme, where one common binary tree "AlignDecoder"
is used for all posSlot values.
If (posSlot < 4), the "dist" value is equal to posSlot value.
If (posSlot >= 4), the decoder uses "posSlot" value to calculate the value of
the high bits of "dist" value and the number of the low bits.
If (4 <= posSlot < kEndPosModelIndex), the decoder uses bit tree decoders
(one separate bit tree decoder per posSlot value) and "Reverse" scheme.
In this implementation we use one CProb array "PosDecoders" that contains
all CProb variables for all these bit decoders.
If (posSlot >= kEndPosModelIndex), the middle bits are decoded as direct
bits from RangeDecoder and the low 4 bits are decoded with a bit tree
decoder "AlignDecoder" with "Reverse" scheme.
The code to decode zero-based match distance:
unsigned DecodeDistance(unsigned len)
{
  unsigned lenState = len;
  if (lenState > kNumLenToPosStates - 1)
    lenState = kNumLenToPosStates - 1;

  unsigned posSlot = PosSlotDecoder[lenState].Decode(&RangeDec);
  if (posSlot < 4)
    return posSlot;

  unsigned numDirectBits = (unsigned)((posSlot >> 1) - 1);
  UInt32 dist = ((2 | (posSlot & 1)) << numDirectBits);
  if (posSlot < kEndPosModelIndex)
    dist += BitTreeReverseDecode(PosDecoders + dist - posSlot, numDirectBits, &RangeDec);
  else
  {
    dist += RangeDec.DecodeDirectBits(numDirectBits - kNumAlignBits) << kNumAlignBits;
    dist += AlignDecoder.ReverseDecode(&RangeDec);
  }
  return dist;
}
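As a cross-check of the posSlot table, the sketch below computes, for a given posSlot, the number of extra low bits and the first and last zero-based distance that the slot can represent. The struct and function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one posSlot: the range of zero-based distances
// it covers and the number of low bits decoded after the slot.
struct PosSlotRange
{
  uint32_t first;
  uint32_t last;
  unsigned numExtraBits;
};

inline PosSlotRange PosSlotToRange(unsigned posSlot)
{
  PosSlotRange r;
  if (posSlot < 4)
  {
    // Slots 0..3 encode the distance directly.
    r.first = r.last = posSlot;
    r.numExtraBits = 0;
    return r;
  }
  r.numExtraBits = (posSlot >> 1) - 1;
  r.first = ((uint32_t)2 | (posSlot & 1)) << r.numExtraBits;
  r.last = r.first + (((uint32_t)1 << r.numExtraBits) - 1);
  return r;
}
```

Slot 13 covers distances 96 to 127, slot 14 starts at 128 (kNumFullDistances), and slot 63 ends at 0xFFFFFFFF, the value reserved for the "End of stream" marker.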
LZMA Decoding modes
-------------------
There are 2 types of LZMA streams:
1) The stream with "End of stream" marker.
2) The stream without "End of stream" marker.
And the LZMA Decoder supports 3 modes of decoding:

1) The unpack size is undefined. The LZMA decoder stops decoding after
   getting "End of stream" marker.
   The input variables for that case:

     markerIsMandatory = true
     unpackSizeDefined = false
     unpackSize contains any value

2) The unpack size is defined and LZMA decoder supports both variants,
   where the stream can contain "End of stream" marker or the stream is
   finished without "End of stream" marker. The LZMA decoder must detect
   any of these situations.
   The input variables for that case:

     markerIsMandatory = false
     unpackSizeDefined = true
     unpackSize contains unpack size

3) The unpack size is defined and the LZMA stream must contain
   "End of stream" marker.
   The input variables for that case:

     markerIsMandatory = true
     unpackSizeDefined = true
     unpackSize contains unpack size
The main loop of decoder
------------------------
The main loop of LZMA decoder:
Initialize the LZMA state.
loop
{
  // begin of loop
  Check "end of stream" conditions.
  Decode Type of MATCH / LITERAL.
    If it's LITERAL, decode LITERAL value and put the LITERAL to Window.
    If it's MATCH, decode the length of match and the match distance.
      Check error conditions, check end of stream conditions and copy
      the sequence of match bytes from sliding window to current position
      in window.
  Go to begin of loop
}
The reference implementation of LZMA decoder uses "unpackSize" variable
to keep the number of remaining bytes in output stream. So it reduces
"unpackSize" value after each decoded LITERAL or MATCH.
The following code contains the "end of stream" condition check at the start
of the loop:
if (unpackSizeDefined && unpackSize == 0 && !markerIsMandatory)
  if (RangeDec.IsFinishedOK())
    return LZMA_RES_FINISHED_WITHOUT_MARKER;
LZMA uses three types of matches:
1) "Simple Match" - the match with distance value encoded with bit models.
2) "Rep Match" - the match that uses the distance from distance
history table.
3) "Short Rep Match" - the match of single byte length, that uses the latest
distance from distance history table.
The LZMA decoder keeps the history of latest 4 match distances that were used
by decoder. That set of 4 variables contains zero-based match distances and
these variables are initialized with zero values:
UInt32 rep0 = 0, rep1 = 0, rep2 = 0, rep3 = 0;
The LZMA decoder uses binary model variables to select type of MATCH or LITERAL:
#define kNumStates 12
#define kNumPosBitsMax 4
CProb IsMatch[kNumStates << kNumPosBitsMax];
CProb IsRep[kNumStates];
CProb IsRepG0[kNumStates];
CProb IsRepG1[kNumStates];
CProb IsRepG2[kNumStates];
CProb IsRep0Long[kNumStates << kNumPosBitsMax];
The decoder uses "state" variable value to select exact variable
from "IsRep", "IsRepG0", "IsRepG1" and "IsRepG2" arrays.
The "state" variable can get the value from 0 to 11.
Initial value for "state" variable is zero:
unsigned state = 0;
The "state" variable is updated after each LITERAL or MATCH with one of the
following functions:
unsigned UpdateState_Literal(unsigned state)
{
if (state < 4) return 0;
else if (state < 10) return state - 3;
else return state - 6;
}
unsigned UpdateState_Match (unsigned state) { return state < 7 ? 7 : 10; }
unsigned UpdateState_Rep (unsigned state) { return state < 7 ? 8 : 11; }
unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; }
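The effect of these functions is easiest to see on a short symbol sequence. The following is a minimal sketch that replays them from the initial state. The "SymbolKind" enum and the "TraceStates" helper are hypothetical names introduced only for this illustration; they are not part of the reference decoder.

```cpp
#include <cassert>
#include <vector>

// Copies of the update functions from the text above.
unsigned UpdateState_Literal(unsigned state)
{
  if (state < 4) return 0;
  else if (state < 10) return state - 3;
  else return state - 6;
}
unsigned UpdateState_Match   (unsigned state) { return state < 7 ? 7 : 10; }
unsigned UpdateState_Rep     (unsigned state) { return state < 7 ? 8 : 11; }
unsigned UpdateState_ShortRep(unsigned state) { return state < 7 ? 9 : 11; }

// Hypothetical symbol tags, used only by this sketch.
enum SymbolKind { kLiteral, kMatch, kRep, kShortRep };

// Applies the update functions to a sequence of symbol kinds and
// returns the state after each symbol, starting from state 0.
std::vector<unsigned> TraceStates(const std::vector<SymbolKind> &symbols)
{
  std::vector<unsigned> trace;
  unsigned state = 0;
  for (SymbolKind kind : symbols)
  {
    switch (kind)
    {
      case kLiteral:  state = UpdateState_Literal(state);  break;
      case kMatch:    state = UpdateState_Match(state);    break;
      case kRep:      state = UpdateState_Rep(state);      break;
      case kShortRep: state = UpdateState_ShortRep(state); break;
    }
    assert(state < 12); // the state always stays in the range 0..11
    trace.push_back(state);
  }
  return trace;
}
```

For the sequence Match, Literal, Literal, Literal the trace is 7, 4, 1, 0: a match moves the state to 7 or higher, and each following literal moves it back toward zero.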
The decoder calculates "state2" variable value to select exact variable from
"IsMatch" and "IsRep0Long" arrays:
unsigned posState = OutWindow.TotalPos & ((1 << pb) - 1);
unsigned state2 = (state << kNumPosBitsMax) + posState;
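As a worked example of this index calculation, the sketch below wraps the two lines above in a function. The name "CalcState2" and the plain "totalPos" parameter (standing in for OutWindow.TotalPos) are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cstdint>

#define kNumPosBitsMax 4

// Computes the index into IsMatch / IsRep0Long as in the text above.
// totalPos stands in for OutWindow.TotalPos; pb is the number of
// position bits and must not exceed kNumPosBitsMax.
unsigned CalcState2(unsigned state, uint32_t totalPos, unsigned pb)
{
  assert(pb <= kNumPosBitsMax);
  unsigned posState = totalPos & ((1u << pb) - 1); // low pb bits of the position
  return (state << kNumPosBitsMax) + posState;
}
```

With pb = 2, state = 7 and a total position of 13, posState is 13 & 3 = 1 and state2 is (7 << 4) + 1 = 113. The largest possible index, for state = 11 and posState = 15, is 191, which fits the array size of (kNumStates << kNumPosBitsMax) = 192.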
The decoder uses the following code flow scheme to select exact
type of LITERAL or MATCH:
IsMatch[state2] decode
  0 - the Literal
  1 - the Match
    IsRep[state] decode
      0 - Simple Match
      1 - Rep Match
        IsRepG0[state] decode
          0 - the distance is rep0
            IsRep0Long[state2] decode
              0 - Short Rep Match
              1 - Rep Match 0
          1 -
            IsRepG1[state] decode
              0 - Rep Match 1
              1 -
                IsRepG2[state] decode
                  0 - Rep Match 2
                  1 - Rep Match 3
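The scheme above can be read as a small decision tree. The following sketch expresses it as a function, under the assumption that a callback supplies each decoded bit. The "SymbolType" and "ModelId" enums and the "decodeBit" callback are hypothetical; in a real decoder each call would be a range decoder bit decode on the probability variable selected by "state" or "state2".

```cpp
#include <cassert>
#include <functional>

// Hypothetical names for the outcomes of the scheme above.
enum SymbolType
{
  kTypeLiteral, kTypeSimpleMatch, kTypeShortRep,
  kTypeRep0, kTypeRep1, kTypeRep2, kTypeRep3
};

// Hypothetical tags that say which bit model is being decoded.
enum ModelId { kIsMatch, kIsRep, kIsRepG0, kIsRep0Long, kIsRepG1, kIsRepG2 };

// decodeBit stands in for decoding one bit with the given model;
// it must return 0 or 1.
SymbolType DecodeSymbolType(const std::function<unsigned(ModelId)> &decodeBit)
{
  assert(decodeBit != nullptr); // the callback must be set
  if (decodeBit(kIsMatch) == 0) return kTypeLiteral;
  if (decodeBit(kIsRep) == 0) return kTypeSimpleMatch;
  if (decodeBit(kIsRepG0) == 0)
    return decodeBit(kIsRep0Long) == 0 ? kTypeShortRep : kTypeRep0;
  if (decodeBit(kIsRepG1) == 0) return kTypeRep1;
  return decodeBit(kIsRepG2) == 0 ? kTypeRep2 : kTypeRep3;
}
```

A literal costs one decoded bit, a simple match two, and the rep variants three to five, so the tree is ordered with the shortest paths on the literal and simple-match branches.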
LITERAL symbol
--------------
If the value "0" was decoded with IsMatch[state2], we have the "LITERAL" type.
First, the LZMA decoder must check that it doesn't exceed the
specified uncompressed size:
if (unpackSizeDefined && unpackSize == 0)
return LZMA_RES_ERROR;
Then it decodes literal value and puts it to sliding window:
DecodeLiteral(state, rep0);
Then the decoder must update the "state" value and the "unpackSize" value:
state = UpdateState_Literal(state);
unpackSize--;
Then the decoder must go to the beginning of the main loop to decode the next Match or Literal.
Simple Match
------------
If the value "1" was decoded with IsMatch[state2] and the value "0" was then
decoded with IsRep[state], we have the "Simple Match" type.
The distance history table is updated with the following scheme:
rep3 = rep2;
rep2 = rep1;
rep1 = rep0;
The zero-based length is decoded with "LenDecoder":
len = LenDecoder.Decode(&RangeDec, posState);
The state is updated with the UpdateState_Match function:
state = UpdateState_Match(state);
and the new "rep0" value is decoded with DecodeDistance:
rep0 = DecodeDistance(len);
That "rep0" will be used as zero-based distance for current match.
If the value of "rep0" is equal to 0xFFFFFFFF, it means that we have
"End of stream" marker, so we can stop decoding and check finishing
condition in Range Decoder:
if (rep0 == 0xFFFFFFFF)
  return RangeDec.IsFinishedOK() ?
      LZMA_RES_FINISHED_WITH_MARKER :
      LZMA_RES_ERROR;
If uncompressed size is defined, LZMA decoder must check that it doesn't
exceed that specified uncompressed size:
if (unpackSizeDefined && unpackSize == 0)
return LZMA_RES_ERROR;
The decoder must also check that the "rep0" value is smaller than the dictionary size
and is not larger than the number of already decoded bytes:
if (rep0 >= dictSize || !OutWindow.CheckDistance(rep0))
return LZMA_RES_ERROR;
Then the decoder must copy match bytes as described in
"The match symbols copying" section.
Rep Match
---------
If the LZMA decoder has decoded the value "1" with IsRep[state] variable,
we have "Rep Match" type.
At first the LZMA decoder must check that it doesn't exceed
specified uncompressed size:
if (unpackSizeDefined && unpackSize == 0)
return LZMA_RES_ERROR;
The decoder must also return an error if the LZ window is empty:
if (OutWindow.IsEmpty())
return LZMA_RES_ERROR;
If the match type is "Rep Match", the decoder uses one of the 4 variables of
the distance history table as the distance for the current match,
so there are 4 corresponding decoding paths.
The decoder updates the distance history with the following scheme,
depending on the type of match:
- "Rep Match 0" or "Short Rep Match":
    ; // LZMA doesn't update the distance history
- "Rep Match 1":
    UInt32 dist = rep1;
    rep1 = rep0;
    rep0 = dist;
- "Rep Match 2":
    UInt32 dist = rep2;
    rep2 = rep1;
    rep1 = rep0;
    rep0 = dist;
- "Rep Match 3":
    UInt32 dist = rep3;
    rep3 = rep2;
    rep2 = rep1;
    rep1 = rep0;
    rep0 = dist;
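The four cases above follow one pattern: the selected distance moves to the front of the history and the entries that were in front of it shift back by one. A minimal sketch of that pattern is shown below. It assumes the history is held in an array instead of the four separate variables used in this specification, and "PromoteRep" is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstdint>

typedef uint32_t UInt32;

// Moves rep[index] to the front of the distance history and shifts the
// entries before it back by one. index 0 leaves the history unchanged,
// which matches the "Rep Match 0" and "Short Rep Match" case.
// The array form is an assumption of this sketch; the text uses rep0..rep3.
void PromoteRep(UInt32 rep[4], unsigned index)
{
  assert(index < 4);
  UInt32 dist = rep[index];
  for (unsigned i = index; i > 0; i--)
    rep[i] = rep[i - 1];
  rep[0] = dist;
}
```

For a history of {10, 20, 30, 40}, a "Rep Match 2" gives {30, 10, 20, 40}, the same result as the three assignments listed for that case.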
The decoder determines the exact subtype of "Rep Match" by decoding "IsRepG0",
"IsRep0Long", "IsRepG1" and "IsRepG2", as shown in the code flow scheme above.
If the subtype is "Short Rep Match", the decoder updates the state, copies
one byte from the window to the current position in the window and goes to the
next MATCH/LITERAL symbol (the beginning of the main loop):
state = UpdateState_ShortRep(state);
OutWindow.PutByte(OutWindow.GetByte(rep0 + 1));
unpackSize--;
continue;
In other cases (Rep Match 0/1/2/3), it decodes the zero-based
length of match with "RepLenDecoder" decoder:
len = RepLenDecoder.Decode(&RangeDec, posState);
Then it updates the state:
state = UpdateState_Rep(state);
Then the decoder must copy match bytes as described in
"The match symbols copying" section.
The match symbols copying
-------------------------
If we have a match (Simple Match or Rep Match 0/1/2/3), the decoder must
copy the sequence of bytes with the calculated match distance and match length.
The decoded length is zero-based, so kMatchMinLen is added to it first.
If the uncompressed size is defined and the match is longer than the remaining
size, the decoder copies only the remaining bytes and then returns an error:
len += kMatchMinLen;
bool isError = false;
if (unpackSizeDefined && unpackSize < len)
{
  len = (unsigned)unpackSize;
  isError = true;
}
OutWindow.CopyMatch(rep0 + 1, len);
unpackSize -= len;
if (isError)
  return LZMA_RES_ERROR;
Then the decoder must go to the beginning of the main loop to decode the next MATCH or LITERAL.
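The copying step can be tried in isolation with the sketch below. It assumes a std::string as a stand-in for the sliding window (so there is no wrap-around), and "CopyMatchChecked" is a hypothetical name; the length adjustment and the size check follow the code above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

#define kMatchMinLen 2

// Appends a match to "out", which stands in for the sliding window.
// rep0 is the zero-based distance and len is the zero-based length.
// Returns false if the match had to be truncated to fit unpackSize.
bool CopyMatchChecked(std::string &out, uint32_t rep0, unsigned len,
                      bool unpackSizeDefined, uint64_t &unpackSize)
{
  assert(rep0 < out.size()); // the distance must stay inside decoded data
  len += kMatchMinLen;
  bool isError = false;
  if (unpackSizeDefined && unpackSize < len)
  {
    len = (unsigned)unpackSize;
    isError = true;
  }
  // Byte-by-byte copy, so a match may overlap the bytes it is producing.
  for (unsigned i = 0; i < len; i++)
    out.push_back(out[out.size() - (rep0 + 1)]);
  unpackSize -= len;
  return !isError;
}
```

Starting from "ab" with 10 bytes remaining, a match with rep0 = 1 and zero-based length 3 copies 5 bytes at distance 2 and produces "abababa". The source and destination overlap, which is why the copy proceeds one byte at a time.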
NOTES
-----
This specification doesn't describe a decoder implementation
that supports partial decoding. Partial decoding can require some
changes in the "end of stream" condition checks. Such code can also
use additional status codes returned by the decoder.
This specification uses C++ code with templates to simplify the description.
An optimized version of the LZMA decoder doesn't need templates.
Such an optimized version can use just two arrays of CProb variables:
1) The dynamic array of CProb variables allocated for the Literal Decoder.
2) The one common array that contains all other CProb variables.
References:
1. G. N. N. Martin, Range encoding: an algorithm for removing redundancy
from a digitized message, Video & Data Recording Conference,
Southampton, UK, July 24-27, 1979.