648db22b |
1 | Zstandard Compression Format |
2 | ============================ |
3 | |
4 | ### Notices |
5 | |
6 | Copyright (c) Meta Platforms, Inc. and affiliates. |
7 | |
8 | Permission is granted to copy and distribute this document |
9 | for any purpose and without charge, |
10 | including translations into other languages |
11 | and incorporation into compilations, |
12 | provided that the copyright notice and this notice are preserved, |
13 | and that any substantive changes or deletions from the original |
14 | are clearly marked. |
15 | Distribution of this document is unlimited. |
16 | |
17 | ### Version |
18 | |
19 | 0.3.9 (2023-03-08) |
20 | |
21 | |
22 | Introduction |
23 | ------------ |
24 | |
25 | The purpose of this document is to define a lossless compressed data format, |
26 | that is independent of CPU type, operating system, |
27 | file system and character set, suitable for |
28 | file compression, pipe and streaming compression, |
29 | using the [Zstandard algorithm](https://facebook.github.io/zstd/). |
30 | The text of the specification assumes a basic background in programming |
31 | at the level of bits and other primitive data representations. |
32 | |
33 | The data can be produced or consumed, |
34 | even for an arbitrarily long sequentially presented input data stream, |
35 | using only an a priori bounded amount of intermediate storage, |
36 | and hence can be used in data communications. |
37 | The format uses the Zstandard compression method, |
38 | and optional [xxHash-64 checksum method](https://cyan4973.github.io/xxHash/), |
39 | for detection of data corruption. |
40 | |
41 | The data format defined by this specification |
42 | does not attempt to allow random access to compressed data. |
43 | |
44 | Unless otherwise indicated below, |
45 | a compliant compressor must produce data sets |
46 | that conform to the specifications presented here. |
47 | It doesn’t need to support all options though. |
48 | |
49 | A compliant decompressor must be able to decompress |
50 | at least one working set of parameters |
51 | that conforms to the specifications presented here. |
52 | It may also ignore informative fields, such as checksum. |
53 | Whenever it does not support a parameter defined in the compressed stream, |
54 | it must produce a non-ambiguous error code and associated error message |
55 | explaining which parameter is unsupported. |
56 | |
57 | This specification is intended for use by implementers of software |
58 | to compress data into Zstandard format and/or decompress data from Zstandard format. |
59 | The Zstandard format is supported by an open source reference implementation, |
60 | written in portable C, and available at : https://github.com/facebook/zstd . |
61 | |
62 | |
63 | ### Overall conventions |
64 | In this document: |
65 | - square brackets i.e. `[` and `]` are used to indicate optional fields or parameters. |
66 | - the naming convention for identifiers is `Mixed_Case_With_Underscores` |
67 | |
68 | ### Definitions |
69 | Content compressed by Zstandard is transformed into a Zstandard __frame__. |
70 | Multiple frames can be appended into a single file or stream. |
71 | A frame is completely independent, has a defined beginning and end, |
72 | and a set of parameters which tells the decoder how to decompress it. |
73 | |
74 | A frame encapsulates one or multiple __blocks__. |
75 | Each block contains arbitrary content, which is described by its header, |
76 | and has a guaranteed maximum content size, which depends on frame parameters. |
77 | Unlike frames, each block depends on previous blocks for proper decoding. |
78 | However, each block can be decompressed without waiting for its successor, |
79 | allowing streaming operations. |
80 | |
81 | Overview |
82 | --------- |
83 | - [Frames](#frames) |
84 | - [Zstandard frames](#zstandard-frames) |
85 | - [Blocks](#blocks) |
86 | - [Literals Section](#literals-section) |
87 | - [Sequences Section](#sequences-section) |
88 | - [Sequence Execution](#sequence-execution) |
89 | - [Skippable frames](#skippable-frames) |
90 | - [Entropy Encoding](#entropy-encoding) |
91 | - [FSE](#fse) |
92 | - [Huffman Coding](#huffman-coding) |
93 | - [Dictionary Format](#dictionary-format) |
94 | |
95 | Frames |
96 | ------ |
97 | Zstandard compressed data is made of one or more __frames__. |
98 | Each frame is independent and can be decompressed independently of other frames. |
99 | The decompressed content of multiple concatenated frames is the concatenation of |
100 | each frame decompressed content. |
101 | |
102 | There are two frame formats defined by Zstandard: |
103 | Zstandard frames and Skippable frames. |
104 | Zstandard frames contain compressed data, while |
105 | skippable frames contain custom user metadata. |
106 | |
107 | ## Zstandard frames |
108 | The structure of a single Zstandard frame is following: |
109 | |
110 | | `Magic_Number` | `Frame_Header` |`Data_Block`| [More data blocks] | [`Content_Checksum`] | |
111 | |:--------------:|:--------------:|:----------:| ------------------ |:--------------------:| |
112 | | 4 bytes | 2-14 bytes | n bytes | | 0-4 bytes | |
113 | |
114 | __`Magic_Number`__ |
115 | |
116 | 4 Bytes, __little-endian__ format. |
117 | Value : 0xFD2FB528 |
118 | Note: This value was selected to be less probable to find at the beginning of some random file. |
119 | It avoids trivial patterns (0x00, 0xFF, repeated bytes, increasing bytes, etc.), |
120 | contains byte values outside of ASCII range, |
121 | and doesn't map into UTF8 space. |
122 | It reduces the chances that a text file represent this value by accident. |
123 | |
124 | __`Frame_Header`__ |
125 | |
126 | 2 to 14 Bytes, detailed in [`Frame_Header`](#frame_header). |
127 | |
128 | __`Data_Block`__ |
129 | |
130 | Detailed in [`Blocks`](#blocks). |
131 | That’s where compressed data is stored. |
132 | |
133 | __`Content_Checksum`__ |
134 | |
135 | An optional 32-bit checksum, only present if `Content_Checksum_flag` is set. |
136 | The content checksum is the result |
137 | of [xxh64() hash function](https://cyan4973.github.io/xxHash/) |
138 | digesting the original (decoded) data as input, and a seed of zero. |
139 | The low 4 bytes of the checksum are stored in __little-endian__ format. |
140 | |
141 | ### `Frame_Header` |
142 | |
143 | The `Frame_Header` has a variable size, with a minimum of 2 bytes, |
144 | and up to 14 bytes depending on optional parameters. |
145 | The structure of `Frame_Header` is following: |
146 | |
147 | | `Frame_Header_Descriptor` | [`Window_Descriptor`] | [`Dictionary_ID`] | [`Frame_Content_Size`] | |
148 | | ------------------------- | --------------------- | ----------------- | ---------------------- | |
149 | | 1 byte | 0-1 byte | 0-4 bytes | 0-8 bytes | |
150 | |
151 | #### `Frame_Header_Descriptor` |
152 | |
153 | The first header's byte is called the `Frame_Header_Descriptor`. |
154 | It describes which other fields are present. |
155 | Decoding this byte is enough to tell the size of `Frame_Header`. |
156 | |
157 | | Bit number | Field name | |
158 | | ---------- | ---------- | |
159 | | 7-6 | `Frame_Content_Size_flag` | |
160 | | 5 | `Single_Segment_flag` | |
161 | | 4 | `Unused_bit` | |
162 | | 3 | `Reserved_bit` | |
163 | | 2 | `Content_Checksum_flag` | |
164 | | 1-0 | `Dictionary_ID_flag` | |
165 | |
166 | In this table, bit 7 is the highest bit, while bit 0 is the lowest one. |
167 | |
168 | __`Frame_Content_Size_flag`__ |
169 | |
170 | This is a 2-bits flag (`= Frame_Header_Descriptor >> 6`), |
171 | specifying if `Frame_Content_Size` (the decompressed data size) |
172 | is provided within the header. |
173 | `Flag_Value` provides `FCS_Field_Size`, |
174 | which is the number of bytes used by `Frame_Content_Size` |
175 | according to the following table: |
176 | |
177 | | `Flag_Value` | 0 | 1 | 2 | 3 | |
178 | | -------------- | ------ | --- | --- | --- | |
179 | |`FCS_Field_Size`| 0 or 1 | 2 | 4 | 8 | |
180 | |
181 | When `Flag_Value` is `0`, `FCS_Field_Size` depends on `Single_Segment_flag` : |
182 | if `Single_Segment_flag` is set, `FCS_Field_Size` is 1. |
183 | Otherwise, `FCS_Field_Size` is 0 : `Frame_Content_Size` is not provided. |
184 | |
185 | __`Single_Segment_flag`__ |
186 | |
187 | If this flag is set, |
188 | data must be regenerated within a single continuous memory segment. |
189 | |
190 | In this case, `Window_Descriptor` byte is skipped, |
191 | but `Frame_Content_Size` is necessarily present. |
192 | As a consequence, the decoder must allocate a memory segment |
193 | of size equal or larger than `Frame_Content_Size`. |
194 | |
195 | In order to preserve the decoder from unreasonable memory requirements, |
196 | a decoder is allowed to reject a compressed frame |
197 | which requests a memory size beyond decoder's authorized range. |
198 | |
199 | For broader compatibility, decoders are recommended to support |
200 | memory sizes of at least 8 MB. |
201 | This is only a recommendation, |
202 | each decoder is free to support higher or lower limits, |
203 | depending on local limitations. |
204 | |
205 | __`Unused_bit`__ |
206 | |
207 | A decoder compliant with this specification version shall not interpret this bit. |
208 | It might be used in any future version, |
209 | to signal a property which is transparent to properly decode the frame. |
210 | An encoder compliant with this specification version must set this bit to zero. |
211 | |
212 | __`Reserved_bit`__ |
213 | |
214 | This bit is reserved for some future feature. |
215 | Its value _must be zero_. |
216 | A decoder compliant with this specification version must ensure it is not set. |
217 | This bit may be used in a future revision, |
218 | to signal a feature that must be interpreted to decode the frame correctly. |
219 | |
220 | __`Content_Checksum_flag`__ |
221 | |
222 | If this flag is set, a 32-bits `Content_Checksum` will be present at frame's end. |
223 | See `Content_Checksum` paragraph. |
224 | |
225 | __`Dictionary_ID_flag`__ |
226 | |
227 | This is a 2-bits flag (`= FHD & 3`), |
228 | telling if a dictionary ID is provided within the header. |
229 | It also specifies the size of this field as `DID_Field_Size`. |
230 | |
231 | |`Flag_Value` | 0 | 1 | 2 | 3 | |
232 | | -------------- | --- | --- | --- | --- | |
233 | |`DID_Field_Size`| 0 | 1 | 2 | 4 | |
234 | |
235 | #### `Window_Descriptor` |
236 | |
237 | Provides guarantees on minimum memory buffer required to decompress a frame. |
238 | This information is important for decoders to allocate enough memory. |
239 | |
240 | The `Window_Descriptor` byte is optional. |
241 | When `Single_Segment_flag` is set, `Window_Descriptor` is not present. |
242 | In this case, `Window_Size` is `Frame_Content_Size`, |
243 | which can be any value from 0 to 2^64-1 bytes (16 ExaBytes). |
244 | |
245 | | Bit numbers | 7-3 | 2-0 | |
246 | | ----------- | ---------- | ---------- | |
247 | | Field name | `Exponent` | `Mantissa` | |
248 | |
249 | The minimum memory buffer size is called `Window_Size`. |
250 | It is described by the following formulas : |
251 | ``` |
252 | windowLog = 10 + Exponent; |
253 | windowBase = 1 << windowLog; |
254 | windowAdd = (windowBase / 8) * Mantissa; |
255 | Window_Size = windowBase + windowAdd; |
256 | ``` |
257 | The minimum `Window_Size` is 1 KB. |
258 | The maximum `Window_Size` is `(1<<41) + 7*(1<<38)` bytes, which is 3.75 TB. |
259 | |
260 | In general, larger `Window_Size` tend to improve compression ratio, |
261 | but at the cost of memory usage. |
262 | |
263 | To properly decode compressed data, |
264 | a decoder will need to allocate a buffer of at least `Window_Size` bytes. |
265 | |
266 | In order to preserve decoder from unreasonable memory requirements, |
267 | a decoder is allowed to reject a compressed frame |
268 | which requests a memory size beyond decoder's authorized range. |
269 | |
270 | For improved interoperability, |
271 | it's recommended for decoders to support `Window_Size` of up to 8 MB, |
272 | and it's recommended for encoders to not generate frame requiring `Window_Size` larger than 8 MB. |
273 | It's merely a recommendation though, |
274 | decoders are free to support larger or lower limits, |
275 | depending on local limitations. |
276 | |
277 | #### `Dictionary_ID` |
278 | |
279 | This is a variable size field, which contains |
280 | the ID of the dictionary required to properly decode the frame. |
281 | `Dictionary_ID` field is optional. When it's not present, |
282 | it's up to the decoder to know which dictionary to use. |
283 | |
284 | `Dictionary_ID` field size is provided by `DID_Field_Size`. |
285 | `DID_Field_Size` is directly derived from value of `Dictionary_ID_flag`. |
286 | 1 byte can represent an ID 0-255. |
287 | 2 bytes can represent an ID 0-65535. |
288 | 4 bytes can represent an ID 0-4294967295. |
289 | Format is __little-endian__. |
290 | |
291 | It's allowed to represent a small ID (for example `13`) |
292 | with a large 4-bytes dictionary ID, even if it is less efficient. |
293 | |
294 | A value of `0` has same meaning as no `Dictionary_ID`, |
295 | in which case the frame may or may not need a dictionary to be decoded, |
296 | and the ID of such a dictionary is not specified. |
297 | The decoder must know this information by other means. |
298 | |
299 | #### `Frame_Content_Size` |
300 | |
301 | This is the original (uncompressed) size. This information is optional. |
302 | `Frame_Content_Size` uses a variable number of bytes, provided by `FCS_Field_Size`. |
303 | `FCS_Field_Size` is provided by the value of `Frame_Content_Size_flag`. |
304 | `FCS_Field_Size` can be equal to 0 (not present), 1, 2, 4 or 8 bytes. |
305 | |
306 | | `FCS_Field_Size` | Range | |
307 | | ---------------- | ---------- | |
308 | | 0 | unknown | |
309 | | 1 | 0 - 255 | |
310 | | 2 | 256 - 65791| |
311 | | 4 | 0 - 2^32-1 | |
312 | | 8 | 0 - 2^64-1 | |
313 | |
314 | `Frame_Content_Size` format is __little-endian__. |
315 | When `FCS_Field_Size` is 1, 4 or 8 bytes, the value is read directly. |
316 | When `FCS_Field_Size` is 2, _the offset of 256 is added_. |
317 | It's allowed to represent a small size (for example `18`) using any compatible variant. |
318 | |
319 | |
320 | Blocks |
321 | ------- |
322 | |
323 | After `Magic_Number` and `Frame_Header`, there are some number of blocks. |
324 | Each frame must have at least one block, |
325 | but there is no upper limit on the number of blocks per frame. |
326 | |
327 | The structure of a block is as follows: |
328 | |
329 | | `Block_Header` | `Block_Content` | |
330 | |:--------------:|:---------------:| |
331 | | 3 bytes | n bytes | |
332 | |
333 | __`Block_Header`__ |
334 | |
335 | `Block_Header` uses 3 bytes, written using __little-endian__ convention. |
336 | It contains 3 fields : |
337 | |
338 | | `Last_Block` | `Block_Type` | `Block_Size` | |
339 | |:------------:|:------------:|:------------:| |
340 | | bit 0 | bits 1-2 | bits 3-23 | |
341 | |
342 | __`Last_Block`__ |
343 | |
344 | The lowest bit signals if this block is the last one. |
345 | The frame will end after this last block. |
346 | It may be followed by an optional `Content_Checksum` |
347 | (see [Zstandard Frames](#zstandard-frames)). |
348 | |
349 | __`Block_Type`__ |
350 | |
351 | The next 2 bits represent the `Block_Type`. |
352 | `Block_Type` influences the meaning of `Block_Size`. |
353 | There are 4 block types : |
354 | |
355 | | Value | 0 | 1 | 2 | 3 | |
356 | | ------------ | ----------- | ----------- | ------------------ | --------- | |
357 | | `Block_Type` | `Raw_Block` | `RLE_Block` | `Compressed_Block` | `Reserved`| |
358 | |
359 | - `Raw_Block` - this is an uncompressed block. |
360 | `Block_Content` contains `Block_Size` bytes. |
361 | |
362 | - `RLE_Block` - this is a single byte, repeated `Block_Size` times. |
363 | `Block_Content` consists of a single byte. |
364 | On the decompression side, this byte must be repeated `Block_Size` times. |
365 | |
366 | - `Compressed_Block` - this is a [Zstandard compressed block](#compressed-blocks), |
367 | explained later on. |
368 | `Block_Size` is the length of `Block_Content`, the compressed data. |
369 | The decompressed size is not known, |
370 | but its maximum possible value is guaranteed (see below) |
371 | |
372 | - `Reserved` - this is not a block. |
373 | This value cannot be used with current version of this specification. |
374 | If such a value is present, it is considered corrupted data. |
375 | |
376 | __`Block_Size`__ |
377 | |
378 | The upper 21 bits of `Block_Header` represent the `Block_Size`. |
379 | |
380 | When `Block_Type` is `Compressed_Block` or `Raw_Block`, |
381 | `Block_Size` is the size of `Block_Content` (hence excluding `Block_Header`). |
382 | |
383 | When `Block_Type` is `RLE_Block`, since `Block_Content`’s size is always 1, |
384 | `Block_Size` represents the number of times this byte must be repeated. |
385 | |
386 | `Block_Size` is limited by `Block_Maximum_Size` (see below). |
387 | |
388 | __`Block_Content`__ and __`Block_Maximum_Size`__ |
389 | |
390 | The size of `Block_Content` is limited by `Block_Maximum_Size`, |
391 | which is the smallest of: |
392 | - `Window_Size` |
393 | - 128 KB |
394 | |
395 | `Block_Maximum_Size` is constant for a given frame. |
396 | This maximum is applicable to both the decompressed size |
397 | and the compressed size of any block in the frame. |
398 | |
399 | The reasoning for this limit is that a decoder can read this information |
400 | at the beginning of a frame and use it to allocate buffers. |
401 | The guarantees on the size of blocks ensure that |
402 | the buffers will be large enough for any following block of the valid frame. |
403 | |
404 | |
405 | Compressed Blocks |
406 | ----------------- |
407 | To decompress a compressed block, the compressed size must be provided |
408 | from `Block_Size` field within `Block_Header`. |
409 | |
410 | A compressed block consists of 2 sections : |
411 | - [Literals Section](#literals-section) |
412 | - [Sequences Section](#sequences-section) |
413 | |
414 | The results of the two sections are then combined to produce the decompressed |
415 | data in [Sequence Execution](#sequence-execution) |
416 | |
417 | #### Prerequisites |
418 | To decode a compressed block, the following elements are necessary : |
419 | - Previous decoded data, up to a distance of `Window_Size`, |
420 | or beginning of the Frame, whichever is smaller. |
421 | - List of "recent offsets" from previous `Compressed_Block`. |
422 | - The previous Huffman tree, required by `Treeless_Literals_Block` type |
423 | - Previous FSE decoding tables, required by `Repeat_Mode` |
424 | for each symbol type (literals lengths, match lengths, offsets) |
425 | |
426 | Note that decoding tables aren't always from the previous `Compressed_Block`. |
427 | |
428 | - Every decoding table can come from a dictionary. |
429 | - The Huffman tree comes from the previous `Compressed_Literals_Block`. |
430 | |
431 | Literals Section |
432 | ---------------- |
433 | All literals are regrouped in the first part of the block. |
434 | They can be decoded first, and then copied during [Sequence Execution], |
435 | or they can be decoded on the flow during [Sequence Execution]. |
436 | |
437 | Literals can be stored uncompressed or compressed using Huffman prefix codes. |
438 | When compressed, a tree description may optionally be present, |
439 | followed by 1 or 4 streams. |
440 | |
441 | | `Literals_Section_Header` | [`Huffman_Tree_Description`] | [jumpTable] | Stream1 | [Stream2] | [Stream3] | [Stream4] | |
442 | | ------------------------- | ---------------------------- | ----------- | ------- | --------- | --------- | --------- | |
443 | |
444 | |
445 | ### `Literals_Section_Header` |
446 | |
447 | Header is in charge of describing how literals are packed. |
448 | It's a byte-aligned variable-size bitfield, ranging from 1 to 5 bytes, |
449 | using __little-endian__ convention. |
450 | |
451 | | `Literals_Block_Type` | `Size_Format` | `Regenerated_Size` | [`Compressed_Size`] | |
452 | | --------------------- | ------------- | ------------------ | ------------------- | |
453 | | 2 bits | 1 - 2 bits | 5 - 20 bits | 0 - 18 bits | |
454 | |
455 | In this representation, bits on the left are the lowest bits. |
456 | |
457 | __`Literals_Block_Type`__ |
458 | |
459 | This field uses 2 lowest bits of first byte, describing 4 different block types : |
460 | |
461 | | `Literals_Block_Type` | Value | |
462 | | --------------------------- | ----- | |
463 | | `Raw_Literals_Block` | 0 | |
464 | | `RLE_Literals_Block` | 1 | |
465 | | `Compressed_Literals_Block` | 2 | |
466 | | `Treeless_Literals_Block` | 3 | |
467 | |
468 | - `Raw_Literals_Block` - Literals are stored uncompressed. |
469 | - `RLE_Literals_Block` - Literals consist of a single byte value |
470 | repeated `Regenerated_Size` times. |
471 | - `Compressed_Literals_Block` - This is a standard Huffman-compressed block, |
472 | starting with a Huffman tree description. |
473 | In this mode, there are at least 2 different literals represented in the Huffman tree description. |
474 | See details below. |
475 | - `Treeless_Literals_Block` - This is a Huffman-compressed block, |
476 | using Huffman tree _from previous Huffman-compressed literals block_. |
477 | `Huffman_Tree_Description` will be skipped. |
478 | Note: If this mode is triggered without any previous Huffman-table in the frame |
479 | (or [dictionary](#dictionary-format)), this should be treated as data corruption. |
480 | |
481 | __`Size_Format`__ |
482 | |
483 | `Size_Format` is divided into 2 families : |
484 | |
485 | - For `Raw_Literals_Block` and `RLE_Literals_Block`, |
486 | it's only necessary to decode `Regenerated_Size`. |
487 | There is no `Compressed_Size` field. |
488 | - For `Compressed_Block` and `Treeless_Literals_Block`, |
489 | it's required to decode both `Compressed_Size` |
490 | and `Regenerated_Size` (the decompressed size). |
491 | It's also necessary to decode the number of streams (1 or 4). |
492 | |
493 | For values spanning several bytes, convention is __little-endian__. |
494 | |
495 | __`Size_Format` for `Raw_Literals_Block` and `RLE_Literals_Block`__ : |
496 | |
497 | `Size_Format` uses 1 _or_ 2 bits. |
498 | Its value is : `Size_Format = (Literals_Section_Header[0]>>2) & 3` |
499 | |
500 | - `Size_Format` == 00 or 10 : `Size_Format` uses 1 bit. |
501 | `Regenerated_Size` uses 5 bits (0-31). |
502 | `Literals_Section_Header` uses 1 byte. |
503 | `Regenerated_Size = Literals_Section_Header[0]>>3` |
504 | - `Size_Format` == 01 : `Size_Format` uses 2 bits. |
505 | `Regenerated_Size` uses 12 bits (0-4095). |
506 | `Literals_Section_Header` uses 2 bytes. |
507 | `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4)` |
508 | - `Size_Format` == 11 : `Size_Format` uses 2 bits. |
509 | `Regenerated_Size` uses 20 bits (0-1048575). |
510 | `Literals_Section_Header` uses 3 bytes. |
511 | `Regenerated_Size = (Literals_Section_Header[0]>>4) + (Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)` |
512 | |
513 | Only Stream1 is present for these cases. |
514 | Note : it's allowed to represent a short value (for example `27`) |
515 | using a long format, even if it's less efficient. |
516 | |
517 | __`Size_Format` for `Compressed_Literals_Block` and `Treeless_Literals_Block`__ : |
518 | |
519 | `Size_Format` always uses 2 bits. |
520 | |
521 | - `Size_Format` == 00 : _A single stream_. |
522 | Both `Regenerated_Size` and `Compressed_Size` use 10 bits (0-1023). |
523 | `Literals_Section_Header` uses 3 bytes. |
524 | - `Size_Format` == 01 : 4 streams. |
525 | Both `Regenerated_Size` and `Compressed_Size` use 10 bits (6-1023). |
526 | `Literals_Section_Header` uses 3 bytes. |
527 | - `Size_Format` == 10 : 4 streams. |
528 | Both `Regenerated_Size` and `Compressed_Size` use 14 bits (6-16383). |
529 | `Literals_Section_Header` uses 4 bytes. |
530 | - `Size_Format` == 11 : 4 streams. |
531 | Both `Regenerated_Size` and `Compressed_Size` use 18 bits (6-262143). |
532 | `Literals_Section_Header` uses 5 bytes. |
533 | |
534 | Both `Compressed_Size` and `Regenerated_Size` fields follow __little-endian__ convention. |
535 | Note: `Compressed_Size` __includes__ the size of the Huffman Tree description |
536 | _when_ it is present. |
537 | Note 2: `Compressed_Size` can never be `==0`. |
538 | Even in single-stream scenario, assuming an empty content, it must be `>=1`, |
539 | since it contains at least the final end bit flag. |
540 | In 4-streams scenario, a valid `Compressed_Size` is necessarily `>= 10` |
541 | (6 bytes for the jump table, + 4x1 bytes for the 4 streams). |
542 | |
543 | 4 streams is faster than 1 stream in decompression speed, |
544 | by exploiting instruction level parallelism. |
545 | But it's also more expensive, |
546 | costing on average ~7.3 bytes more than the 1 stream mode, mostly from the jump table. |
547 | |
548 | In general, use the 4 streams mode when there are more literals to decode, |
549 | to favor higher decompression speeds. |
550 | Note that beyond >1KB of literals, the 4 streams mode is compulsory. |
551 | |
552 | Note that a minimum of 6 bytes is required for the 4 streams mode. |
553 | That's a technical minimum, but it's not recommended to employ the 4 streams mode |
554 | for such a small quantity, that would be wasteful. |
555 | A more practical lower bound would be around ~256 bytes. |
556 | |
557 | #### Raw Literals Block |
558 | The data in Stream1 is `Regenerated_Size` bytes long, |
559 | it contains the raw literals data to be used during [Sequence Execution]. |
560 | |
561 | #### RLE Literals Block |
562 | Stream1 consists of a single byte which should be repeated `Regenerated_Size` times |
563 | to generate the decoded literals. |
564 | |
565 | #### Compressed Literals Block and Treeless Literals Block |
566 | Both of these modes contain Huffman encoded data. |
567 | |
568 | For `Treeless_Literals_Block`, |
569 | the Huffman table comes from previously compressed literals block, |
570 | or from a dictionary. |
571 | |
572 | |
573 | ### `Huffman_Tree_Description` |
574 | This section is only present when `Literals_Block_Type` type is `Compressed_Literals_Block` (`2`). |
575 | The tree describes the weights of all literals symbols that can be present in the literals block, at least 2 and up to 256. |
576 | The format of the Huffman tree description can be found at [Huffman Tree description](#huffman-tree-description). |
577 | The size of `Huffman_Tree_Description` is determined during decoding process, |
578 | it must be used to determine where streams begin. |
579 | `Total_Streams_Size = Compressed_Size - Huffman_Tree_Description_Size`. |
580 | |
581 | |
582 | ### Jump Table |
583 | The Jump Table is only present when there are 4 Huffman-coded streams. |
584 | |
585 | Reminder : Huffman compressed data consists of either 1 or 4 streams. |
586 | |
587 | If only one stream is present, it is a single bitstream occupying the entire |
588 | remaining portion of the literals block, encoded as described in |
589 | [Huffman-Coded Streams](#huffman-coded-streams). |
590 | |
591 | If there are four streams, `Literals_Section_Header` only provided |
592 | enough information to know the decompressed and compressed sizes |
593 | of all four streams _combined_. |
594 | The decompressed size of _each_ stream is equal to `(Regenerated_Size+3)/4`, |
595 | except for the last stream which may be up to 3 bytes smaller, |
596 | to reach a total decompressed size as specified in `Regenerated_Size`. |
597 | |
598 | The compressed size of each stream is provided explicitly in the Jump Table. |
599 | Jump Table is 6 bytes long, and consists of three 2-byte __little-endian__ fields, |
600 | describing the compressed sizes of the first three streams. |
601 | `Stream4_Size` is computed from `Total_Streams_Size` minus sizes of other streams: |
602 | |
603 | `Stream4_Size = Total_Streams_Size - 6 - Stream1_Size - Stream2_Size - Stream3_Size`. |
604 | |
605 | `Stream4_Size` is necessarily `>= 1`. Therefore, |
606 | if `Total_Streams_Size < Stream1_Size + Stream2_Size + Stream3_Size + 6 + 1`, |
607 | data is considered corrupted. |
608 | |
609 | Each of these 4 bitstreams is then decoded independently as a Huffman-Coded stream, |
610 | as described in [Huffman-Coded Streams](#huffman-coded-streams) |
611 | |
612 | |
613 | Sequences Section |
614 | ----------------- |
615 | A compressed block is a succession of _sequences_ . |
616 | A sequence is a literal copy command, followed by a match copy command. |
617 | A literal copy command specifies a length. |
618 | It is the number of bytes to be copied (or extracted) from the Literals Section. |
619 | A match copy command specifies an offset and a length. |
620 | |
621 | When all _sequences_ are decoded, |
622 | if there are literals left in the _literals section_, |
623 | these bytes are added at the end of the block. |
624 | |
625 | This is described in more detail in [Sequence Execution](#sequence-execution). |
626 | |
627 | The `Sequences_Section` regroup all symbols required to decode commands. |
628 | There are 3 symbol types : literals lengths, offsets and match lengths. |
629 | They are encoded together, interleaved, in a single _bitstream_. |
630 | |
631 | The `Sequences_Section` starts by a header, |
632 | followed by optional probability tables for each symbol type, |
633 | followed by the bitstream. |
634 | |
635 | | `Sequences_Section_Header` | [`Literals_Length_Table`] | [`Offset_Table`] | [`Match_Length_Table`] | bitStream | |
636 | | -------------------------- | ------------------------- | ---------------- | ---------------------- | --------- | |
637 | |
638 | To decode the `Sequences_Section`, it's required to know its size. |
639 | Its size is deduced from the size of `Literals_Section`: |
640 | `Sequences_Section_Size = Block_Size - Literals_Section_Size`. |
641 | |
642 | |
643 | #### `Sequences_Section_Header` |
644 | |
645 | Consists of 2 items: |
646 | - `Number_of_Sequences` |
647 | - Symbol compression modes |
648 | |
649 | __`Number_of_Sequences`__ |
650 | |
651 | This is a variable size field using between 1 and 3 bytes. |
652 | Let's call its first byte `byte0`. |
653 | - `if (byte0 == 0)` : there are no sequences. |
654 | The sequence section stops there. |
655 | Decompressed content is defined entirely as Literals Section content. |
656 | The FSE tables used in `Repeat_Mode` aren't updated. |
657 | - `if (byte0 < 128)` : `Number_of_Sequences = byte0` . Uses 1 byte. |
658 | - `if (byte0 < 255)` : `Number_of_Sequences = ((byte0-128) << 8) + byte1` . Uses 2 bytes. |
659 | - `if (byte0 == 255)`: `Number_of_Sequences = byte1 + (byte2<<8) + 0x7F00` . Uses 3 bytes. |
660 | |
661 | __Symbol compression modes__ |
662 | |
663 | This is a single byte, defining the compression mode of each symbol type. |
664 | |
665 | |Bit number| 7-6 | 5-4 | 3-2 | 1-0 | |
666 | | -------- | ----------------------- | -------------- | -------------------- | ---------- | |
667 | |Field name| `Literals_Lengths_Mode` | `Offsets_Mode` | `Match_Lengths_Mode` | `Reserved` | |
668 | |
669 | The last field, `Reserved`, must be all-zeroes. |
670 | |
671 | `Literals_Lengths_Mode`, `Offsets_Mode` and `Match_Lengths_Mode` define the `Compression_Mode` of |
672 | literals lengths, offsets, and match lengths symbols respectively. |
673 | |
674 | They follow the same enumeration : |
675 | |
676 | | Value | 0 | 1 | 2 | 3 | |
677 | | ------------------ | ----------------- | ---------- | --------------------- | ------------- | |
678 | | `Compression_Mode` | `Predefined_Mode` | `RLE_Mode` | `FSE_Compressed_Mode` | `Repeat_Mode` | |
679 | |
680 | - `Predefined_Mode` : A predefined FSE distribution table is used, defined in |
681 | [default distributions](#default-distributions). |
682 | No distribution table will be present. |
683 | - `RLE_Mode` : The table description consists of a single byte, which contains the symbol's value. |
684 | This symbol will be used for all sequences. |
685 | - `FSE_Compressed_Mode` : standard FSE compression. |
686 | A distribution table will be present. |
687 | The format of this distribution table is described in [FSE Table Description](#fse-table-description). |
688 | Note that the maximum allowed accuracy log for literals length and match length tables is 9, |
689 | and the maximum accuracy log for the offsets table is 8. |
690 | `FSE_Compressed_Mode` must not be used when only one symbol is present, |
691 | `RLE_Mode` should be used instead (although any other mode will work). |
692 | - `Repeat_Mode` : The table used in the previous `Compressed_Block` with `Number_of_Sequences > 0` will be used again, |
693 | or if this is the first block, table in the dictionary will be used. |
694 | Note that this includes `RLE_mode`, so if `Repeat_Mode` follows `RLE_Mode`, the same symbol will be repeated. |
695 | It also includes `Predefined_Mode`, in which case `Repeat_Mode` will have same outcome as `Predefined_Mode`. |
696 | No distribution table will be present. |
697 | If this mode is used without any previous sequence table in the frame |
698 | (nor [dictionary](#dictionary-format)) to repeat, this should be treated as corruption. |
699 | |
700 | #### The codes for literals lengths, match lengths, and offsets. |
701 | |
702 | Each symbol is a _code_ in its own context, |
703 | which specifies `Baseline` and `Number_of_Bits` to add. |
704 | _Codes_ are FSE compressed, |
705 | and interleaved with raw additional bits in the same bitstream. |
706 | |
707 | ##### Literals length codes |
708 | |
709 | Literals length codes are values ranging from `0` to `35` included. |
710 | They define lengths from 0 to 131071 bytes. |
711 | The literals length is equal to the decoded `Baseline` plus |
712 | the result of reading `Number_of_Bits` bits from the bitstream, |
713 | as a __little-endian__ value. |
714 | |
715 | | `Literals_Length_Code` | 0-15 | |
716 | | ---------------------- | ---------------------- | |
717 | | length | `Literals_Length_Code` | |
718 | | `Number_of_Bits` | 0 | |
719 | |
720 | | `Literals_Length_Code` | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | |
721 | | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
722 | | `Baseline` | 16 | 18 | 20 | 22 | 24 | 28 | 32 | 40 | |
723 | | `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | |
724 | |
725 | | `Literals_Length_Code` | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | |
726 | | ---------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
727 | | `Baseline` | 48 | 64 | 128 | 256 | 512 | 1024 | 2048 | 4096 | |
728 | | `Number_of_Bits` | 4 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | |
729 | |
730 | | `Literals_Length_Code` | 32 | 33 | 34 | 35 | |
731 | | ---------------------- | ---- | ---- | ---- | ---- | |
732 | | `Baseline` | 8192 |16384 |32768 |65536 | |
733 | | `Number_of_Bits` | 13 | 14 | 15 | 16 | |
734 | |
735 | |
736 | ##### Match length codes |
737 | |
738 | Match length codes are values ranging from `0` to `52` included. |
739 | They define lengths from 3 to 131074 bytes. |
740 | The match length is equal to the decoded `Baseline` plus |
741 | the result of reading `Number_of_Bits` bits from the bitstream, |
742 | as a __little-endian__ value. |
743 | |
744 | | `Match_Length_Code` | 0-31 | |
745 | | ------------------- | ----------------------- | |
746 | | value | `Match_Length_Code` + 3 | |
747 | | `Number_of_Bits` | 0 | |
748 | |
749 | | `Match_Length_Code` | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | |
750 | | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
751 | | `Baseline` | 35 | 37 | 39 | 41 | 43 | 47 | 51 | 59 | |
752 | | `Number_of_Bits` | 1 | 1 | 1 | 1 | 2 | 2 | 3 | 3 | |
753 | |
754 | | `Match_Length_Code` | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | |
755 | | ------------------- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | ---- | |
756 | | `Baseline` | 67 | 83 | 99 | 131 | 259 | 515 | 1027 | 2051 | |
757 | | `Number_of_Bits` | 4 | 4 | 5 | 7 | 8 | 9 | 10 | 11 | |
758 | |
759 | | `Match_Length_Code` | 48 | 49 | 50 | 51 | 52 | |
760 | | ------------------- | ---- | ---- | ---- | ---- | ---- | |
761 | | `Baseline` | 4099 | 8195 |16387 |32771 |65539 | |
762 | | `Number_of_Bits` | 12 | 13 | 14 | 15 | 16 | |
763 | |
764 | ##### Offset codes |
765 | |
766 | Offset codes are values ranging from `0` to `N`. |
767 | |
768 | A decoder is free to limit its maximum `N` supported. |
769 | Recommendation is to support at least up to `22`. |
770 | For information, at the time of this writing. |
771 | the reference decoder supports a maximum `N` value of `31`. |
772 | |
773 | An offset code is also the number of additional bits to read in __little-endian__ fashion, |
774 | and can be translated into an `Offset_Value` using the following formulas : |
775 | |
776 | ``` |
777 | Offset_Value = (1 << offsetCode) + readNBits(offsetCode); |
778 | if (Offset_Value > 3) offset = Offset_Value - 3; |
779 | ``` |
780 | It means that maximum `Offset_Value` is `(2^(N+1))-1` |
781 | supporting back-reference distances up to `(2^(N+1))-4`, |
782 | but is limited by [maximum back-reference distance](#window_descriptor). |
783 | |
784 | `Offset_Value` from 1 to 3 are special : they define "repeat codes". |
785 | This is described in more detail in [Repeat Offsets](#repeat-offsets). |
786 | |
787 | #### Decoding Sequences |
788 | FSE bitstreams are read in reverse direction than written. In zstd, |
789 | the compressor writes bits forward into a block and the decompressor |
790 | must read the bitstream _backwards_. |
791 | |
792 | To find the start of the bitstream it is therefore necessary to |
793 | know the offset of the last byte of the block which can be found |
794 | by counting `Block_Size` bytes after the block header. |
795 | |
796 | After writing the last bit containing information, the compressor |
797 | writes a single `1`-bit and then fills the byte with 0-7 `0` bits of |
798 | padding. The last byte of the compressed bitstream cannot be `0` for |
799 | that reason. |
800 | |
801 | When decompressing, the last byte containing the padding is the first |
802 | byte to read. The decompressor needs to skip 0-7 initial `0`-bits and |
803 | the first `1`-bit it occurs. Afterwards, the useful part of the bitstream |
804 | begins. |
805 | |
806 | FSE decoding requires a 'state' to be carried from symbol to symbol. |
807 | For more explanation on FSE decoding, see the [FSE section](#fse). |
808 | |
809 | For sequence decoding, a separate state keeps track of each |
810 | literal lengths, offsets, and match lengths symbols. |
811 | Some FSE primitives are also used. |
812 | For more details on the operation of these primitives, see the [FSE section](#fse). |
813 | |
814 | ##### Starting states |
815 | The bitstream starts with initial FSE state values, |
816 | each using the required number of bits in their respective _accuracy_, |
817 | decoded previously from their normalized distribution. |
818 | |
819 | It starts by `Literals_Length_State`, |
820 | followed by `Offset_State`, |
821 | and finally `Match_Length_State`. |
822 | |
823 | Reminder : always keep in mind that all values are read _backward_, |
824 | so the 'start' of the bitstream is at the highest position in memory, |
825 | immediately before the last `1`-bit for padding. |
826 | |
827 | After decoding the starting states, a single sequence is decoded |
828 | `Number_Of_Sequences` times. |
829 | These sequences are decoded in order from first to last. |
830 | Since the compressor writes the bitstream in the forward direction, |
831 | this means the compressor must encode the sequences starting with the last |
832 | one and ending with the first. |
833 | |
834 | ##### Decoding a sequence |
835 | For each of the symbol types, the FSE state can be used to determine the appropriate code. |
836 | The code then defines the `Baseline` and `Number_of_Bits` to read for each type. |
837 | See the [description of the codes] for how to determine these values. |
838 | |
839 | [description of the codes]: #the-codes-for-literals-lengths-match-lengths-and-offsets |
840 | |
841 | Decoding starts by reading the `Number_of_Bits` required to decode `Offset`. |
842 | It then does the same for `Match_Length`, and then for `Literals_Length`. |
843 | This sequence is then used for [sequence execution](#sequence-execution). |
844 | |
845 | If it is not the last sequence in the block, |
846 | the next operation is to update states. |
847 | Using the rules pre-calculated in the decoding tables, |
848 | `Literals_Length_State` is updated, |
849 | followed by `Match_Length_State`, |
850 | and then `Offset_State`. |
851 | See the [FSE section](#fse) for details on how to update states from the bitstream. |
852 | |
853 | This operation will be repeated `Number_of_Sequences` times. |
854 | At the end, the bitstream shall be entirely consumed, |
855 | otherwise the bitstream is considered corrupted. |
856 | |
857 | #### Default Distributions |
858 | If `Predefined_Mode` is selected for a symbol type, |
859 | its FSE decoding table is generated from a predefined distribution table defined here. |
860 | For details on how to convert this distribution into a decoding table, see the [FSE section]. |
861 | |
862 | [FSE section]: #from-normalized-distribution-to-decoding-tables |
863 | |
864 | ##### Literals Length |
865 | The decoding table uses an accuracy log of 6 bits (64 states). |
866 | ``` |
867 | short literalsLength_defaultDistribution[36] = |
868 | { 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, |
869 | 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1, |
870 | -1,-1,-1,-1 }; |
871 | ``` |
872 | |
873 | ##### Match Length |
874 | The decoding table uses an accuracy log of 6 bits (64 states). |
875 | ``` |
876 | short matchLengths_defaultDistribution[53] = |
877 | { 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, |
878 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
879 | 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1, |
880 | -1,-1,-1,-1,-1 }; |
881 | ``` |
882 | |
883 | ##### Offset Codes |
884 | The decoding table uses an accuracy log of 5 bits (32 states), |
885 | and supports a maximum `N` value of 28, allowing offset values up to 536,870,908 . |
886 | |
887 | If any sequence in the compressed block requires a larger offset than this, |
888 | it's not possible to use the default distribution to represent it. |
889 | ``` |
890 | short offsetCodes_defaultDistribution[29] = |
891 | { 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, |
892 | 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1 }; |
893 | ``` |
894 | |
895 | |
896 | Sequence Execution |
897 | ------------------ |
898 | Once literals and sequences have been decoded, |
899 | they are combined to produce the decoded content of a block. |
900 | |
901 | Each sequence consists of a tuple of (`literals_length`, `offset_value`, `match_length`), |
902 | decoded as described in the [Sequences Section](#sequences-section). |
903 | To execute a sequence, first copy `literals_length` bytes |
904 | from the decoded literals to the output. |
905 | |
906 | Then `match_length` bytes are copied from previous decoded data. |
907 | The offset to copy from is determined by `offset_value`: |
908 | if `offset_value > 3`, then the offset is `offset_value - 3`. |
909 | If `offset_value` is from 1-3, the offset is a special repeat offset value. |
910 | See the [repeat offset](#repeat-offsets) section for how the offset is determined |
911 | in this case. |
912 | |
913 | The offset is defined as from the current position, so an offset of 6 |
914 | and a match length of 3 means that 3 bytes should be copied from 6 bytes back. |
915 | Note that all offsets leading to previously decoded data |
916 | must be smaller than `Window_Size` defined in `Frame_Header_Descriptor`. |
917 | |
918 | #### Repeat offsets |
919 | As seen in [Sequence Execution](#sequence-execution), |
920 | the first 3 values define a repeated offset and we will call them |
921 | `Repeated_Offset1`, `Repeated_Offset2`, and `Repeated_Offset3`. |
922 | They are sorted in recency order, with `Repeated_Offset1` meaning "most recent one". |
923 | |
924 | If `offset_value == 1`, then the offset used is `Repeated_Offset1`, etc. |
925 | |
926 | There is an exception though, when current sequence's `literals_length = 0`. |
927 | In this case, repeated offsets are shifted by one, |
928 | so an `offset_value` of 1 means `Repeated_Offset2`, |
929 | an `offset_value` of 2 means `Repeated_Offset3`, |
930 | and an `offset_value` of 3 means `Repeated_Offset1 - 1_byte`. |
931 | |
932 | For the first block, the starting offset history is populated with following values : |
933 | `Repeated_Offset1`=1, `Repeated_Offset2`=4, `Repeated_Offset3`=8, |
934 | unless a dictionary is used, in which case they come from the dictionary. |
935 | |
936 | Then each block gets its starting offset history from the ending values of the most recent `Compressed_Block`. |
937 | Note that blocks which are not `Compressed_Block` are skipped, they do not contribute to offset history. |
938 | |
939 | [Offset Codes]: #offset-codes |
940 | |
941 | ###### Offset updates rules |
942 | |
943 | During the execution of the sequences of a `Compressed_Block`, the |
944 | `Repeated_Offsets`' values are kept up to date, so that they always represent |
945 | the three most-recently used offsets. In order to achieve that, they are |
946 | updated after executing each sequence in the following way: |
947 | |
948 | When the sequence's `offset_value` does not refer to one of the |
949 | `Repeated_Offsets`--when it has value greater than 3, or when it has value 3 |
950 | and the sequence's `literals_length` is zero--the `Repeated_Offsets`' values |
951 | are shifted back one, and `Repeated_Offset1` takes on the value of the |
952 | just-used offset. |
953 | |
954 | Otherwise, when the sequence's `offset_value` refers to one of the |
955 | `Repeated_Offsets`--when it has value 1 or 2, or when it has value 3 and the |
956 | sequence's `literals_length` is non-zero--the `Repeated_Offsets` are re-ordered |
957 | so that `Repeated_Offset1` takes on the value of the used Repeated_Offset, and |
958 | the existing values are pushed back from the first `Repeated_Offset` through to |
959 | the `Repeated_Offset` selected by the `offset_value`. This effectively performs |
960 | a single-stepped wrapping rotation of the values of these offsets, so that |
961 | their order again reflects the recency of their use. |
962 | |
963 | The following table shows the values of the `Repeated_Offsets` as a series of |
964 | sequences are applied to them: |
965 | |
966 | | `offset_value` | `literals_length` | `Repeated_Offset1` | `Repeated_Offset2` | `Repeated_Offset3` | Comment | |
967 | |:--------------:|:-----------------:|:------------------:|:------------------:|:------------------:|:-----------------------:| |
968 | | | | 1 | 4 | 8 | starting values | |
969 | | 1114 | 11 | 1111 | 1 | 4 | non-repeat | |
970 | | 1 | 22 | 1111 | 1 | 4 | repeat 1: no change | |
971 | | 2225 | 22 | 2222 | 1111 | 1 | non-repeat | |
972 | | 1114 | 111 | 1111 | 2222 | 1111 | non-repeat | |
973 | | 3336 | 33 | 3333 | 1111 | 2222 | non-repeat | |
974 | | 2 | 22 | 1111 | 3333 | 2222 | repeat 2: swap 1 & 2 | |
975 | | 3 | 33 | 2222 | 1111 | 3333 | repeat 3: rotate 3 to 1 | |
976 | | 3 | 0 | 2221 | 2222 | 1111 | special case : insert `repeat1 - 1` | |
977 | | 1 | 0 | 2222 | 2221 | 1111 | == repeat 2 | |
978 | |
979 | |
980 | Skippable Frames |
981 | ---------------- |
982 | |
983 | | `Magic_Number` | `Frame_Size` | `User_Data` | |
984 | |:--------------:|:------------:|:-----------:| |
985 | | 4 bytes | 4 bytes | n bytes | |
986 | |
987 | Skippable frames allow the insertion of user-defined metadata |
988 | into a flow of concatenated frames. |
989 | |
990 | Skippable frames defined in this specification are compatible with [LZ4] ones. |
991 | |
992 | [LZ4]:https://lz4.github.io/lz4/ |
993 | |
994 | From a compliant decoder perspective, skippable frames need just be skipped, |
995 | and their content ignored, resuming decoding after the skippable frame. |
996 | |
997 | It can be noted that a skippable frame |
998 | can be used to watermark a stream of concatenated frames |
999 | embedding any kind of tracking information (even just a UUID). |
1000 | Users wary of such possibility should scan the stream of concatenated frames |
1001 | in an attempt to detect such frame for analysis or removal. |
1002 | |
1003 | __`Magic_Number`__ |
1004 | |
1005 | 4 Bytes, __little-endian__ format. |
1006 | Value : 0x184D2A5?, which means any value from 0x184D2A50 to 0x184D2A5F. |
1007 | All 16 values are valid to identify a skippable frame. |
1008 | This specification doesn't detail any specific tagging for skippable frames. |
1009 | |
1010 | __`Frame_Size`__ |
1011 | |
1012 | This is the size, in bytes, of the following `User_Data` |
1013 | (without including the magic number nor the size field itself). |
1014 | This field is represented using 4 Bytes, __little-endian__ format, unsigned 32-bits. |
1015 | This means `User_Data` can’t be bigger than (2^32-1) bytes. |
1016 | |
1017 | __`User_Data`__ |
1018 | |
1019 | The `User_Data` can be anything. Data will just be skipped by the decoder. |
1020 | |
1021 | |
1022 | |
1023 | Entropy Encoding |
1024 | ---------------- |
1025 | Two types of entropy encoding are used by the Zstandard format: |
1026 | FSE, and Huffman coding. |
1027 | Huffman is used to compress literals, |
1028 | while FSE is used for all other symbols |
1029 | (`Literals_Length_Code`, `Match_Length_Code`, offset codes) |
1030 | and to compress Huffman headers. |
1031 | |
1032 | |
1033 | FSE |
1034 | --- |
1035 | FSE, short for Finite State Entropy, is an entropy codec based on [ANS]. |
1036 | FSE encoding/decoding involves a state that is carried over between symbols, |
1037 | so decoding must be done in the opposite direction as encoding. |
1038 | Therefore, all FSE bitstreams are read from end to beginning. |
1039 | Note that the order of the bits in the stream is not reversed, |
1040 | we just read the elements in the reverse order they are written. |
1041 | |
1042 | For additional details on FSE, see [Finite State Entropy]. |
1043 | |
1044 | [Finite State Entropy]:https://github.com/Cyan4973/FiniteStateEntropy/ |
1045 | |
1046 | FSE decoding involves a decoding table which has a power of 2 size, and contain three elements: |
1047 | `Symbol`, `Num_Bits`, and `Baseline`. |
1048 | The `log2` of the table size is its `Accuracy_Log`. |
1049 | An FSE state value represents an index in this table. |
1050 | |
1051 | To obtain the initial state value, consume `Accuracy_Log` bits from the stream as a __little-endian__ value. |
1052 | The next symbol in the stream is the `Symbol` indicated in the table for that state. |
1053 | To obtain the next state value, |
1054 | the decoder should consume `Num_Bits` bits from the stream as a __little-endian__ value and add it to `Baseline`. |
1055 | |
1056 | [ANS]: https://en.wikipedia.org/wiki/Asymmetric_Numeral_Systems |
1057 | |
1058 | ### FSE Table Description |
1059 | To decode FSE streams, it is necessary to construct the decoding table. |
1060 | The Zstandard format encodes FSE table descriptions as follows: |
1061 | |
1062 | An FSE distribution table describes the probabilities of all symbols |
1063 | from `0` to the last present one (included) |
1064 | on a normalized scale of `1 << Accuracy_Log` . |
1065 | Note that there must be two or more symbols with nonzero probability. |
1066 | |
1067 | It's a bitstream which is read forward, in __little-endian__ fashion. |
1068 | It's not necessary to know bitstream exact size, |
1069 | it will be discovered and reported by the decoding process. |
1070 | |
1071 | The bitstream starts by reporting on which scale it operates. |
1072 | Let's `low4Bits` designate the lowest 4 bits of the first byte : |
1073 | `Accuracy_Log = low4bits + 5`. |
1074 | |
1075 | Then follows each symbol value, from `0` to last present one. |
1076 | The number of bits used by each field is variable. |
1077 | It depends on : |
1078 | |
1079 | - Remaining probabilities + 1 : |
1080 | __example__ : |
1081 | Presuming an `Accuracy_Log` of 8, |
1082 | and presuming 100 probabilities points have already been distributed, |
1083 | the decoder may read any value from `0` to `256 - 100 + 1 == 157` (inclusive). |
1084 | Therefore, it must read `log2sup(157) == 8` bits. |
1085 | |
1086 | - Value decoded : small values use 1 less bit : |
1087 | __example__ : |
1088 | Presuming values from 0 to 157 (inclusive) are possible, |
1089 | 255-157 = 98 values are remaining in an 8-bits field. |
1090 | They are used this way : |
1091 | first 98 values (hence from 0 to 97) use only 7 bits, |
1092 | values from 98 to 157 use 8 bits. |
1093 | This is achieved through this scheme : |
1094 | |
1095 | | Value read | Value decoded | Number of bits used | |
1096 | | ---------- | ------------- | ------------------- | |
1097 | | 0 - 97 | 0 - 97 | 7 | |
1098 | | 98 - 127 | 98 - 127 | 8 | |
1099 | | 128 - 225 | 0 - 97 | 7 | |
1100 | | 226 - 255 | 128 - 157 | 8 | |
1101 | |
1102 | Symbols probabilities are read one by one, in order. |
1103 | |
1104 | Probability is obtained from Value decoded by following formula : |
1105 | `Proba = value - 1` |
1106 | |
1107 | It means value `0` becomes negative probability `-1`. |
1108 | `-1` is a special probability, which means "less than 1". |
1109 | Its effect on distribution table is described in the [next section]. |
1110 | For the purpose of calculating total allocated probability points, it counts as one. |
1111 | |
1112 | [next section]:#from-normalized-distribution-to-decoding-tables |
1113 | |
1114 | When a symbol has a __probability__ of `zero`, |
1115 | it is followed by a 2-bits repeat flag. |
1116 | This repeat flag tells how many probabilities of zeroes follow the current one. |
1117 | It provides a number ranging from 0 to 3. |
1118 | If it is a 3, another 2-bits repeat flag follows, and so on. |
1119 | |
1120 | When last symbol reaches cumulated total of `1 << Accuracy_Log`, |
1121 | decoding is complete. |
1122 | If the last symbol makes cumulated total go above `1 << Accuracy_Log`, |
1123 | distribution is considered corrupted. |
1124 | |
1125 | Then the decoder can tell how many bytes were used in this process, |
1126 | and how many symbols are present. |
1127 | The bitstream consumes a round number of bytes. |
1128 | Any remaining bit within the last byte is just unused. |
1129 | |
1130 | #### From normalized distribution to decoding tables |
1131 | |
1132 | The distribution of normalized probabilities is enough |
1133 | to create a unique decoding table. |
1134 | |
1135 | It follows the following build rule : |
1136 | |
1137 | The table has a size of `Table_Size = 1 << Accuracy_Log`. |
1138 | Each cell describes the symbol decoded, |
1139 | and instructions to get the next state (`Number_of_Bits` and `Baseline`). |
1140 | |
1141 | Symbols are scanned in their natural order for "less than 1" probabilities. |
1142 | Symbols with this probability are being attributed a single cell, |
1143 | starting from the end of the table and retreating. |
1144 | These symbols define a full state reset, reading `Accuracy_Log` bits. |
1145 | |
1146 | Then, all remaining symbols, sorted in natural order, are allocated cells. |
1147 | Starting from symbol `0` (if it exists), and table position `0`, |
1148 | each symbol gets allocated as many cells as its probability. |
1149 | Cell allocation is spread, not linear : |
1150 | each successor position follows this rule : |
1151 | |
1152 | ``` |
1153 | position += (tableSize>>1) + (tableSize>>3) + 3; |
1154 | position &= tableSize-1; |
1155 | ``` |
1156 | |
1157 | A position is skipped if already occupied by a "less than 1" probability symbol. |
1158 | `position` does not reset between symbols, it simply iterates through |
1159 | each position in the table, switching to the next symbol when enough |
1160 | states have been allocated to the current one. |
1161 | |
1162 | The process guarantees that the table is entirely filled. |
1163 | Each cell corresponds to a state value, which contains the symbol being decoded. |
1164 | |
1165 | To add the `Number_of_Bits` and `Baseline` required to retrieve next state, |
1166 | it's first necessary to sort all occurrences of each symbol in state order. |
1167 | Lower states will need 1 more bit than higher ones. |
1168 | The process is repeated for each symbol. |
1169 | |
1170 | __Example__ : |
1171 | Presuming a symbol has a probability of 5, |
1172 | it receives 5 cells, corresponding to 5 state values. |
1173 | These state values are then sorted in natural order. |
1174 | |
1175 | Next power of 2 after 5 is 8. |
1176 | Space of probabilities must be divided into 8 equal parts. |
1177 | Presuming the `Accuracy_Log` is 7, it defines a space of 128 states. |
1178 | Divided by 8, each share is 16 large. |
1179 | |
1180 | In order to reach 8 shares, 8-5=3 lowest states will count "double", |
1181 | doubling their shares (32 in width), hence requiring one more bit. |
1182 | |
1183 | Baseline is assigned starting from the higher states using fewer bits, |
1184 | increasing at each state, then resuming at the first state, |
1185 | each state takes its allocated width from Baseline. |
1186 | |
1187 | | state value | 1 | 39 | 77 | 84 | 122 | |
1188 | | state order | 0 | 1 | 2 | 3 | 4 | |
1189 | | ---------------- | ----- | ----- | ------ | ---- | ------ | |
1190 | | width | 32 | 32 | 32 | 16 | 16 | |
1191 | | `Number_of_Bits` | 5 | 5 | 5 | 4 | 4 | |
1192 | | range number | 2 | 4 | 6 | 0 | 1 | |
1193 | | `Baseline` | 32 | 64 | 96 | 0 | 16 | |
1194 | | range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 | |
1195 | |
1196 | During decoding, the next state value is determined from current state value, |
1197 | by reading the required `Number_of_Bits`, and adding the specified `Baseline`. |
1198 | |
1199 | See [Appendix A] for the results of this process applied to the default distributions. |
1200 | |
1201 | [Appendix A]: #appendix-a---decoding-tables-for-predefined-codes |
1202 | |
1203 | |
1204 | Huffman Coding |
1205 | -------------- |
1206 | Zstandard Huffman-coded streams are read backwards, |
1207 | similar to the FSE bitstreams. |
1208 | Therefore, to find the start of the bitstream, it is required to |
1209 | know the offset of the last byte of the Huffman-coded stream. |
1210 | |
1211 | After writing the last bit containing information, the compressor |
1212 | writes a single `1`-bit and then fills the byte with 0-7 `0` bits of |
1213 | padding. The last byte of the compressed bitstream cannot be `0` for |
1214 | that reason. |
1215 | |
1216 | When decompressing, the last byte containing the padding is the first |
1217 | byte to read. The decompressor needs to skip 0-7 initial `0`-bits and |
1218 | the first `1`-bit it occurs. Afterwards, the useful part of the bitstream |
1219 | begins. |
1220 | |
1221 | The bitstream contains Huffman-coded symbols in __little-endian__ order, |
1222 | with the codes defined by the method below. |
1223 | |
1224 | ### Huffman Tree Description |
1225 | |
1226 | Prefix coding represents symbols from an a priori known alphabet |
1227 | by bit sequences (codewords), one codeword for each symbol, |
1228 | in a manner such that different symbols may be represented |
1229 | by bit sequences of different lengths, |
1230 | but a parser can always parse an encoded string |
1231 | unambiguously symbol-by-symbol. |
1232 | |
1233 | Given an alphabet with known symbol frequencies, |
1234 | the Huffman algorithm allows the construction of an optimal prefix code |
1235 | using the fewest bits of any possible prefix codes for that alphabet. |
1236 | |
1237 | Prefix code must not exceed a maximum code length. |
1238 | More bits improve accuracy but cost more header size, |
1239 | and require more memory or more complex decoding operations. |
1240 | This specification limits maximum code length to 11 bits. |
1241 | |
1242 | #### Representation |
1243 | |
1244 | All literal values from zero (included) to last present one (excluded) |
1245 | are represented by `Weight` with values from `0` to `Max_Number_of_Bits`. |
1246 | Transformation from `Weight` to `Number_of_Bits` follows this formula : |
1247 | ``` |
1248 | Number_of_Bits = Weight ? (Max_Number_of_Bits + 1 - Weight) : 0 |
1249 | ``` |
1250 | When a literal value is not present, it receives a `Weight` of 0. |
1251 | The least frequent symbol receives a `Weight` of 1. |
1252 | Consequently, the `Weight` 1 is necessarily present. |
1253 | The most frequent symbol receives a `Weight` anywhere between 1 and 11 (max). |
1254 | The last symbol's `Weight` is deduced from previously retrieved Weights, |
1255 | by completing to the nearest power of 2. It's necessarily non 0. |
1256 | If it's not possible to reach a clean power of 2 with a single `Weight` value, |
1257 | the Huffman Tree Description is considered invalid. |
1258 | This final power of 2 gives `Max_Number_of_Bits`, the depth of the current tree. |
1259 | `Max_Number_of_Bits` must be <= 11, |
1260 | otherwise the representation is considered corrupted. |
1261 | |
1262 | __Example__ : |
1263 | Let's presume the following Huffman tree must be described : |
1264 | |
1265 | | literal value | 0 | 1 | 2 | 3 | 4 | 5 | |
1266 | | ---------------- | --- | --- | --- | --- | --- | --- | |
1267 | | `Number_of_Bits` | 1 | 2 | 3 | 0 | 4 | 4 | |
1268 | |
1269 | The tree depth is 4, since its longest elements uses 4 bits |
1270 | (longest elements are the one with smallest frequency). |
1271 | Literal value `5` will not be listed, as it can be determined from previous values 0-4, |
1272 | nor will values above `5` as they are all 0. |
1273 | Values from `0` to `4` will be listed using `Weight` instead of `Number_of_Bits`. |
1274 | Weight formula is : |
1275 | ``` |
1276 | Weight = Number_of_Bits ? (Max_Number_of_Bits + 1 - Number_of_Bits) : 0 |
1277 | ``` |
1278 | It gives the following series of weights : |
1279 | |
1280 | | literal value | 0 | 1 | 2 | 3 | 4 | |
1281 | | ------------- | --- | --- | --- | --- | --- | |
1282 | | `Weight` | 4 | 3 | 2 | 0 | 1 | |
1283 | |
1284 | The decoder will do the inverse operation : |
1285 | having collected weights of literal symbols from `0` to `4`, |
1286 | it knows the last literal, `5`, is present with a non-zero `Weight`. |
1287 | The `Weight` of `5` can be determined by advancing to the next power of 2. |
1288 | The sum of `2^(Weight-1)` (excluding 0's) is : |
1289 | `8 + 4 + 2 + 0 + 1 = 15`. |
1290 | Nearest larger power of 2 value is 16. |
1291 | Therefore, `Max_Number_of_Bits = 4` and `Weight[5] = log_2(16 - 15) + 1 = 1`. |
1292 | |
1293 | #### Huffman Tree header |
1294 | |
1295 | This is a single byte value (0-255), |
1296 | which describes how the series of weights is encoded. |
1297 | |
1298 | - if `headerByte` < 128 : |
1299 | the series of weights is compressed using FSE (see below). |
1300 | The length of the FSE-compressed series is equal to `headerByte` (0-127). |
1301 | |
1302 | - if `headerByte` >= 128 : |
1303 | + the series of weights uses a direct representation, |
1304 | where each `Weight` is encoded directly as a 4 bits field (0-15). |
1305 | + They are encoded forward, 2 weights to a byte, |
1306 | first weight taking the top four bits and second one taking the bottom four. |
1307 | * e.g. the following operations could be used to read the weights: |
1308 | `Weight[0] = (Byte[0] >> 4), Weight[1] = (Byte[0] & 0xf)`, etc. |
1309 | + The full representation occupies `Ceiling(Number_of_Weights/2)` bytes, |
1310 | meaning it uses only full bytes even if `Number_of_Weights` is odd. |
1311 | + `Number_of_Weights = headerByte - 127`. |
1312 | * Note that maximum `Number_of_Weights` is 255-127 = 128, |
1313 | therefore, only up to 128 `Weight` can be encoded using direct representation. |
1314 | * Since the last non-zero `Weight` is _not_ encoded, |
1315 | this scheme is compatible with alphabet sizes of up to 129 symbols, |
1316 | hence including literal symbol 128. |
1317 | * If any literal symbol > 128 has a non-zero `Weight`, |
1318 | direct representation is not possible. |
1319 | In such case, it's necessary to use FSE compression. |
1320 | |
1321 | |
1322 | #### Finite State Entropy (FSE) compression of Huffman weights |
1323 | |
1324 | In this case, the series of Huffman weights is compressed using FSE compression. |
1325 | It's a single bitstream with 2 interleaved states, |
1326 | sharing a single distribution table. |
1327 | |
1328 | To decode an FSE bitstream, it is necessary to know its compressed size. |
1329 | Compressed size is provided by `headerByte`. |
1330 | It's also necessary to know its _maximum possible_ decompressed size, |
1331 | which is `255`, since literal values span from `0` to `255`, |
1332 | and last symbol's `Weight` is not represented. |
1333 | |
1334 | An FSE bitstream starts by a header, describing probabilities distribution. |
1335 | It will create a Decoding Table. |
1336 | For a list of Huffman weights, the maximum accuracy log is 6 bits. |
1337 | For more description see the [FSE header description](#fse-table-description) |
1338 | |
1339 | The Huffman header compression uses 2 states, |
1340 | which share the same FSE distribution table. |
1341 | The first state (`State1`) encodes the even indexed symbols, |
1342 | and the second (`State2`) encodes the odd indexed symbols. |
1343 | `State1` is initialized first, and then `State2`, and they take turns |
1344 | decoding a single symbol and updating their state. |
1345 | For more details on these FSE operations, see the [FSE section](#fse). |
1346 | |
1347 | The number of symbols to decode is determined |
1348 | by tracking bitStream overflow condition: |
1349 | If updating state after decoding a symbol would require more bits than |
1350 | remain in the stream, it is assumed that extra bits are 0. Then, |
1351 | symbols for each of the final states are decoded and the process is complete. |
1352 | |
1353 | #### Conversion from weights to Huffman prefix codes |
1354 | |
1355 | All present symbols shall now have a `Weight` value. |
1356 | It is possible to transform weights into `Number_of_Bits`, using this formula: |
1357 | ``` |
1358 | Number_of_Bits = (Weight>0) ? Max_Number_of_Bits + 1 - Weight : 0 |
1359 | ``` |
1360 | Symbols are sorted by `Weight`. |
1361 | Within same `Weight`, symbols keep natural sequential order. |
1362 | Symbols with a `Weight` of zero are removed. |
1363 | Then, starting from lowest `Weight`, prefix codes are distributed in sequential order. |
1364 | |
1365 | __Example__ : |
1366 | Let's presume the following list of weights has been decoded : |
1367 | |
1368 | | Literal | 0 | 1 | 2 | 3 | 4 | 5 | |
1369 | | -------- | --- | --- | --- | --- | --- | --- | |
1370 | | `Weight` | 4 | 3 | 2 | 0 | 1 | 1 | |
1371 | |
1372 | Sorted by weight and then natural sequential order, |
1373 | it gives the following distribution : |
1374 | |
1375 | | Literal | 3 | 4 | 5 | 2 | 1 | 0 | |
1376 | | ---------------- | --- | --- | --- | --- | --- | ---- | |
1377 | | `Weight` | 0 | 1 | 1 | 2 | 3 | 4 | |
1378 | | `Number_of_Bits` | 0 | 4 | 4 | 3 | 2 | 1 | |
1379 | | prefix codes | N/A | 0000| 0001| 001 | 01 | 1 | |
1380 | |
1381 | ### Huffman-coded Streams |
1382 | |
1383 | Given a Huffman decoding table, |
1384 | it's possible to decode a Huffman-coded stream. |
1385 | |
1386 | Each bitstream must be read _backward_, |
1387 | that is starting from the end down to the beginning. |
1388 | Therefore it's necessary to know the size of each bitstream. |
1389 | |
1390 | It's also necessary to know exactly which _bit_ is the last one. |
1391 | This is detected by a final bit flag : |
1392 | the highest bit of latest byte is a final-bit-flag. |
1393 | Consequently, a last byte of `0` is not possible. |
1394 | And the final-bit-flag itself is not part of the useful bitstream. |
1395 | Hence, the last byte contains between 0 and 7 useful bits. |
1396 | |
1397 | Starting from the end, |
1398 | it's possible to read the bitstream in a __little-endian__ fashion, |
1399 | keeping track of already used bits. Since the bitstream is encoded in reverse |
1400 | order, starting from the end read symbols in forward order. |
1401 | |
1402 | For example, if the literal sequence "0145" was encoded using above prefix code, |
1403 | it would be encoded (in reverse order) as: |
1404 | |
1405 | |Symbol | 5 | 4 | 1 | 0 | Padding | |
1406 | |--------|------|------|----|---|---------| |
1407 | |Encoding|`0000`|`0001`|`01`|`1`| `00001` | |
1408 | |
1409 | Resulting in following 2-bytes bitstream : |
1410 | ``` |
1411 | 00010000 00001101 |
1412 | ``` |
1413 | |
1414 | Here is an alternative representation with the symbol codes separated by underscore: |
1415 | ``` |
1416 | 0001_0000 00001_1_01 |
1417 | ``` |
1418 | |
1419 | Reading highest `Max_Number_of_Bits` bits, |
1420 | it's possible to compare extracted value to decoding table, |
1421 | determining the symbol to decode and number of bits to discard. |
1422 | |
1423 | The process continues up to reading the required number of symbols per stream. |
1424 | If a bitstream is not entirely and exactly consumed, |
1425 | hence reaching exactly its beginning position with _all_ bits consumed, |
1426 | the decoding process is considered faulty. |
1427 | |
1428 | |
1429 | Dictionary Format |
1430 | ----------------- |
1431 | |
1432 | Zstandard is compatible with "raw content" dictionaries, |
1433 | free of any format restriction, except that they must be at least 8 bytes. |
1434 | These dictionaries function as if they were just the `Content` part |
1435 | of a formatted dictionary. |
1436 | |
1437 | But dictionaries created by `zstd --train` follow a format, described here. |
1438 | |
1439 | __Pre-requisites__ : a dictionary has a size, |
1440 | defined either by a buffer limit, or a file size. |
1441 | |
1442 | | `Magic_Number` | `Dictionary_ID` | `Entropy_Tables` | `Content` | |
1443 | | -------------- | --------------- | ---------------- | --------- | |
1444 | |
1445 | __`Magic_Number`__ : 4 bytes ID, value 0xEC30A437, __little-endian__ format |
1446 | |
1447 | __`Dictionary_ID`__ : 4 bytes, stored in __little-endian__ format. |
1448 | `Dictionary_ID` can be any value, except 0 (which means no `Dictionary_ID`). |
1449 | It's used by decoders to check if they use the correct dictionary. |
1450 | |
1451 | _Reserved ranges :_ |
1452 | If the dictionary is going to be distributed in a public environment, |
1453 | the following ranges of `Dictionary_ID` are reserved for some future registrar |
1454 | and shall not be used : |
1455 | |
1456 | - low range : <= 32767 |
1457 | - high range : >= (2^31) |
1458 | |
1459 | Outside of these ranges, any value of `Dictionary_ID` |
1460 | which is both `>= 32768` and `< (1<<31)` can be used freely, |
1461 | even in public environment. |
1462 | |
1463 | |
1464 | __`Entropy_Tables`__ : follow the same format as tables in [compressed blocks]. |
1465 | See the relevant [FSE](#fse-table-description) |
1466 | and [Huffman](#huffman-tree-description) sections for how to decode these tables. |
1467 | They are stored in following order : |
1468 | Huffman tables for literals, FSE table for offsets, |
1469 | FSE table for match lengths, and FSE table for literals lengths. |
1470 | These tables populate the Repeat Stats literals mode and |
1471 | Repeat distribution mode for sequence decoding. |
1472 | It's finally followed by 3 offset values, populating recent offsets (instead of using `{1,4,8}`), |
1473 | stored in order, 4-bytes __little-endian__ each, for a total of 12 bytes. |
1474 | Each recent offset must have a value <= dictionary content size, and cannot equal 0. |
1475 | |
1476 | __`Content`__ : The rest of the dictionary is its content. |
1477 | The content act as a "past" in front of data to compress or decompress, |
1478 | so it can be referenced in sequence commands. |
1479 | As long as the amount of data decoded from this frame is less than or |
1480 | equal to `Window_Size`, sequence commands may specify offsets longer |
1481 | than the total length of decoded output so far to reference back to the |
1482 | dictionary, even parts of the dictionary with offsets larger than `Window_Size`. |
1483 | After the total output has surpassed `Window_Size` however, |
1484 | this is no longer allowed and the dictionary is no longer accessible. |
1485 | |
1486 | [compressed blocks]: #the-format-of-compressed_block |
1487 | |
1488 | If a dictionary is provided by an external source, |
1489 | it should be loaded with great care, its content considered untrusted. |
1490 | |
1491 | |
1492 | |
1493 | Appendix A - Decoding tables for predefined codes |
1494 | ------------------------------------------------- |
1495 | |
1496 | This appendix contains FSE decoding tables |
1497 | for the predefined literal length, match length, and offset codes. |
1498 | The tables have been constructed using the algorithm as given above in chapter |
1499 | "from normalized distribution to decoding tables". |
1500 | The tables here can be used as examples |
1501 | to crosscheck that an implementation build its decoding tables correctly. |
1502 | |
1503 | #### Literal Length Code: |
1504 | |
1505 | | State | Symbol | Number_Of_Bits | Base | |
1506 | | ----- | ------ | -------------- | ---- | |
1507 | | 0 | 0 | 4 | 0 | |
1508 | | 1 | 0 | 4 | 16 | |
1509 | | 2 | 1 | 5 | 32 | |
1510 | | 3 | 3 | 5 | 0 | |
1511 | | 4 | 4 | 5 | 0 | |
1512 | | 5 | 6 | 5 | 0 | |
1513 | | 6 | 7 | 5 | 0 | |
1514 | | 7 | 9 | 5 | 0 | |
1515 | | 8 | 10 | 5 | 0 | |
1516 | | 9 | 12 | 5 | 0 | |
1517 | | 10 | 14 | 6 | 0 | |
1518 | | 11 | 16 | 5 | 0 | |
1519 | | 12 | 18 | 5 | 0 | |
1520 | | 13 | 19 | 5 | 0 | |
1521 | | 14 | 21 | 5 | 0 | |
1522 | | 15 | 22 | 5 | 0 | |
1523 | | 16 | 24 | 5 | 0 | |
1524 | | 17 | 25 | 5 | 32 | |
1525 | | 18 | 26 | 5 | 0 | |
1526 | | 19 | 27 | 6 | 0 | |
1527 | | 20 | 29 | 6 | 0 | |
1528 | | 21 | 31 | 6 | 0 | |
1529 | | 22 | 0 | 4 | 32 | |
1530 | | 23 | 1 | 4 | 0 | |
1531 | | 24 | 2 | 5 | 0 | |
1532 | | 25 | 4 | 5 | 32 | |
1533 | | 26 | 5 | 5 | 0 | |
1534 | | 27 | 7 | 5 | 32 | |
1535 | | 28 | 8 | 5 | 0 | |
1536 | | 29 | 10 | 5 | 32 | |
1537 | | 30 | 11 | 5 | 0 | |
1538 | | 31 | 13 | 6 | 0 | |
1539 | | 32 | 16 | 5 | 32 | |
1540 | | 33 | 17 | 5 | 0 | |
1541 | | 34 | 19 | 5 | 32 | |
1542 | | 35 | 20 | 5 | 0 | |
1543 | | 36 | 22 | 5 | 32 | |
1544 | | 37 | 23 | 5 | 0 | |
1545 | | 38 | 25 | 4 | 0 | |
1546 | | 39 | 25 | 4 | 16 | |
1547 | | 40 | 26 | 5 | 32 | |
1548 | | 41 | 28 | 6 | 0 | |
1549 | | 42 | 30 | 6 | 0 | |
1550 | | 43 | 0 | 4 | 48 | |
1551 | | 44 | 1 | 4 | 16 | |
1552 | | 45 | 2 | 5 | 32 | |
1553 | | 46 | 3 | 5 | 32 | |
1554 | | 47 | 5 | 5 | 32 | |
1555 | | 48 | 6 | 5 | 32 | |
1556 | | 49 | 8 | 5 | 32 | |
1557 | | 50 | 9 | 5 | 32 | |
1558 | | 51 | 11 | 5 | 32 | |
1559 | | 52 | 12 | 5 | 32 | |
1560 | | 53 | 15 | 6 | 0 | |
1561 | | 54 | 17 | 5 | 32 | |
1562 | | 55 | 18 | 5 | 32 | |
1563 | | 56 | 20 | 5 | 32 | |
1564 | | 57 | 21 | 5 | 32 | |
1565 | | 58 | 23 | 5 | 32 | |
1566 | | 59 | 24 | 5 | 32 | |
1567 | | 60 | 35 | 6 | 0 | |
1568 | | 61 | 34 | 6 | 0 | |
1569 | | 62 | 33 | 6 | 0 | |
1570 | | 63 | 32 | 6 | 0 | |
1571 | |
1572 | #### Match Length Code: |
1573 | |
1574 | | State | Symbol | Number_Of_Bits | Base | |
1575 | | ----- | ------ | -------------- | ---- | |
1576 | | 0 | 0 | 6 | 0 | |
1577 | | 1 | 1 | 4 | 0 | |
1578 | | 2 | 2 | 5 | 32 | |
1579 | | 3 | 3 | 5 | 0 | |
1580 | | 4 | 5 | 5 | 0 | |
1581 | | 5 | 6 | 5 | 0 | |
1582 | | 6 | 8 | 5 | 0 | |
1583 | | 7 | 10 | 6 | 0 | |
1584 | | 8 | 13 | 6 | 0 | |
1585 | | 9 | 16 | 6 | 0 | |
1586 | | 10 | 19 | 6 | 0 | |
1587 | | 11 | 22 | 6 | 0 | |
1588 | | 12 | 25 | 6 | 0 | |
1589 | | 13 | 28 | 6 | 0 | |
1590 | | 14 | 31 | 6 | 0 | |
1591 | | 15 | 33 | 6 | 0 | |
1592 | | 16 | 35 | 6 | 0 | |
1593 | | 17 | 37 | 6 | 0 | |
1594 | | 18 | 39 | 6 | 0 | |
1595 | | 19 | 41 | 6 | 0 | |
1596 | | 20 | 43 | 6 | 0 | |
1597 | | 21 | 45 | 6 | 0 | |
1598 | | 22 | 1 | 4 | 16 | |
1599 | | 23 | 2 | 4 | 0 | |
1600 | | 24 | 3 | 5 | 32 | |
1601 | | 25 | 4 | 5 | 0 | |
1602 | | 26 | 6 | 5 | 32 | |
1603 | | 27 | 7 | 5 | 0 | |
1604 | | 28 | 9 | 6 | 0 | |
1605 | | 29 | 12 | 6 | 0 | |
1606 | | 30 | 15 | 6 | 0 | |
1607 | | 31 | 18 | 6 | 0 | |
1608 | | 32 | 21 | 6 | 0 | |
1609 | | 33 | 24 | 6 | 0 | |
1610 | | 34 | 27 | 6 | 0 | |
1611 | | 35 | 30 | 6 | 0 | |
1612 | | 36 | 32 | 6 | 0 | |
1613 | | 37 | 34 | 6 | 0 | |
1614 | | 38 | 36 | 6 | 0 | |
1615 | | 39 | 38 | 6 | 0 | |
1616 | | 40 | 40 | 6 | 0 | |
1617 | | 41 | 42 | 6 | 0 | |
1618 | | 42 | 44 | 6 | 0 | |
1619 | | 43 | 1 | 4 | 32 | |
1620 | | 44 | 1 | 4 | 48 | |
1621 | | 45 | 2 | 4 | 16 | |
1622 | | 46 | 4 | 5 | 32 | |
1623 | | 47 | 5 | 5 | 32 | |
1624 | | 48 | 7 | 5 | 32 | |
1625 | | 49 | 8 | 5 | 32 | |
1626 | | 50 | 11 | 6 | 0 | |
1627 | | 51 | 14 | 6 | 0 | |
1628 | | 52 | 17 | 6 | 0 | |
1629 | | 53 | 20 | 6 | 0 | |
1630 | | 54 | 23 | 6 | 0 | |
1631 | | 55 | 26 | 6 | 0 | |
1632 | | 56 | 29 | 6 | 0 | |
1633 | | 57 | 52 | 6 | 0 | |
1634 | | 58 | 51 | 6 | 0 | |
1635 | | 59 | 50 | 6 | 0 | |
1636 | | 60 | 49 | 6 | 0 | |
1637 | | 61 | 48 | 6 | 0 | |
1638 | | 62 | 47 | 6 | 0 | |
1639 | | 63 | 46 | 6 | 0 | |
1640 | |
1641 | #### Offset Code: |
1642 | |
1643 | | State | Symbol | Number_Of_Bits | Base | |
1644 | | ----- | ------ | -------------- | ---- | |
1645 | | 0 | 0 | 5 | 0 | |
1646 | | 1 | 6 | 4 | 0 | |
1647 | | 2 | 9 | 5 | 0 | |
1648 | | 3 | 15 | 5 | 0 | |
1649 | | 4 | 21 | 5 | 0 | |
1650 | | 5 | 3 | 5 | 0 | |
1651 | | 6 | 7 | 4 | 0 | |
1652 | | 7 | 12 | 5 | 0 | |
1653 | | 8 | 18 | 5 | 0 | |
1654 | | 9 | 23 | 5 | 0 | |
1655 | | 10 | 5 | 5 | 0 | |
1656 | | 11 | 8 | 4 | 0 | |
1657 | | 12 | 14 | 5 | 0 | |
1658 | | 13 | 20 | 5 | 0 | |
1659 | | 14 | 2 | 5 | 0 | |
1660 | | 15 | 7 | 4 | 16 | |
1661 | | 16 | 11 | 5 | 0 | |
1662 | | 17 | 17 | 5 | 0 | |
1663 | | 18 | 22 | 5 | 0 | |
1664 | | 19 | 4 | 5 | 0 | |
1665 | | 20 | 8 | 4 | 16 | |
1666 | | 21 | 13 | 5 | 0 | |
1667 | | 22 | 19 | 5 | 0 | |
1668 | | 23 | 1 | 5 | 0 | |
1669 | | 24 | 6 | 4 | 16 | |
1670 | | 25 | 10 | 5 | 0 | |
1671 | | 26 | 16 | 5 | 0 | |
1672 | | 27 | 28 | 5 | 0 | |
1673 | | 28 | 27 | 5 | 0 | |
1674 | | 29 | 26 | 5 | 0 | |
1675 | | 30 | 25 | 5 | 0 | |
1676 | | 31 | 24 | 5 | 0 | |
1677 | |
1678 | |
1679 | |
1680 | Appendix B - Resources for implementers |
1681 | ------------------------------------------------- |
1682 | |
1683 | An open source reference implementation is available on : |
1684 | https://github.com/facebook/zstd |
1685 | |
1686 | The project contains a frame generator, called [decodeCorpus], |
1687 | which can be used by any 3rd-party implementation |
1688 | to verify that a tested decoder is compliant with the specification. |
1689 | |
1690 | [decodeCorpus]: https://github.com/facebook/zstd/tree/v1.3.4/tests#decodecorpus---tool-to-generate-zstandard-frames-for-decoder-testing |
1691 | |
1692 | `decodeCorpus` generates random valid frames. |
1693 | A compliant decoder should be able to decode them all, |
1694 | or at least provide a meaningful error code explaining for which reason it cannot |
1695 | (memory limit restrictions for example). |
1696 | |
1697 | |
1698 | Version changes |
1699 | --------------- |
1700 | - 0.3.9 : clarifications for Huffman-compressed literal sizes. |
1701 | - 0.3.8 : clarifications for Huffman Blocks and Huffman Tree descriptions. |
1702 | - 0.3.7 : clarifications for Repeat_Offsets, matching RFC8878 |
1703 | - 0.3.6 : clarifications for Dictionary_ID |
1704 | - 0.3.5 : clarifications for Block_Maximum_Size |
1705 | - 0.3.4 : clarifications for FSE decoding table |
1706 | - 0.3.3 : clarifications for field Block_Size |
1707 | - 0.3.2 : remove additional block size restriction on compressed blocks |
1708 | - 0.3.1 : minor clarification regarding offset history update rules |
1709 | - 0.3.0 : minor edits to match RFC8478 |
1710 | - 0.2.9 : clarifications for huffman weights direct representation, by Ulrich Kunitz |
1711 | - 0.2.8 : clarifications for IETF RFC discuss |
1712 | - 0.2.7 : clarifications from IETF RFC review, by Vijay Gurbani and Nick Terrell |
1713 | - 0.2.6 : fixed an error in huffman example, by Ulrich Kunitz |
1714 | - 0.2.5 : minor typos and clarifications |
1715 | - 0.2.4 : section restructuring, by Sean Purcell |
1716 | - 0.2.3 : clarified several details, by Sean Purcell |
1717 | - 0.2.2 : added predefined codes, by Johannes Rudolph |
1718 | - 0.2.1 : clarify field names, by Przemyslaw Skibinski |
1719 | - 0.2.0 : numerous format adjustments for zstd v0.8+ |
1720 | - 0.1.2 : limit Huffman tree depth to 11 bits |
1721 | - 0.1.1 : reserved dictID ranges |
1722 | - 0.1.0 : initial release |