1 | # Zstandard Seekable Format |
2 | |
3 | ### Notices |
4 | |
5 | Copyright (c) Meta Platforms, Inc. and affiliates. |
6 | |
7 | Permission is granted to copy and distribute this document |
8 | for any purpose and without charge, |
9 | including translations into other languages |
10 | and incorporation into compilations, |
11 | provided that the copyright notice and this notice are preserved, |
12 | and that any substantive changes or deletions from the original |
13 | are clearly marked. |
14 | Distribution of this document is unlimited. |
15 | |
16 | ### Version |
17 | 0.1.0 (11/04/17) |
18 | |
19 | ## Introduction |
20 | This document defines a format for compressed data to be stored so that subranges of the data can be efficiently decompressed without requiring the entire document to be decompressed. |
21 | This is done by splitting up the input data into frames, |
22 | each of which is compressed independently, |
23 | and so can be decompressed independently. |
24 | Decompression then takes advantage of a provided 'seek table', which allows the decompressor to jump directly to the desired data. The seek table is placed in a Zstandard skippable frame, so the result remains compatible with the original Zstandard format. |
25 | |
26 | ### Overall conventions |
27 | In this document: |
28 | - Square brackets, i.e. `[` and `]`, indicate optional fields or parameters. |
29 | - The naming convention for identifiers is `Mixed_Case_With_Underscores`. |
30 | - All numeric fields are little-endian unless specified otherwise. |
31 | |
32 | ## Format |
33 | |
34 | The format consists of a number of frames (Zstandard compressed frames and skippable frames), followed by a final skippable frame at the end containing the seek table. |
35 | |
36 | ### Seek Table Format |
37 | The structure of the seek table frame is as follows: |
38 | |
39 | |`Skippable_Magic_Number`|`Frame_Size`|`[Seek_Table_Entries]`|`Seek_Table_Footer`| |
40 | |------------------------|------------|----------------------|-------------------| |
41 | | 4 bytes | 4 bytes | 8-12 bytes each | 9 bytes | |
42 | |
43 | __`Skippable_Magic_Number`__ |
44 | |
45 | Value : 0x184D2A5E. |
46 | This is for compatibility with [Zstandard skippable frames]. |
47 | Since other Zstandard skippable frames may legally use the same |
48 | magic number, a decoder should not identify the seek table frame |
49 | by this value alone. |
50 | |
51 | __`Frame_Size`__ |
52 | |
53 | The total size of the skippable frame, not including the `Skippable_Magic_Number` or `Frame_Size`. |
54 | This is for compatibility with [Zstandard skippable frames]. |
55 | |
56 | [Zstandard skippable frames]: https://github.com/facebook/zstd/blob/release/doc/zstd_compression_format.md#skippable-frames |
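
As a rough illustration of how `Skippable_Magic_Number` and `Frame_Size` fit together, the following Python sketch builds the 8-byte header of the seek table frame. The function name `seek_table_frame_header` and its parameters are hypothetical and not part of the format. The size arithmetic assumes the entry and footer sizes listed in the structure table above.

```python
import struct

SKIPPABLE_MAGIC_NUMBER = 0x184D2A5E
# Number_Of_Frames (4) + Seek_Table_Descriptor (1) + Seekable_Magic_Number (4)
SEEK_TABLE_FOOTER_SIZE = 9


def seek_table_frame_header(number_of_frames: int, has_checksum: bool) -> bytes:
    """Build Skippable_Magic_Number followed by Frame_Size (hypothetical helper)."""
    # Each entry is 8 bytes, or 12 bytes when a checksum is stored.
    entry_size = 12 if has_checksum else 8
    # Frame_Size counts everything after the two header fields.
    frame_size = number_of_frames * entry_size + SEEK_TABLE_FOOTER_SIZE
    # Both fields are 4-byte little-endian unsigned integers.
    return struct.pack("<II", SKIPPABLE_MAGIC_NUMBER, frame_size)
```

For three frames with checksums, `Frame_Size` under these assumptions would be `3 * 12 + 9 = 45`.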
57 | |
58 | #### `Seek_Table_Footer` |
59 | The seek table footer format is as follows: |
60 | |
61 | |`Number_Of_Frames`|`Seek_Table_Descriptor`|`Seekable_Magic_Number`| |
62 | |------------------|-----------------------|-----------------------| |
63 | | 4 bytes | 1 byte | 4 bytes | |
64 | |
65 | __`Seekable_Magic_Number`__ |
66 | |
67 | Value : 0x8F92EAB1. |
68 | This value must occupy the last 4 bytes of the compressed file so that decoders |
69 | can efficiently find it and determine whether a seek table is actually present. |
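
One way a decoder might use this property is to read the last 9 bytes of the file and check the magic number before trusting anything else. The sketch below is illustrative only. The function name `read_seek_table_footer` is an assumption, and it takes the whole file as a `bytes` object for simplicity, where a real decoder would more likely seek to the end of the file.

```python
import struct

SEEKABLE_MAGIC_NUMBER = 0x8F92EAB1
SEEK_TABLE_FOOTER_SIZE = 9


def read_seek_table_footer(data: bytes):
    """Return (Number_Of_Frames, Seek_Table_Descriptor), or None if no footer is found."""
    if len(data) < SEEK_TABLE_FOOTER_SIZE:
        return None
    # "<IBI" = little-endian, no padding: 4-byte count, 1-byte descriptor, 4-byte magic.
    number_of_frames, descriptor, magic = struct.unpack(
        "<IBI", data[-SEEK_TABLE_FOOTER_SIZE:]
    )
    if magic != SEEKABLE_MAGIC_NUMBER:
        return None
    return number_of_frames, descriptor
```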
70 | |
71 | __`Number_Of_Frames`__ |
72 | |
73 | The number of stored frames in the data. |
74 | |
75 | __`Seek_Table_Descriptor`__ |
76 | |
77 | A bitfield describing the format of the seek table. |
78 | |
79 | | Bit number | Field name | |
80 | | ---------- | ---------- | |
81 | | 7 | `Checksum_Flag` | |
82 | | 6-2 | `Reserved_Bits` | |
83 | | 1-0 | `Unused_Bits` | |
84 | |
85 | While only `Checksum_Flag` currently exists, there are 7 other bits in this field that can be used for future changes to the format, |
86 | for example the addition of inline dictionaries. |
87 | |
88 | __`Checksum_Flag`__ |
89 | |
90 | If the checksum flag is set, each of the seek table entries contains a 4 byte checksum of the uncompressed data contained in its frame. |
91 | |
92 | `Reserved_Bits` are not currently used but may be used in the future for breaking changes, so a compliant decoder should ensure they are set to 0. `Unused_Bits` may be used in the future for non-breaking changes, so a compliant decoder should not interpret these bits. |
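
The bit layout above could be decoded along the following lines. This is a minimal sketch: the mask constants and the function name `parse_seek_table_descriptor` are hypothetical, and raising `ValueError` on non-zero reserved bits is only one possible way to "ensure they are set to 0".

```python
CHECKSUM_FLAG = 0x80  # bit 7
RESERVED_MASK = 0x7C  # bits 6-2
# Bits 1-0 (Unused_Bits) are deliberately never inspected.


def parse_seek_table_descriptor(descriptor: int) -> bool:
    """Return True if Checksum_Flag is set; reject non-zero Reserved_Bits."""
    if descriptor & RESERVED_MASK:
        raise ValueError("Reserved_Bits must be 0")
    return bool(descriptor & CHECKSUM_FLAG)
```

Note that a descriptor such as `0x83` is accepted here, because the two low bits are not interpreted.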
93 | |
94 | #### __`Seek_Table_Entries`__ |
95 | |
96 | `Seek_Table_Entries` consists of `Number_Of_Frames` entries, one for each frame in the data (not including the seek table frame), stored in sequence in the following form: |
97 | |
98 | |`Compressed_Size`|`Decompressed_Size`|`[Checksum]`| |
99 | |-----------------|-------------------|------------| |
100 | | 4 bytes | 4 bytes | 4 bytes | |
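
The entry layout in the table above can be read with a few lines of Python. The sketch below assumes the raw entry bytes have already been isolated from the seek table frame. The function name `parse_seek_table_entries` and the tuple return shape are illustrative choices, not something the format prescribes.

```python
import struct


def parse_seek_table_entries(raw: bytes, number_of_frames: int, has_checksum: bool):
    """Return a list of (Compressed_Size, Decompressed_Size, Checksum or None)."""
    # Two or three 4-byte little-endian unsigned integers per entry.
    fmt = "<III" if has_checksum else "<II"
    entry_size = struct.calcsize(fmt)  # 12 or 8
    if len(raw) != number_of_frames * entry_size:
        raise ValueError("entry data length does not match Number_Of_Frames")
    entries = []
    for compressed_size, decompressed_size, *checksum in struct.iter_unpack(fmt, raw):
        entries.append(
            (compressed_size, decompressed_size, checksum[0] if checksum else None)
        )
    return entries
```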
101 | |
102 | __`Compressed_Size`__ |
103 | |
104 | The compressed size of the frame. |
105 | The cumulative sum of the `Compressed_Size` fields of frames `0` to `i` gives the offset in the compressed file of frame `i+1`. |
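
The cumulative-sum rule can be sketched as a running total. The helper name `frame_compressed_offsets` is hypothetical, and the sketch assumes frame `0` begins at offset 0 of the compressed file.

```python
from itertools import accumulate


def frame_compressed_offsets(compressed_sizes):
    """Return the compressed-file offset at which each frame starts."""
    # Running totals starting from 0; the final total is the offset just past
    # the last frame, so it is dropped.
    return list(accumulate(compressed_sizes, initial=0))[:-1]
```

For `Compressed_Size` values of 100, 50 and 30, this places the frames at offsets 0, 100 and 150.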
106 | |
107 | __`Decompressed_Size`__ |
108 | |
109 | The size of the decompressed data contained in the frame. For skippable or otherwise empty frames, this value is 0. |
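
To seek, a decoder needs to map an offset in the decompressed data to the frame that contains it. One possible approach, sketched below, is a binary search over the running totals of `Decompressed_Size`. The function name `frame_for_decompressed_offset` is an assumption, and the sketch recomputes the totals on every call for brevity.

```python
from bisect import bisect_right
from itertools import accumulate


def frame_for_decompressed_offset(decompressed_sizes, offset):
    """Return (frame index, offset within that frame's decompressed data)."""
    # ends[i] is the decompressed offset just past the end of frame i.
    ends = list(accumulate(decompressed_sizes))
    if offset < 0 or not ends or offset >= ends[-1]:
        raise IndexError("offset is outside the decompressed data")
    # The first frame whose end lies strictly beyond the offset contains it.
    # Frames with Decompressed_Size 0 are skipped naturally.
    index = bisect_right(ends, offset)
    start = ends[index] - decompressed_sizes[index]
    return index, offset - start
```

With sizes 400, 0 and 300, offset 400 falls at the start of frame 2, since frame 1 holds no decompressed data.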
110 | |
111 | __`Checksum`__ |
112 | |
113 | Only present if `Checksum_Flag` is set in the `Seek_Table_Descriptor`. Value : the least significant 32 bits of the XXH64 digest of the uncompressed data, stored in little-endian format. |
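
The truncation and byte order can be illustrated as follows. The sketch assumes the 64-bit XXH64 digest has already been computed elsewhere (for instance by a third-party XXH64 library) and is passed in as an integer. The function name `checksum_field` is hypothetical.

```python
import struct


def checksum_field(xxh64_digest: int) -> bytes:
    """Encode the Checksum field from a 64-bit XXH64 digest given as an integer."""
    # Keep the least significant 32 bits, then store them little-endian.
    return struct.pack("<I", xxh64_digest & 0xFFFFFFFF)
```

A digest of `0x0123456789ABCDEF` would therefore be stored as the bytes `EF CD AB 89`.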
114 | |
115 | ## Version Changes |
116 | - 0.1.0: initial version |