# Zstandard Seekable Format

### Notices

Copyright (c) Meta Platforms, Inc. and affiliates.

Permission is granted to copy and distribute this document
for any purpose and without charge,
including translations into other languages
and incorporation into compilations,
provided that the copyright notice and this notice are preserved,
and that any substantive changes or deletions from the original
are clearly marked.
Distribution of this document is unlimited.

### Version
0.1.0 (11/04/17)

## Introduction
This document defines a format for storing compressed data so that subranges of the data can be decompressed efficiently, without decompressing the entire file.
This is done by splitting the input data into frames,
each of which is compressed independently
and so can be decompressed independently.
Decompression then takes advantage of a provided 'seek table', which allows the decompressor to jump immediately to the desired data. This is done in a way that is compatible with the original Zstandard format by placing the seek table in a Zstandard skippable frame.

### Overall conventions
In this document:
- square brackets, i.e. `[` and `]`, indicate optional fields or parameters.
- the naming convention for identifiers is `Mixed_Case_With_Underscores`.
- all numeric fields are little-endian unless specified otherwise.

## Format

The format consists of a number of frames (Zstandard compressed frames and skippable frames), followed by a final skippable frame at the end containing the seek table.

### Seek Table Format
The structure of the seek table frame is as follows:

|`Skippable_Magic_Number`|`Frame_Size`|`[Seek_Table_Entries]`|`Seek_Table_Footer`|
|------------------------|------------|----------------------|-------------------|
| 4 bytes                | 4 bytes    | 8 or 12 bytes each   | 9 bytes           |

__`Skippable_Magic_Number`__

Value : 0x184D2A5E.
This is for compatibility with [Zstandard skippable frames].
Since it is legal for other Zstandard skippable frames to use the same
magic number, a decoder should not identify a seek table frame
by this value alone.

__`Frame_Size`__

The total size of the skippable frame in bytes, not including the `Skippable_Magic_Number` or `Frame_Size` fields.
This is for compatibility with [Zstandard skippable frames].

[Zstandard skippable frames]: https://github.com/facebook/zstd/blob/release/doc/zstd_compression_format.md#skippable-frames

#### `Seek_Table_Footer`
The seek table footer format is as follows:

|`Number_Of_Frames`|`Seek_Table_Descriptor`|`Seekable_Magic_Number`|
|------------------|-----------------------|-----------------------|
| 4 bytes          | 1 byte                | 4 bytes               |

__`Seekable_Magic_Number`__

Value : 0x8F92EAB1.
This value must occupy the last 4 bytes of the compressed file so that decoders
can find it efficiently and determine whether a seek table is actually present.

__`Number_Of_Frames`__

The number of stored frames in the data.

__`Seek_Table_Descriptor`__

A bitfield describing the format of the seek table.

| Bit number | Field name      |
| ---------- | --------------- |
| 7          | `Checksum_Flag` |
| 6-2        | `Reserved_Bits` |
| 1-0        | `Unused_Bits`   |

While only `Checksum_Flag` currently exists, the 7 other bits in this field are available for future changes to the format,
for example the addition of inline dictionaries.

__`Checksum_Flag`__

If the checksum flag is set, each seek table entry contains a 4-byte checksum of the uncompressed data contained in its frame.

`Reserved_Bits` are not currently used but may be used in the future for breaking changes, so a compliant decoder should ensure they are set to 0. `Unused_Bits` may be used in the future for non-breaking changes, so a compliant decoder should not interpret these bits.
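
As an illustration of the footer layout and descriptor bits described above, here is a minimal Python sketch that decodes the last 9 bytes of a file. The function name `parse_seek_table_footer`, the constant names, and the choice to raise `ValueError` are assumptions made for this example and are not part of the format.

```python
import struct

SEEKABLE_MAGIC_NUMBER = 0x8F92EAB1
FOOTER_SIZE = 9  # 4 (Number_Of_Frames) + 1 (Seek_Table_Descriptor) + 4 (magic)


def parse_seek_table_footer(tail: bytes):
    """Decode a 9-byte footer; return (number_of_frames, checksum_flag).

    Hypothetical helper: the name and error handling are illustrative only.
    """
    if len(tail) != FOOTER_SIZE:
        raise ValueError("footer must be exactly 9 bytes")
    # "<" selects little-endian with no padding: u32, u8, u32.
    number_of_frames, descriptor, magic = struct.unpack("<IBI", tail)
    if magic != SEEKABLE_MAGIC_NUMBER:
        raise ValueError("Seekable_Magic_Number not found; no seek table")
    # Bits 6-2 are Reserved_Bits and must be 0.
    if descriptor & 0b0111_1100:
        raise ValueError("Reserved_Bits must be 0")
    # Bit 7 is Checksum_Flag; bits 1-0 (Unused_Bits) are deliberately ignored.
    checksum_flag = bool(descriptor & 0x80)
    return number_of_frames, checksum_flag
```

A caller would typically pass the final 9 bytes of the compressed file, for example `parse_seek_table_footer(data[-9:])`.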

#### `Seek_Table_Entries`

`Seek_Table_Entries` consists of `Number_Of_Frames` entries, one for each frame in the data (not including the seek table frame), each of the following form, in sequence:

|`Compressed_Size`|`Decompressed_Size`|`[Checksum]`|
|-----------------|-------------------|------------|
| 4 bytes         | 4 bytes           | 4 bytes    |

__`Compressed_Size`__

The compressed size of the frame.
The cumulative sum of the `Compressed_Size` fields of frames `0` to `i` gives the offset in the compressed file of frame `i+1`.
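
The cumulative-sum rule can be sketched in a few lines of Python. The helper name `frame_offsets` is hypothetical, and the reading of the final sum as the offset just past the last listed frame is an inference from the rule above, not a statement from the format.

```python
from itertools import accumulate


def frame_offsets(compressed_sizes):
    """Return (offsets, end) for a list of Compressed_Size values.

    offsets[i] is the position of frame i in the compressed file, with
    frame 0 at offset 0; end is the offset just past the last listed frame.
    """
    # accumulate(..., initial=0) yields 0, s0, s0+s1, ... (Python 3.8+).
    sums = list(accumulate(compressed_sizes, initial=0))
    return sums[:-1], sums[-1]


offsets, end = frame_offsets([100, 250, 40])
print(offsets, end)
```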

__`Decompressed_Size`__

The size of the decompressed data contained in the frame. For skippable or otherwise empty frames, this value is 0.
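
One way a decoder might use `Decompressed_Size` is to map a position in the decompressed data to the frame that contains it. The sketch below uses a binary search over cumulative sizes; the function name `locate` and the use of `IndexError` are assumptions for illustration, and other lookup strategies are equally valid.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(decompressed_sizes, target):
    """Return (frame_index, offset_within_frame) for a decompressed offset."""
    # starts[i] is the decompressed offset at which frame i begins;
    # the last element is the total decompressed size.
    starts = list(accumulate(decompressed_sizes, initial=0))
    if not 0 <= target < starts[-1]:
        raise IndexError("offset outside the decompressed data")
    # bisect_right picks the last frame starting at or before target, which
    # steps over zero-sized (skippable or empty) frames sharing that start.
    index = bisect_right(starts, target) - 1
    return index, target - starts[index]
```

For sizes `[10, 0, 20]`, offset 10 falls in frame 2 at position 0, because frame 1 holds no decompressed data.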

__`Checksum`__

Only present if `Checksum_Flag` is set in the `Seek_Table_Descriptor`. Value : the least significant 32 bits of the XXH64 digest of the uncompressed data, stored in little-endian format.
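
Putting the pieces together, the following Python sketch reads a complete seek table from the end of an in-memory file. It is an illustration under stated assumptions rather than a reference decoder: the name `read_seek_table`, the tuple layout of the result, and the error handling are all hypothetical. It does not verify checksums, since that would require an XXH64 implementation.

```python
import struct

SKIPPABLE_MAGIC_NUMBER = 0x184D2A5E
SEEKABLE_MAGIC_NUMBER = 0x8F92EAB1
HEADER_SIZE = 8  # Skippable_Magic_Number + Frame_Size
FOOTER_SIZE = 9  # Number_Of_Frames + Seek_Table_Descriptor + magic


def read_seek_table(data: bytes):
    """Return [(compressed_size, decompressed_size, checksum_or_None), ...]."""
    if len(data) < HEADER_SIZE + FOOTER_SIZE:
        raise ValueError("too short to hold a seek table frame")
    # The footer is always the last 9 bytes of the file.
    number_of_frames, descriptor, magic = struct.unpack(
        "<IBI", data[-FOOTER_SIZE:])
    if magic != SEEKABLE_MAGIC_NUMBER:
        raise ValueError("Seekable_Magic_Number not found")
    if descriptor & 0x7C:  # Reserved_Bits (bits 6-2) must be 0
        raise ValueError("Reserved_Bits must be 0")
    has_checksum = bool(descriptor & 0x80)  # Checksum_Flag (bit 7)
    entry_size = 12 if has_checksum else 8

    # Frame_Size excludes the 8-byte skippable frame header.
    frame_size = number_of_frames * entry_size + FOOTER_SIZE
    frame_start = len(data) - HEADER_SIZE - frame_size
    if frame_start < 0:
        raise ValueError("seek table larger than the file")
    skippable_magic, stored_size = struct.unpack_from("<II", data, frame_start)
    if skippable_magic != SKIPPABLE_MAGIC_NUMBER or stored_size != frame_size:
        raise ValueError("skippable frame header does not match the footer")

    entry_format = "<III" if has_checksum else "<II"
    entries = []
    for i in range(number_of_frames):
        position = frame_start + HEADER_SIZE + i * entry_size
        fields = struct.unpack_from(entry_format, data, position)
        entries.append(fields if has_checksum else fields + (None,))
    return entries
```

Checking the skippable frame header against the size implied by the footer is one way to avoid relying on `Skippable_Magic_Number` alone.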

## Version Changes
- 0.1.0: initial version