mirror of
https://github.com/ziglang/zig.git
synced 2025-12-06 06:13:07 +00:00
3028 lines
105 KiB
Plaintext
3028 lines
105 KiB
Plaintext
|
||
|
||
|
||
|
||
|
||
|
||
Internet Engineering Task Force (IETF) Y. Collet
|
||
Request for Comments: 8478 M. Kucherawy, Ed.
|
||
Category: Informational Facebook
|
||
ISSN: 2070-1721 October 2018
|
||
|
||
|
||
Zstandard Compression and the application/zstd Media Type
|
||
|
||
Abstract
|
||
|
||
Zstandard, or "zstd" (pronounced "zee standard"), is a data
|
||
compression mechanism. This document describes the mechanism and
|
||
registers a media type and content encoding to be used when
|
||
transporting zstd-compressed content via Multipurpose Internet Mail
|
||
Extensions (MIME).
|
||
|
||
Despite use of the word "standard" as part of its name, readers are
|
||
advised that this document is not an Internet Standards Track
|
||
specification; it is being published for informational purposes only.
|
||
|
||
Status of This Memo
|
||
|
||
This document is not an Internet Standards Track specification; it is
|
||
published for informational purposes.
|
||
|
||
This document is a product of the Internet Engineering Task Force
|
||
(IETF). It represents the consensus of the IETF community. It has
|
||
received public review and has been approved for publication by the
|
||
Internet Engineering Steering Group (IESG). Not all documents
|
||
approved by the IESG are candidates for any level of Internet
|
||
Standard; see Section 2 of RFC 7841.
|
||
|
||
Information about the current status of this document, any errata,
|
||
and how to provide feedback on it may be obtained at
|
||
https://www.rfc-editor.org/info/rfc8478.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 1]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Copyright Notice
|
||
|
||
Copyright (c) 2018 IETF Trust and the persons identified as the
|
||
document authors. All rights reserved.
|
||
|
||
This document is subject to BCP 78 and the IETF Trust's Legal
|
||
Provisions Relating to IETF Documents
|
||
(https://trustee.ietf.org/license-info) in effect on the date of
|
||
publication of this document. Please review these documents
|
||
carefully, as they describe your rights and restrictions with respect
|
||
to this document. Code Components extracted from this document must
|
||
include Simplified BSD License text as described in Section 4.e of
|
||
the Trust Legal Provisions and are provided without warranty as
|
||
described in the Simplified BSD License.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 2]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Table of Contents
|
||
|
||
1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . 4
|
||
2. Definitions . . . . . . . . . . . . . . . . . . . . . . . . . 4
|
||
3. Compression Algorithm . . . . . . . . . . . . . . . . . . . . 5
|
||
3.1. Frames . . . . . . . . . . . . . . . . . . . . . . . . . 6
|
||
3.1.1. Zstandard Frames . . . . . . . . . . . . . . . . . . 6
|
||
3.1.1.1. Frame Header . . . . . . . . . . . . . . . . . . 7
|
||
3.1.1.2. Blocks . . . . . . . . . . . . . . . . . . . . . 12
|
||
3.1.1.3. Compressed Blocks . . . . . . . . . . . . . . . . 14
|
||
3.1.1.4. Sequence Execution . . . . . . . . . . . . . . . 28
|
||
3.1.1.5. Repeat Offsets . . . . . . . . . . . . . . . . . 29
|
||
3.1.2. Skippable Frames . . . . . . . . . . . . . . . . . . 30
|
||
4. Entropy Encoding . . . . . . . . . . . . . . . . . . . . . . 30
|
||
4.1. FSE . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
|
||
4.1.1. FSE Table Description . . . . . . . . . . . . . . . . 31
|
||
4.2. Huffman Coding . . . . . . . . . . . . . . . . . . . . . 34
|
||
4.2.1. Huffman Tree Description . . . . . . . . . . . . . . 35
|
||
4.2.1.1. Huffman Tree Header . . . . . . . . . . . . . . . 36
|
||
4.2.1.2. FSE Compression of Huffman Weights . . . . . . . 37
|
||
4.2.1.3. Conversion from Weights to Huffman Prefix Codes . 38
|
||
4.2.2. Huffman-Coded Streams . . . . . . . . . . . . . . . . 39
|
||
5. Dictionary Format . . . . . . . . . . . . . . . . . . . . . . 40
|
||
6. IANA Considerations . . . . . . . . . . . . . . . . . . . . . 42
|
||
6.1. The 'application/zstd' Media Type . . . . . . . . . . . . 42
|
||
6.2. Content Encoding . . . . . . . . . . . . . . . . . . . . 43
|
||
6.3. Dictionaries . . . . . . . . . . . . . . . . . . . . . . 43
|
||
7. Security Considerations . . . . . . . . . . . . . . . . . . . 43
|
||
8. Implementation Status . . . . . . . . . . . . . . . . . . . . 44
|
||
9. References . . . . . . . . . . . . . . . . . . . . . . . . . 45
|
||
9.1. Normative References . . . . . . . . . . . . . . . . . . 45
|
||
9.2. Informative References . . . . . . . . . . . . . . . . . 45
|
||
Appendix A. Decoding Tables for Predefined Codes . . . . . . . . 46
|
||
A.1. Literal Length Code Table . . . . . . . . . . . . . . . . 46
|
||
A.2. Match Length Code Table . . . . . . . . . . . . . . . . . 49
|
||
A.3. Offset Code Table . . . . . . . . . . . . . . . . . . . . 52
|
||
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . 53
|
||
Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . 54
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 3]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
1. Introduction
|
||
|
||
Zstandard, or "zstd" (pronounced "zee standard"), is a data
|
||
compression mechanism, akin to gzip [RFC1952].
|
||
|
||
Despite use of the word "standard" as part of its name, readers are
|
||
advised that this document is not an Internet Standards Track
|
||
specification; it is being published for informational purposes only.
|
||
|
||
This document describes the Zstandard format. Also, to enable the
|
||
transport of a data object compressed with Zstandard, this document
|
||
registers a media type that can be used to identify such content when
|
||
it is used in a payload encoded using Multipurpose Internet Mail
|
||
Extensions (MIME).
|
||
|
||
2. Definitions
|
||
|
||
Some terms used elsewhere in this document are defined here for
|
||
clarity.
|
||
|
||
uncompressed: Describes an arbitrary set of bytes in their original
|
||
form, prior to being subjected to compression.
|
||
|
||
compress, compression: The act of processing a set of bytes via the
|
||
compression mechanism described here.
|
||
|
||
compressed: Describes the result of passing a set of bytes through
|
||
this mechanism. The original input has thus been compressed.
|
||
|
||
decompress, decompression: The act of processing a set of bytes
|
||
through the inverse of the compression mechanism described here,
|
||
in an attempt to recover the original set of bytes prior to
|
||
compression.
|
||
|
||
decompressed: Describes the result of passing a set of bytes through
|
||
the reverse of this mechanism. When this is successful, the
|
||
decompressed payload and the uncompressed payload are
|
||
indistinguishable.
|
||
|
||
encode: The process of translating data from one form to another;
|
||
this may include compression or it may refer to other translations
|
||
done as part of this specification.
|
||
|
||
decode: The reverse of "encode"; describes a process of reversing a
|
||
prior encoding to recover the original content.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 4]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
frame: Content compressed by Zstandard is transformed into a
|
||
Zstandard frame. Multiple frames can be appended into a single
|
||
file or stream. A frame is completely independent, has a defined
|
||
beginning and end, and has a set of parameters that tells the
|
||
decoder how to decompress it.
|
||
|
||
block: A frame encapsulates one or multiple blocks. Each block
|
||
contains arbitrary content, which is described by its header, and
|
||
has a guaranteed maximum content size that depends upon frame
|
||
parameters. Unlike frames, each block depends on previous blocks
|
||
for proper decoding. However, each block can be decompressed
|
||
without waiting for its successor, allowing streaming operations.
|
||
|
||
natural order: A sequence or ordering of objects or values that is
|
||
typical of that type of object or value. A set of unique
|
||
integers, for example, is in "natural order" if when progressing
|
||
from one element in the set or sequence to the next, there is
|
||
never a decrease in value.
|
||
|
||
The naming convention for identifiers within the specification is
|
||
Mixed_Case_With_Underscores. Identifiers inside square brackets
|
||
indicate that the identifier is optional in the presented context.
|
||
|
||
3. Compression Algorithm
|
||
|
||
This section describes the Zstandard algorithm.
|
||
|
||
The purpose of this document is to define a lossless compressed data
|
||
format that is a) independent of the CPU type, operating system, file
|
||
system, and character set and b) is suitable for file compression and
|
||
pipe and streaming compression, using the Zstandard algorithm. The
|
||
text of the specification assumes a basic background in programming
|
||
at the level of bits and other primitive data representations.
|
||
|
||
The data can be produced or consumed, even for an arbitrarily long
|
||
sequentially presented input data stream, using only an a priori
|
||
bounded amount of intermediate storage, and hence can be used in data
|
||
communications. The format uses the Zstandard compression method,
|
||
and an optional xxHash-64 checksum method [XXHASH], for detection of
|
||
data corruption.
|
||
|
||
The data format defined by this specification does not attempt to
|
||
allow random access to compressed data.
|
||
|
||
Unless otherwise indicated below, a compliant compressor must produce
|
||
data sets that conform to the specifications presented here.
|
||
However, it does not need to support all options.
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 5]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
A compliant decompressor must be able to decompress at least one
|
||
working set of parameters that conforms to the specifications
|
||
presented here. It may also ignore informative fields, such as the
|
||
checksum. Whenever it does not support a parameter defined in the
|
||
compressed stream, it must produce a non-ambiguous error code and
|
||
associated error message explaining which parameter is unsupported.
|
||
|
||
This specification is intended for use by implementers of software to
|
||
compress data into Zstandard format and/or decompress data from
|
||
Zstandard format. The Zstandard format is supported by an open
|
||
source reference implementation, written in portable C, and available
|
||
at [ZSTD].
|
||
|
||
3.1. Frames
|
||
|
||
Zstandard compressed data is made up of one or more frames. Each
|
||
frame is independent and can be decompressed independently of other
|
||
frames. The decompressed content of multiple concatenated frames is
|
||
the concatenation of each frame's decompressed content.
|
||
|
||
There are two frame formats defined for Zstandard: Zstandard frames
|
||
and skippable frames. Zstandard frames contain compressed data,
|
||
while skippable frames contain custom user metadata.
|
||
|
||
3.1.1. Zstandard Frames
|
||
|
||
The structure of a single Zstandard frame is as follows:
|
||
|
||
+--------------------+------------+
|
||
| Magic_Number | 4 bytes |
|
||
+--------------------+------------+
|
||
| Frame_Header | 2-14 bytes |
|
||
+--------------------+------------+
|
||
| Data_Block | n bytes |
|
||
+--------------------+------------+
|
||
| [More Data_Blocks] | |
|
||
+--------------------+------------+
|
||
| [Content_Checksum] | 0-4 bytes |
|
||
+--------------------+------------+
|
||
|
||
Magic_Number: 4 bytes, little-endian format. Value: 0xFD2FB528.
|
||
|
||
Frame_Header: 2 to 14 bytes, detailed in Section 3.1.1.1.
|
||
|
||
Data_Block: Detailed in Section 3.1.1.2. This is where data
|
||
appears.
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 6]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Content_Checksum: An optional 32-bit checksum, only present if
|
||
Content_Checksum_Flag is set. The content checksum is the result
|
||
of the XXH64() hash function [XXHASH] digesting the original
|
||
(decoded) data as input, and a seed of zero. The low 4 bytes of
|
||
the checksum are stored in little-endian format.
|
||
|
||
The magic number was selected to be less probable to find at the
|
||
beginning of an arbitrary file. It avoids trivial patterns (0x00,
|
||
0xFF, repeated bytes, increasing bytes, etc.), contains byte values
|
||
outside of ASCII range, and doesn't map into UTF-8 space, all of
|
||
which reduce the likelihood of its appearance at the top of a text
|
||
file.
|
||
|
||
3.1.1.1. Frame Header
|
||
|
||
The frame header has a variable size, with a minimum of 2 bytes and
|
||
up to 14 bytes depending on optional parameters. The structure of
|
||
Frame_Header is as follows:
|
||
|
||
+-------------------------+-----------+
|
||
| Frame_Header_Descriptor | 1 byte |
|
||
+-------------------------+-----------+
|
||
| [Window_Descriptor] | 0-1 byte |
|
||
+-------------------------+-----------+
|
||
| [Dictionary_ID] | 0-4 bytes |
|
||
+-------------------------+-----------+
|
||
| [Frame_Content_Size] | 0-8 bytes |
|
||
+-------------------------+-----------+
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 7]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
3.1.1.1.1. Frame_Header_Descriptor
|
||
|
||
The first header's byte is called the Frame_Header_Descriptor. It
|
||
describes which other fields are present. Decoding this byte is
|
||
enough to tell the size of Frame_Header.
|
||
|
||
+------------+-------------------------+
|
||
| Bit Number | Field Name |
|
||
+------------+-------------------------+
|
||
| 7-6 | Frame_Content_Size_Flag |
|
||
+------------+-------------------------+
|
||
| 5 | Single_Segment_Flag |
|
||
+------------+-------------------------+
|
||
| 4 | (unused) |
|
||
+------------+-------------------------+
|
||
| 3 | (reserved) |
|
||
+------------+-------------------------+
|
||
| 2 | Content_Checksum_Flag |
|
||
+------------+-------------------------+
|
||
| 1-0 | Dictionary_ID_Flag |
|
||
+------------+-------------------------+
|
||
|
||
In this table, bit 7 is the highest bit, while bit 0 is the lowest
|
||
one.
|
||
|
||
3.1.1.1.1.1. Frame_Content_Size_Flag
|
||
|
||
This is a 2-bit flag (equivalent to Frame_Header_Descriptor right-
|
||
shifted 6 bits) specifying whether Frame_Content_Size (the
|
||
decompressed data size) is provided within the header. Flag_Value
|
||
provides FCS_Field_Size, which is the number of bytes used by
|
||
Frame_Content_Size according to the following table:
|
||
|
||
+----------------+--------+---+---+---+
|
||
| Flag_Value | 0 | 1 | 2 | 3 |
|
||
+----------------+--------+---+---+---+
|
||
| FCS_Field_Size | 0 or 1 | 2 | 4 | 8 |
|
||
+----------------+--------+---+---+---+
|
||
|
||
When Flag_Value is 0, FCS_Field_Size depends on Single_Segment_Flag:
|
||
If Single_Segment_Flag is set, FCS_Field_Size is 1. Otherwise,
|
||
FCS_Field_Size is 0; Frame_Content_Size is not provided.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 8]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
3.1.1.1.1.2. Single_Segment_Flag
|
||
|
||
If this flag is set, data must be regenerated within a single
|
||
continuous memory segment.
|
||
|
||
In this case, Window_Descriptor byte is skipped, but
|
||
Frame_Content_Size is necessarily present. As a consequence, the
|
||
decoder must allocate a memory segment of size equal or larger than
|
||
Frame_Content_Size.
|
||
|
||
In order to protect the decoder from unreasonable memory
|
||
requirements, a decoder is allowed to reject a compressed frame that
|
||
requests a memory size beyond the decoder's authorized range.
|
||
|
||
For broader compatibility, decoders are recommended to support memory
|
||
sizes of at least 8 MB. This is only a recommendation; each decoder
|
||
is free to support higher or lower limits, depending on local
|
||
limitations.
|
||
|
||
3.1.1.1.1.3. Unused Bit
|
||
|
||
A decoder compliant with this specification version shall not
|
||
interpret this bit. It might be used in a future version, to signal
|
||
a property that is not mandatory to properly decode the frame. An
|
||
encoder compliant with this specification must set this bit to zero.
|
||
|
||
3.1.1.1.1.4. Reserved Bit
|
||
|
||
This bit is reserved for some future feature. Its value must be
|
||
zero. A decoder compliant with this specification version must
|
||
ensure it is not set. This bit may be used in a future revision, to
|
||
signal a feature that must be interpreted to decode the frame
|
||
correctly.
|
||
|
||
3.1.1.1.1.5. Content_Checksum_Flag
|
||
|
||
If this flag is set, a 32-bit Content_Checksum will be present at the
|
||
frame's end. See the description of Content_Checksum above.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 9]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
3.1.1.1.1.6. Dictionary_ID_Flag
|
||
|
||
This is a 2-bit flag (= Frame_Header_Descriptor & 0x3) indicating
|
||
whether a dictionary ID is provided within the header. It also
|
||
specifies the size of this field as DID_Field_Size:
|
||
|
||
+----------------+---+---+---+---+
|
||
| Flag_Value | 0 | 1 | 2 | 3 |
|
||
+----------------+---+---+---+---+
|
||
| DID_Field_Size | 0 | 1 | 2 | 4 |
|
||
+----------------+---+---+---+---+
|
||
|
||
3.1.1.1.2. Window Descriptor
|
||
|
||
This provides guarantees about the minimum memory buffer required to
|
||
decompress a frame. This information is important for decoders to
|
||
allocate enough memory.
|
||
|
||
The Window_Descriptor byte is optional. When Single_Segment_Flag is
|
||
set, Window_Descriptor is not present. In this case, Window_Size is
|
||
Frame_Content_Size, which can be any value from 0 to 2^64-1 bytes (16
|
||
ExaBytes).
|
||
|
||
+------------+----------+----------+
|
||
| Bit Number | 7-3 | 2-0 |
|
||
+------------+----------+----------+
|
||
| Field Name | Exponent | Mantissa |
|
||
+------------+----------+----------+
|
||
|
||
The minimum memory buffer size is called Window_Size. It is
|
||
described by the following formulae:
|
||
|
||
windowLog = 10 + Exponent;
|
||
windowBase = 1 << windowLog;
|
||
windowAdd = (windowBase / 8) * Mantissa;
|
||
Window_Size = windowBase + windowAdd;
|
||
|
||
The minimum Window_Size is 1 KB. The maximum Window_Size is (1<<41)
|
||
+ 7*(1<<38) bytes, which is 3.75 TB.
|
||
|
||
In general, larger Window_Size values tend to improve the compression
|
||
ratio, but at the cost of increased memory usage.
|
||
|
||
To properly decode compressed data, a decoder will need to allocate a
|
||
buffer of at least Window_Size bytes.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 10]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
In order to protect decoders from unreasonable memory requirements, a
|
||
decoder is allowed to reject a compressed frame that requests a
|
||
memory size beyond decoder's authorized range.
|
||
|
||
For improved interoperability, it's recommended for decoders to
|
||
support values of Window_Size up to 8 MB and for encoders not to
|
||
generate frames requiring a Window_Size larger than 8 MB. It's
|
||
merely a recommendation though, and decoders are free to support
|
||
larger or lower limits, depending on local limitations.
|
||
|
||
3.1.1.1.3. Dictionary_ID
|
||
|
||
This is a variable size field, which contains the ID of the
|
||
dictionary required to properly decode the frame. This field is
|
||
optional. When it's not present, it's up to the decoder to know
|
||
which dictionary to use.
|
||
|
||
Dictionary_ID field size is provided by DID_Field_Size.
|
||
DID_Field_Size is directly derived from the value of
|
||
Dictionary_ID_Flag. One byte can represent an ID 0-255; 2 bytes can
|
||
represent an ID 0-65535; 4 bytes can represent an ID 0-4294967295.
|
||
Format is little-endian.
|
||
|
||
It is permitted to represent a small ID (for example, 13) with a
|
||
large 4-byte dictionary ID, even if it is less efficient.
|
||
|
||
Within private environments, any dictionary ID can be used. However,
|
||
for frames and dictionaries distributed in public space,
|
||
Dictionary_ID must be attributed carefully. The following ranges are
|
||
reserved for use only with dictionaries that have been registered
|
||
with IANA (see Section 6.3):
|
||
|
||
low range: <= 32767
|
||
high range: >= (1 << 31)
|
||
|
||
Any other value for Dictionary_ID can be used by private arrangement
|
||
between participants.
|
||
|
||
Any payload presented for decompression that references an
|
||
unregistered reserved dictionary ID results in an error.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 11]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
3.1.1.1.4. Frame Content Size
|
||
|
||
This is the original (uncompressed) size. This information is
|
||
optional. Frame_Content_Size uses a variable number of bytes,
|
||
provided by FCS_Field_Size. FCS_Field_Size is provided by the value
|
||
of Frame_Content_Size_Flag. FCS_Field_Size can be equal to 0 (not
|
||
present), 1, 2, 4, or 8 bytes.
|
||
|
||
+----------------+--------------+
|
||
| FCS Field Size | Range |
|
||
+----------------+--------------+
|
||
| 0 | unknown |
|
||
+----------------+--------------+
|
||
| 1 | 0 - 255 |
|
||
+----------------+--------------+
|
||
| 2 | 256 - 65791 |
|
||
+----------------+--------------+
|
||
| 4 | 0 - 2^32 - 1 |
|
||
+----------------+--------------+
|
||
| 8 | 0 - 2^64 - 1 |
|
||
+----------------+--------------+
|
||
|
||
Frame_Content_Size format is little-endian. When FCS_Field_Size is
|
||
1, 4, or 8 bytes, the value is read directly. When FCS_Field_Size is
|
||
2, the offset of 256 is added. It's allowed to represent a small
|
||
size (for example 18) using any compatible variant.
|
||
|
||
3.1.1.2. Blocks
|
||
|
||
After Magic_Number and Frame_Header, there are some number of blocks.
|
||
Each frame must have at least 1 block, but there is no upper limit on
|
||
the number of blocks per frame.
|
||
|
||
The structure of a block is as follows:
|
||
|
||
+--------------+---------------+
|
||
| Block_Header | Block_Content |
|
||
+--------------+---------------+
|
||
| 3 bytes | n bytes |
|
||
+--------------+---------------+
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 12]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Block_Header uses 3 bytes, written using little-endian convention.
|
||
It contains three fields:
|
||
|
||
+------------+------------+------------+
|
||
| Last_Block | Block_Type | Block_Size |
|
||
+------------+------------+------------+
|
||
| bit 0 | bits 1-2 | bits 3-23 |
|
||
+------------+------------+------------+
|
||
|
||
3.1.1.2.1. Last_Block
|
||
|
||
The lowest bit (Last_Block) signals whether this block is the last
|
||
one. The frame will end after this last block. It may be followed
|
||
by an optional Content_Checksum (see Section 3.1.1).
|
||
|
||
3.1.1.2.2. Block_Type
|
||
|
||
The next 2 bits represent the Block_Type. There are four block
|
||
types:
|
||
|
||
+-----------+------------------+
|
||
| Value | Block_Type |
|
||
+-----------+------------------+
|
||
| 0 | Raw_Block |
|
||
+-----------+------------------+
|
||
| 1 | RLE_Block |
|
||
+-----------+------------------+
|
||
| 2 | Compressed_Block |
|
||
+-----------+------------------+
|
||
| 3 | Reserved |
|
||
+-----------+------------------+
|
||
|
||
Raw_Block: This is an uncompressed block. Block_Content contains
|
||
Block_Size bytes.
|
||
|
||
RLE_Block: This is a single byte, repeated Block_Size times.
|
||
Block_Content consists of a single byte. On the decompression
|
||
side, this byte must be repeated Block_Size times.
|
||
|
||
Compressed_Block: This is a compressed block as described in
|
||
Section 3.1.1.3. Block_Size is the length of Block_Content,
|
||
namely the compressed data. The decompressed size is not known,
|
||
but its maximum possible value is guaranteed (see below).
|
||
|
||
Reserved: This is not a block. This value cannot be used with the
|
||
current specification. If such a value is present, it is
|
||
considered to be corrupt data.
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 13]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
3.1.1.2.3. Block_Size
|
||
|
||
The upper 21 bits of Block_Header represent the Block_Size.
|
||
Block_Size is the size of the block excluding the header. A block
|
||
can contain any number of bytes (even zero), up to
|
||
Block_Maximum_Decompressed_Size, which is the smallest of:
|
||
|
||
o Window_Size
|
||
|
||
o 128 KB
|
||
|
||
A Compressed_Block has the extra restriction that Block_Size is
|
||
always strictly less than the decompressed size. If this condition
|
||
cannot be respected, the block must be sent uncompressed instead
|
||
(i.e., treated as a Raw_Block).
|
||
|
||
3.1.1.3. Compressed Blocks
|
||
|
||
To decompress a compressed block, the compressed size must be
|
||
provided from the Block_Size field within Block_Header.
|
||
|
||
A compressed block consists of two sections: a Literals
|
||
Section (Section 3.1.1.3.1) and a
|
||
Sequences_Section (Section 3.1.1.3.2). The results of the two
|
||
sections are then combined to produce the decompressed data in
|
||
Sequence Execution (Section 3.1.1.4).
|
||
|
||
To decode a compressed block, the following elements are necessary:
|
||
|
||
o Previous decoded data, up to a distance of Window_Size, or the
|
||
beginning of the Frame, whichever is smaller. Single_Segment_Flag
|
||
will be set in the latter case.
|
||
|
||
o List of "recent offsets" from the previous Compressed_Block.
|
||
|
||
o The previous Huffman tree, required by Treeless_Literals_Block
|
||
type.
|
||
|
||
o Previous Finite State Entropy (FSE) decoding tables, required by
|
||
Repeat_Mode, for each symbol type (literals lengths, match
|
||
lengths, offsets).
|
||
|
||
Note that decoding tables are not always from the previous
|
||
Compressed_Block:
|
||
|
||
o Every decoding table can come from a dictionary.
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 14]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
o The Huffman tree comes from the previous
|
||
Compressed_Literals_Block.
|
||
|
||
3.1.1.3.1. Literals_Section_Header
|
||
|
||
All literals are regrouped in the first part of the block. They can
|
||
be decoded first and then copied during Sequence Execution (see
|
||
Section 3.1.1.4), or they can be decoded on the flow during Sequence
|
||
Execution.
|
||
|
||
Literals can be stored uncompressed or compressed using Huffman
|
||
prefix codes. When compressed, an optional tree description can be
|
||
present, followed by 1 or 4 streams.
|
||
|
||
+----------------------------+
|
||
| Literals_Section_Header |
|
||
+----------------------------+
|
||
| [Huffman_Tree_Description] |
|
||
+----------------------------+
|
||
| [Jump_Table] |
|
||
+----------------------------+
|
||
| Stream_1 |
|
||
+----------------------------+
|
||
| [Stream_2] |
|
||
+----------------------------+
|
||
| [Stream_3] |
|
||
+----------------------------+
|
||
| [Stream_4] |
|
||
+----------------------------+
|
||
|
||
3.1.1.3.1.1. Literals_Section_Header
|
||
|
||
This field describes how literals are packed. It's a byte-aligned
|
||
variable-size bit field, ranging from 1 to 5 bytes, using little-
|
||
endian convention.
|
||
|
||
+---------------------+-----------+
|
||
| Literals_Block_Type | 2 bits |
|
||
+---------------------+-----------+
|
||
| Size_Format | 1-2 bits |
|
||
+---------------------+-----------+
|
||
| Regenerated_Size | 5-20 bits |
|
||
+---------------------+-----------+
|
||
| [Compressed_Size] | 0-18 bits |
|
||
+---------------------+-----------+
|
||
|
||
In this representation, bits at the top are the lowest bits.
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 15]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
The Literals_Block_Type field uses the two lowest bits of the first
|
||
byte, describing four different block types:
|
||
|
||
+---------------------------+-------+
|
||
| Literals_Block_Type | Value |
|
||
+---------------------------+-------+
|
||
| Raw_Literals_Block | 0 |
|
||
+---------------------------+-------+
|
||
| RLE_Literals_Block | 1 |
|
||
+---------------------------+-------+
|
||
| Compressed_Literals_Block | 2 |
|
||
+---------------------------+-------+
|
||
| Treeless_Literals_Block | 3 |
|
||
+---------------------------+-------+
|
||
|
||
Raw_Literals_Block: Literals are stored uncompressed.
|
||
Literals_Section_Content is Regenerated_Size.
|
||
|
||
RLE_Literals_Block: Literals consist of a single-byte value repeated
|
||
Regenerated_Size times. Literals_Section_Content is 1.
|
||
|
||
Compressed_Literals_Block: This is a standard Huffman-compressed
|
||
block, starting with a Huffman tree description. See details
|
||
below. Literals_Section_Content is Compressed_Size.
|
||
|
||
Treeless_Literals_Block: This is a Huffman-compressed block, using
|
||
the Huffman tree from the previous Compressed_Literals_Block, or a
|
||
dictionary if there is no previous Huffman-compressed literals
|
||
block. Huffman_Tree_Description will be skipped. Note that if
|
||
this mode is triggered without any previous Huffman-table in the
|
||
frame (or dictionary, per Section 5), it should be treated as data
|
||
corruption. Literals_Section_Content is Compressed_Size.
|
||
|
||
The Size_Format is divided into two families:
|
||
|
||
o For Raw_Literals_Block and RLE_Literals_Block, it's only necessary
|
||
to decode Regenerated_Size. There is no Compressed_Size field.
|
||
|
||
o For Compressed_Block and Treeless_Literals_Block, it's required to
|
||
decode both Compressed_Size and Regenerated_Size (the decompressed
|
||
size). It's also necessary to decode the number of streams (1 or
|
||
4).
|
||
|
||
For values spanning several bytes, the convention is little endian.
|
||
|
||
Size_Format for Raw_Literals_Block and RLE_Literals_Block uses 1 or 2
|
||
bits. Its value is (Literals_Section_Header[0]>>2) & 0x3.
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 16]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Size_Format == 00 or 10: Size_Format uses 1 bit. Regenerated_Size
|
||
uses 5 bits (value 0-31). Literals_Section_Header uses 1 byte.
|
||
Regenerated_Size = Literal_Section_Header[0]>>3.
|
||
|
||
Size_Format == 01: Size_Format uses 2 bits. Regenerated_Size uses
|
||
12 bits (values 0-4095). Literals_Section_Header uses 2 bytes.
|
||
Regenerated_Size = (Literals_Section_Header[0]>>4) +
|
||
(Literals_Section_Header[1]<<4).
|
||
|
||
Size_Format == 11: Size_Format uses 2 bits. Regenerated_Size uses
|
||
20 bits (values 0-1048575). Literals_Section_Header uses 3 bytes.
|
||
Regenerated_Size = (Literals_Section_Header[0]>>4) +
|
||
(Literals_Section_Header[1]<<4) + (Literals_Section_Header[2]<<12)
|
||
|
||
Only Stream_1 is present for these cases. Note that it is permitted
|
||
to represent a short value (for example, 13) using a long format,
|
||
even if it's less efficient.
|
||
|
||
Size_Format for Compressed_Literals_Block and Treeless_Literals_Block
|
||
always uses 2 bits.
|
||
|
||
Size_Format == 00: A single stream. Both Regenerated_Size and
|
||
Compressed_Size use 10 bits (values 0-1023).
|
||
Literals_Section_Header uses 3 bytes.
|
||
|
||
Size_Format == 01: 4 streams. Both Regenerated_Size and
|
||
Compressed_Size use 10 bits (values 0-1023).
|
||
Literals_Section_Header uses 3 bytes.
|
||
|
||
Size_Format == 10: 4 streams. Both Regenerated_Size and
|
||
Compressed_Size use 14 bits (values 0-16383).
|
||
Literals_Section_Header uses 4 bytes.
|
||
|
||
Size_Format == 11: 4 streams. Both Regenerated_Size and
|
||
Compressed_Size use 18 bits (values 0-262143).
|
||
Literals_Section_Header uses 5 bytes.
|
||
|
||
Both the Compressed_Size and Regenerated_Size fields follow little-
|
||
endian convention. Note that Compressed_Size includes the size of
|
||
the Huffman_Tree_Description when it is present.
|
||
|
||
3.1.1.3.1.2. Raw_Literals_Block
|
||
|
||
The data in Stream_1 is Regenerated_Size bytes long. It contains the
|
||
raw literals data to be used during Sequence Execution
|
||
(Section 3.1.1.3.2).
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 17]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
3.1.1.3.1.3. RLE_Literals_Block
|
||
|
||
Stream_1 consists of a single byte that should be repeated
|
||
Regenerated_Size times to generate the decoded literals.
|
||
|
||
3.1.1.3.1.4. Compressed_Literals_Block and Treeless_Literals_Block
|
||
|
||
Both of these modes contain Huffman-encoded data. For
|
||
Treeless_Literals_Block, the Huffman table comes from the previously
|
||
compressed literals block, or from a dictionary; see Section 5.
|
||
|
||
3.1.1.3.1.5. Huffman_Tree_Description
|
||
|
||
This section is only present when the Literals_Block_Type type is
|
||
Compressed_Literals_Block (2). The format of
|
||
Huffman_Tree_Description can be found in Section 4.2.1. The size of
|
||
Huffman_Tree_Description is determined during the decoding process.
|
||
It must be used to determine where streams begin.
|
||
|
||
Total_Streams_Size = Compressed_Size
|
||
- Huffman_Tree_Description_Size
|
||
|
||
3.1.1.3.1.6. Jump_Table
|
||
|
||
The Jump_Table is only present when there are 4 Huffman-coded
|
||
streams.
|
||
|
||
(Reminder: Huffman-compressed data consists of either 1 or 4 Huffman-
|
||
coded streams.)
|
||
|
||
If only 1 stream is present, it is a single bitstream occupying the
|
||
entire remaining portion of the literals block, encoded as described
|
||
within Section 4.2.2.
|
||
|
||
If there are 4 streams, Literals_Section_Header only provides enough
|
||
information to know the decompressed and compressed sizes of all 4
|
||
streams combined. The decompressed size of each stream is equal to
|
||
(Regenerated_Size+3)/4, except for the last stream, which may be up
|
||
to 3 bytes smaller, to reach a total decompressed size as specified
|
||
in Regenerated_Size.
|
||
|
||
The compressed size of each stream is provided explicitly in the
|
||
Jump_Table. The Jump_Table is 6 bytes long and consists of three
|
||
2-byte little-endian fields, describing the compressed sizes of the
|
||
first 3 streams. Stream4_Size is computed from Total_Streams_Size
|
||
minus sizes of other streams.
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 18]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Stream4_Size = Total_Streams_Size - 6
|
||
- Stream1_Size - Stream2_Size
|
||
- Stream3_Size
|
||
|
||
Note that if Stream1_Size + Stream2_Size + Stream3_Size exceeds
|
||
Total_Streams_Size, the data are considered corrupted.
|
||
|
||
Each of these 4 bitstreams is then decoded independently as a
|
||
Huffman-Coded stream, as described in Section 4.2.2.
|
||
|
||
3.1.1.3.2. Sequences_Section
|
||
|
||
A compressed block is a succession of sequences. A sequence is a
|
||
literal copy command, followed by a match copy command. A literal
|
||
copy command specifies a length. It is the number of bytes to be
|
||
copied (or extracted) from the Literals Section. A match copy
|
||
command specifies an offset and a length.
|
||
|
||
When all sequences are decoded, if there are literals left in the
|
||
literals section, these bytes are added at the end of the block.
|
||
|
||
This is described in more detail in Section 3.1.1.4.
|
||
|
||
The Sequences_Section regroups all symbols required to decode
|
||
commands. There are three symbol types: literals lengths, offsets,
|
||
and match lengths. They are encoded together, interleaved, in a
|
||
single "bitstream".
|
||
|
||
The Sequences_Section starts by a header, followed by optional
|
||
probability tables for each symbol type, followed by the bitstream.
|
||
|
||
Sequences_Section_Header
|
||
[Literals_Length_Table]
|
||
[Offset_Table]
|
||
[Match_Length_Table]
|
||
bitStream
|
||
|
||
To decode the Sequences_Section, it's necessary to know its size.
|
||
This size is deduced from the size of the Literals_Section:
|
||
Sequences_Section_Size = Block_Size - Literals_Section_Header -
|
||
Literals_Section_Content
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 19]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
3.1.1.3.2.1. Sequences_Section_Header
|
||
|
||
This header consists of two items:
|
||
|
||
o Number_of_Sequences
|
||
|
||
o Symbol_Compression_Modes
|
||
|
||
Number_of_Sequences is a variable size field using between 1 and 3
|
||
bytes. If the first byte is "byte0":
|
||
|
||
o if (byte0 == 0): there are no sequences. The sequence section
|
||
stops here. Decompressed content is defined entirely as Literals
|
||
Section content. The FSE tables used in Repeat_Mode are not
|
||
updated.
|
||
|
||
o if (byte0 < 128): Number_of_Sequences = byte0. Uses 1 byte.
|
||
|
||
o if (byte0 < 255): Number_of_Sequences = ((byte0 - 128) << 8) +
|
||
byte1. Uses 2 bytes.
|
||
|
||
o if (byte0 == 255): Number_of_Sequences = byte1 + (byte2 << 8) +
|
||
0x7F00. Uses 3 bytes.
|
||
|
||
Symbol_Compression_Modes is a single byte, defining the compression
|
||
mode of each symbol type.
|
||
|
||
+-------------+----------------------+
|
||
| Bit Number | Field Name |
|
||
+-------------+----------------------+
|
||
| 7-6 | Literal_Lengths_Mode |
|
||
+-------------+----------------------+
|
||
| 5-4 | Offsets_Mode |
|
||
+-------------+----------------------+
|
||
| 3-2 | Match_Lengths_Mode |
|
||
+-------------+----------------------+
|
||
| 1-0 | Reserved |
|
||
+-------------+----------------------+
|
||
|
||
The last field, Reserved, must be all zeroes.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 20]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Literals_Lengths_Mode, Offsets_Mode, and Match_Lengths_Mode define
|
||
the Compression_Mode of literals lengths, offsets, and match lengths
|
||
symbols, respectively. They follow the same enumeration:
|
||
|
||
+-------+---------------------+
|
||
| Value | Compression_Mode |
|
||
+-------+---------------------+
|
||
| 0 | Predefined_Mode |
|
||
+-------+---------------------+
|
||
| 1 | RLE_Mode |
|
||
+-------+---------------------+
|
||
| 2 | FSE_Compressed_Mode |
|
||
+-------+---------------------+
|
||
| 3 | Repeat_Mode |
|
||
+-------+---------------------+
|
||
|
||
Predefined_Mode: A predefined FSE (see Section 4.1) distribution
|
||
table is used, as defined in Section 3.1.1.3.2.2. No distribution
|
||
table will be present.
|
||
|
||
RLE_Mode: The table description consists of a single byte, which
|
||
contains the symbol's value. This symbol will be used for all
|
||
sequences.
|
||
|
||
FSE_Compressed_Mode: Standard FSE compression. A distribution table
|
||
will be present. The format of this distribution table is
|
||
described in Section 4.1.1. Note that the maximum allowed
|
||
accuracy log for literals length and match length tables is 9, and
|
||
the maximum accuracy log for the offsets table is 8. This mode
|
||
must not be used when only one symbol is present; RLE_Mode should
|
||
be used instead (although any other mode will work).
|
||
|
||
Repeat_Mode: The table used in the previous Compressed_Block with
|
||
Number_Of_Sequences > 0 will be used again, or if this is the
|
||
first block, the table in the dictionary will be used. Note that
|
||
this includes RLE_Mode, so if Repeat_Mode follows RLE_Mode, the
|
||
same symbol will be repeated. It also includes Predefined_Mode,
|
||
in which case Repeat_Mode will have the same outcome as
|
||
Predefined_Mode. No distribution table will be present. If this
|
||
mode is used without any previous sequence table in the frame (or
|
||
dictionary; see Section 5) to repeat, this should be treated as
|
||
corruption.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 21]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
3.1.1.3.2.1.1. Sequence Codes for Lengths and Offsets
|
||
|
||
Each symbol is a code in its own context, which specifies Baseline
|
||
and Number_of_Bits to add. Codes are FSE compressed and interleaved
|
||
with raw additional bits in the same bitstream.
|
||
|
||
Literals length codes are values ranging from 0 to 35 inclusive.
|
||
They define lengths from 0 to 131071 bytes. The literals length is
|
||
equal to the decoded Baseline plus the result of reading
|
||
Number_of_Bits bits from the bitstream, as a little-endian value.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 22]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
+----------------------+----------+----------------+
|
||
| Literals_Length_Code | Baseline | Number_of_Bits |
|
||
+----------------------+----------+----------------+
|
||
| 0-15 | length | 0 |
|
||
+----------------------+----------+----------------+
|
||
| 16 | 16 | 1 |
|
||
+----------------------+----------+----------------+
|
||
| 17 | 18 | 1 |
|
||
+----------------------+----------+----------------+
|
||
| 18 | 20 | 1 |
|
||
+----------------------+----------+----------------+
|
||
| 19 | 22 | 1 |
|
||
+----------------------+----------+----------------+
|
||
| 20 | 24 | 2 |
|
||
+----------------------+----------+----------------+
|
||
| 21 | 28 | 2 |
|
||
+----------------------+----------+----------------+
|
||
| 22 | 32 | 3 |
|
||
+----------------------+----------+----------------+
|
||
| 23 | 40 | 3 |
|
||
+----------------------+----------+----------------+
|
||
| 24 | 48 | 4 |
|
||
+----------------------+----------+----------------+
|
||
| 25 | 64 | 6 |
|
||
+----------------------+----------+----------------+
|
||
| 26 | 128 | 7 |
|
||
+----------------------+----------+----------------+
|
||
| 27 | 256 | 8 |
|
||
+----------------------+----------+----------------+
|
||
| 28 | 512 | 9 |
|
||
+----------------------+----------+----------------+
|
||
| 29 | 1024 | 10 |
|
||
+----------------------+----------+----------------+
|
||
| 30 | 2048 | 11 |
|
||
+----------------------+----------+----------------+
|
||
| 31 | 4096 | 12 |
|
||
+----------------------+----------+----------------+
|
||
| 32 | 8192 | 13 |
|
||
+----------------------+----------+----------------+
|
||
| 33 | 16384 | 14 |
|
||
+----------------------+----------+----------------+
|
||
| 34 | 32768 | 15 |
|
||
+----------------------+----------+----------------+
|
||
| 35 | 65536 | 16 |
|
||
+----------------------+----------+----------------+
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 23]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Match length codes are values ranging from 0 to 52 inclusive. They
|
||
define lengths from 3 to 131074 bytes. The match length is equal to
|
||
the decoded Baseline plus the result of reading Number_of_Bits bits
|
||
from the bitstream, as a little-endian value.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 24]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
+-------------------+-----------------------+----------------+
|
||
| Match_Length_Code | Baseline | Number_of_Bits |
|
||
+-------------------+-----------------------+----------------+
|
||
| 0-31 | Match_Length_Code + 3 | 0 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 32 | 35 | 1 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 33 | 37 | 1 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 34 | 39 | 1 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 35 | 41 | 1 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 36 | 43 | 2 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 37 | 47 | 2 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 38 | 51 | 3 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 39 | 59 | 3 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 40 | 67 | 4 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 41 | 83 | 4 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 42 | 99 | 5 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 43 | 131 | 7 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 44 | 259 | 8 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 45 | 515 | 9 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 46 | 1027 | 10 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 47 | 2051 | 11 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 48 | 4099 | 12 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 49 | 8195 | 13 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 50 | 16387 | 14 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 51 | 32771 | 15 |
|
||
+-------------------+-----------------------+----------------+
|
||
| 52 | 65539 | 16 |
|
||
+-------------------+-----------------------+----------------+
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 25]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Offset codes are values ranging from 0 to N.
|
||
|
||
A decoder is free to limit its maximum supported value for N.
|
||
Support for values of at least 22 is recommended. At the time of
|
||
this writing, the reference decoder supports a maximum N value of 31.
|
||
|
||
An offset code is also the number of additional bits to read in
|
||
little-endian fashion and can be translated into an Offset_Value
|
||
using the following formulas:
|
||
|
||
Offset_Value = (1 << offsetCode) + readNBits(offsetCode);
|
||
if (Offset_Value > 3) Offset = Offset_Value - 3;
|
||
|
||
This means that maximum Offset_Value is (2^(N+1))-1, supporting back-
|
||
reference distance up to (2^(N+1))-4, but it is limited by the
|
||
maximum back-reference distance (see Section 3.1.1.1.2).
|
||
|
||
Offset_Value from 1 to 3 are special: they define "repeat codes".
|
||
This is described in more detail in Section 3.1.1.5.
|
||
|
||
3.1.1.3.2.1.2. Decoding Sequences
|
||
|
||
FSE bitstreams are read in reverse of the direction they are written.
|
||
In zstd, the compressor writes bits forward into a block, and the
|
||
decompressor must read the bitstream backwards.
|
||
|
||
To find the start of the bitstream, it is therefore necessary to know
|
||
the offset of the last byte of the block, which can be found by
|
||
counting Block_Size bytes after the block header.
|
||
|
||
After writing the last bit containing information, the compressor
|
||
writes a single 1 bit and then fills the byte with 0-7 zero bits of
|
||
padding. The last byte of the compressed bitstream cannot be zero
|
||
for that reason.
|
||
|
||
When decompressing, the last byte containing the padding is the first
|
||
byte to read. The decompressor needs to skip 0-7 initial zero bits
|
||
until the first 1 bit occurs. Afterwards, the useful part of the
|
||
bitstream begins.
|
||
|
||
FSE decoding requires a 'state' to be carried from symbol to symbol.
|
||
For more explanation on FSE decoding, see Section 4.1.
|
||
|
||
For sequence decoding, a separate state keeps track of each literal
|
||
lengths, offsets, and match lengths symbols. Some FSE primitives are
|
||
also used. For more details on the operation of these primitives,
|
||
see Section 4.1.
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 26]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
The bitstream starts with initial FSE state values, each using the
|
||
required number of bits in their respective accuracy, decoded
|
||
previously from their normalized distribution. It starts with
|
||
Literals_Length_State, followed by Offset_State, and finally
|
||
Match_Length_State.
|
||
|
||
Note that all values are read backward, so the 'start' of the
|
||
bitstream is at the highest position in memory, immediately before
|
||
the last 1 bit for padding.
|
||
|
||
After decoding the starting states, a single sequence is decoded
|
||
Number_Of_Sequences times. These sequences are decoded in order from
|
||
first to last. Since the compressor writes the bitstream in the
|
||
forward direction, this means the compressor must encode the
|
||
sequences starting with the last one and ending with the first.
|
||
|
||
For each of the symbol types, the FSE state can be used to determine
|
||
the appropriate code. The code then defines the Baseline and
|
||
Number_of_Bits to read for each type. The description of the codes
|
||
for how to determine these values can be found in
|
||
Section 3.1.1.3.2.1.
|
||
|
||
Decoding starts by reading the Number_of_Bits required to decode
|
||
offset. It does the same for Match_Length and then for
|
||
Literals_Length. This sequence is then used for Sequence Execution
|
||
(see Section 3.1.1.4).
|
||
|
||
If it is not the last sequence in the block, the next operation is to
|
||
update states. Using the rules pre-calculated in the decoding
|
||
tables, Literals_Length_State is updated, followed by
|
||
Match_Length_State, and then Offset_State. See Section 4.1 for
|
||
details on how to update states from the bitstream.
|
||
|
||
This operation will be repeated Number_of_Sequences times. At the
|
||
end, the bitstream shall be entirely consumed; otherwise, the
|
||
bitstream is considered corrupted.
|
||
|
||
3.1.1.3.2.2. Default Distributions
|
||
|
||
If Predefined_Mode is selected for a symbol type, its FSE decoding
|
||
table is generated from a predefined distribution table defined here.
|
||
For details on how to convert this distribution into a decoding
|
||
table, see Section 4.1.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 27]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
3.1.1.3.2.2.1. Literals Length
|
||
|
||
The decoding table uses an accuracy log of 6 bits (64 states).
|
||
|
||
short literalsLength_defaultDistribution[36] =
|
||
{ 4, 3, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1,
|
||
2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 2, 1, 1, 1, 1, 1,
|
||
-1,-1,-1,-1
|
||
};
|
||
|
||
3.1.1.3.2.2.2. Match Length
|
||
|
||
The decoding table uses an accuracy log of 6 bits (64 states).
|
||
|
||
short matchLengths_defaultDistribution[53] =
|
||
{ 1, 4, 3, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
|
||
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
|
||
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,-1,-1,
|
||
-1,-1,-1,-1,-1
|
||
};
|
||
|
||
3.1.1.3.2.2.3. Offset Codes
|
||
|
||
The decoding table uses an accuracy log of 5 bits (32 states), and
|
||
supports a maximum N value of 28, allowing offset values up to
|
||
536,870,908.
|
||
|
||
If any sequence in the compressed block requires a larger offset than
|
||
this, it's not possible to use the default distribution to represent
|
||
it.
|
||
|
||
short offsetCodes_defaultDistribution[29] =
|
||
{ 1, 1, 1, 1, 1, 1, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1,
|
||
1, 1, 1, 1, 1, 1, 1, 1,-1,-1,-1,-1,-1
|
||
};
|
||
|
||
3.1.1.4. Sequence Execution
|
||
|
||
Once literals and sequences have been decoded, they are combined to
|
||
produce the decoded content of a block.
|
||
|
||
Each sequence consists of a tuple of (literals_length, offset_value,
|
||
match_length), decoded as described in the
|
||
Sequences_Section (Section 3.1.1.3.2). To execute a sequence, first
|
||
copy literals_length bytes from the decoded literals to the output.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 28]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Then, match_length bytes are copied from previous decoded data. The
|
||
offset to copy from is determined by offset_value:
|
||
|
||
o if Offset_Value > 3, then the offset is Offset_Value - 3;
|
||
|
||
o if Offset_Value is from 1-3, the offset is a special repeat offset
|
||
value. See Section 3.1.1.5 for how the offset is determined in
|
||
this case.
|
||
|
||
The offset is defined as from the current position (after copying the
|
||
literals), so an offset of 6 and a match length of 3 means that 3
|
||
bytes should be copied from 6 bytes back. Note that all offsets
|
||
leading to previously decoded data must be smaller than Window_Size
|
||
defined in Frame_Header_Descriptor (Section 3.1.1.1.1).
|
||
|
||
3.1.1.5. Repeat Offsets
|
||
|
||
As seen above, the first three values define a repeated offset; we
|
||
will call them Repeated_Offset1, Repeated_Offset2, and
|
||
Repeated_Offset3. They are sorted in recency order, with
|
||
Repeated_Offset1 meaning "most recent one".
|
||
|
||
If offset_value is 1, then the offset used is Repeated_Offset1, etc.
|
||
|
||
There is one exception: When the current sequence's literals_length
|
||
is 0, repeated offsets are shifted by 1, so an offset_value of 1
|
||
means Repeated_Offset2, an offset_value of 2 means Repeated_Offset3,
|
||
and an offset_value of 3 means Repeated_Offset1 - 1_byte.
|
||
|
||
For the first block, the starting offset history is populated with
|
||
the following values: Repeated_Offset1 (1), Repeated_Offset2 (4), and
|
||
Repeated_Offset3 (8), unless a dictionary is used, in which case they
|
||
come from the dictionary.
|
||
|
||
Then each block gets its starting offset history from the ending
|
||
values of the most recent Compressed_Block. Note that blocks that
|
||
are not Compressed_Block are skipped; they do not contribute to
|
||
offset history.
|
||
|
||
The newest offset takes the lead in offset history, shifting others
|
||
back (up to its previous place if it was already present). This
|
||
means that when Repeated_Offset1 (most recent) is used, history is
|
||
unmodified. When Repeated_Offset2 is used, it is swapped with
|
||
Repeated_Offset1. If any other offset is used, it becomes
|
||
Repeated_Offset1, and the rest are shifted back by 1.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 29]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
3.1.2. Skippable Frames
|
||
|
||
+--------------+------------+-----------+
|
||
| Magic_Number | Frame_Size | User_Data |
|
||
+--------------+------------+-----------+
|
||
| 4 bytes | 4 bytes | n bytes |
|
||
+--------------+------------+-----------+
|
||
|
||
Skippable frames allow the insertion of user-defined metadata into a
|
||
flow of concatenated frames.
|
||
|
||
Skippable frames defined in this specification are compatible with
|
||
skippable frames in [LZ4].
|
||
|
||
From a compliant decoder perspective, skippable frames simply need to
|
||
be skipped, and their content ignored, resuming decoding after the
|
||
skippable frame.
|
||
|
||
It should be noted that a skippable frame can be used to watermark a
|
||
stream of concatenated frames embedding any kind of tracking
|
||
information (even just a Universally Unique Identifier (UUID)).
|
||
Users wary of such possibility should scan the stream of concatenated
|
||
frames in an attempt to detect such frames for analysis or removal.
|
||
|
||
The fields are:
|
||
|
||
Magic_Number: 4 bytes, little-endian format. Value: 0x184D2A5?,
|
||
which means any value from 0x184D2A50 to 0x184D2A5F. All 16
|
||
values are valid to identify a skippable frame. This
|
||
specification does not detail any specific tagging methods for
|
||
skippable frames.
|
||
|
||
Frame_Size: This is the size, in bytes, of the following User_Data
|
||
(without including the magic number nor the size field itself).
|
||
This field is represented using 4 bytes, little-endian format,
|
||
unsigned 32 bits. This means User_Data can't be bigger than
|
||
(2^32-1) bytes.
|
||
|
||
User_Data: This field can be anything. Data will just be skipped by
|
||
the decoder.
|
||
|
||
4. Entropy Encoding
|
||
|
||
Two types of entropy encoding are used by the Zstandard format: FSE
|
||
and Huffman coding. Huffman is used to compress literals, while FSE
|
||
is used for all other symbols (Literals_Length_Code,
|
||
Match_Length_Code, and offset codes) and to compress Huffman headers.
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 30]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
4.1. FSE
|
||
|
||
FSE, short for Finite State Entropy, is an entropy codec based on
|
||
[ANS]. FSE encoding/decoding involves a state that is carried over
|
||
between symbols, so decoding must be done in the opposite direction
|
||
as encoding. Therefore, all FSE bitstreams are read from end to
|
||
beginning. Note that the order of the bits in the stream is not
|
||
reversed; they are simply read in the reverse order from which they
|
||
were written.
|
||
|
||
For additional details on FSE, see Finite State Entropy [FSE].
|
||
|
||
FSE decoding involves a decoding table that has a power of 2 size and
|
||
contains three elements: Symbol, Num_Bits, and Baseline. The base 2
|
||
logarithm of the table size is its Accuracy_Log. An FSE state value
|
||
represents an index in this table.
|
||
|
||
To obtain the initial state value, consume Accuracy_Log bits from the
|
||
stream as a little-endian value. The next symbol in the stream is
|
||
the Symbol indicated in the table for that state. To obtain the next
|
||
state value, the decoder should consume Num_Bits bits from the stream
|
||
as a little-endian value and add it to Baseline.
|
||
|
||
4.1.1. FSE Table Description
|
||
|
||
To decode FSE streams, it is necessary to construct the decoding
|
||
table. The Zstandard format encodes FSE table descriptions as
|
||
described here.
|
||
|
||
An FSE distribution table describes the probabilities of all symbols
|
||
from 0 to the last present one (included) on a normalized scale of
|
||
(1 << Accuracy_Log). Note that there must be two or more symbols
|
||
with non-zero probability.
|
||
|
||
A bitstream is read forward, in little-endian fashion. It is not
|
||
necessary to know its exact size, since the size will be discovered
|
||
and reported by the decoding process. The bitstream starts by
|
||
reporting on which scale it operates. If low4bits designates the
|
||
lowest 4 bits of the first byte, then Accuracy_Log = low4bits + 5.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 31]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
This is followed by each symbol value, from 0 to the last present
|
||
one. The number of bits used by each field is variable and depends
|
||
on:
|
||
|
||
Remaining probabilities + 1: For example, presuming an Accuracy_Log
|
||
of 8, and presuming 100 probabilities points have already been
|
||
distributed, the decoder may read any value from 0 to
|
||
(256 - 100 + 1) == 157, inclusive. Therefore, it must read
|
||
log2sup(157) == 8 bits.
|
||
|
||
Value decoded: Small values use 1 fewer bit. For example, presuming
|
||
values from 0 to 157 (inclusive) are possible, 255 - 157 = 98
|
||
values are remaining in an 8-bit field. The first 98 values
|
||
(hence from 0 to 97) use only 7 bits, and values from 98 to 157
|
||
use 8 bits. This is achieved through this scheme:
|
||
|
||
+------------+---------------+-----------+
|
||
| Value Read | Value Decoded | Bits Used |
|
||
+------------+---------------+-----------+
|
||
| 0 - 97 | 0 - 97 | 7 |
|
||
+------------+---------------+-----------+
|
||
| 98 - 127 | 98 - 127 | 8 |
|
||
+------------+---------------+-----------+
|
||
| 128 - 225 | 0 - 97 | 7 |
|
||
+------------+---------------+-----------+
|
||
| 226 - 255 | 128 - 157 | 8 |
|
||
+------------+---------------+-----------+
|
||
|
||
Symbol probabilities are read one by one, in order. The probability
|
||
is obtained from Value decoded using the formula P = Value - 1. This
|
||
means the value 0 becomes the negative probability -1. This is a
|
||
special probability that means "less than 1". Its effect on the
|
||
distribution table is described below. For the purpose of
|
||
calculating total allocated probability points, it counts as 1.
|
||
|
||
When a symbol has a probability of zero, it is followed by a 2-bit
|
||
repeat flag. This repeat flag tells how many probabilities of zeroes
|
||
follow the current one. It provides a number ranging from 0 to 3.
|
||
If it is a 3, another 2-bit repeat flag follows, and so on.
|
||
|
||
When the last symbol reaches a cumulated total of
|
||
(1 << Accuracy_Log), decoding is complete. If the last symbol makes
|
||
the cumulated total go above (1 << Accuracy_Log), distribution is
|
||
considered corrupted.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 32]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Finally, the decoder can tell how many bytes were used in this
|
||
process and how many symbols are present. The bitstream consumes a
|
||
round number of bytes. Any remaining bit within the last byte is
|
||
simply unused.
|
||
|
||
The distribution of normalized probabilities is enough to create a
|
||
unique decoding table. The table has a size of (1 << Accuracy_Log).
|
||
Each cell describes the symbol decoded and instructions to get the
|
||
next state.
|
||
|
||
Symbols are scanned in their natural order for "less than 1"
|
||
probabilities as described above. Symbols with this probability are
|
||
being attributed a single cell, starting from the end of the table
|
||
and retreating. These symbols define a full state reset, reading
|
||
Accuracy_Log bits.
|
||
|
||
All remaining symbols are allocated in their natural order. Starting
|
||
from symbol 0 and table position 0, each symbol gets allocated as
|
||
many cells as its probability. Cell allocation is spread, not
|
||
linear; each successor position follows this rule:
|
||
|
||
position += (tableSize >> 1) + (tableSize >> 3) + 3;
|
||
position &= tableSize - 1;
|
||
|
||
A position is skipped if it is already occupied by a "less than 1"
|
||
probability symbol. Position does not reset between symbols; it
|
||
simply iterates through each position in the table, switching to the
|
||
next symbol when enough states have been allocated to the current
|
||
one.
|
||
|
||
The result is a list of state values. Each state will decode the
|
||
current symbol.
|
||
|
||
To get the Number_of_Bits and Baseline required for the next state,
|
||
it is first necessary to sort all states in their natural order. The
|
||
lower states will need 1 more bit than higher ones. The process is
|
||
repeated for each symbol.
|
||
|
||
For example, presuming a symbol has a probability of 5, it receives
|
||
five state values. States are sorted in natural order. The next
|
||
power of 2 is 8. The space of probabilities is divided into 8 equal
|
||
parts. Presuming the Accuracy_Log is 7, this defines 128 states, and
|
||
each share (divided by 8) is 16 in size. In order to reach 8, 8 - 5
|
||
= 3 lowest states will count "double", doubling the number of shares
|
||
(32 in width), requiring 1 more bit in the process.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 33]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Baseline is assigned starting from the higher states using fewer
|
||
bits, and proceeding naturally, then resuming at the first state,
|
||
each taking its allocated width from Baseline.
|
||
|
||
+----------------+-------+-------+--------+------+-------+
|
||
| state order | 0 | 1 | 2 | 3 | 4 |
|
||
+----------------+-------+-------+--------+------+-------+
|
||
| width | 32 | 32 | 32 | 16 | 16 |
|
||
+----------------+-------+-------+--------+------+-------+
|
||
| Number_of_Bits | 5 | 5 | 5 | 4 | 4 |
|
||
+----------------+-------+-------+--------+------+-------+
|
||
| range number | 2 | 4 | 6 | 0 | 1 |
|
||
+----------------+-------+-------+--------+------+-------+
|
||
| Baseline | 32 | 64 | 96 | 0 | 16 |
|
||
+----------------+-------+-------+--------+------+-------+
|
||
| range | 32-63 | 64-95 | 96-127 | 0-15 | 16-31 |
|
||
+----------------+-------+-------+--------+------+-------+
|
||
|
||
The next state is determined from the current state by reading the
|
||
required Number_of_Bits and adding the specified Baseline.
|
||
|
||
See Appendix A for the results of this process that are applied to
|
||
the default distributions.
|
||
|
||
4.2. Huffman Coding
|
||
|
||
Zstandard Huffman-coded streams are read backwards, similar to the
|
||
FSE bitstreams. Therefore, to find the start of the bitstream, it is
|
||
necessary to know the offset of the last byte of the Huffman-coded
|
||
stream.
|
||
|
||
After writing the last bit containing information, the compressor
|
||
writes a single 1 bit and then fills the byte with 0-7 0 bits of
|
||
padding. The last byte of the compressed bitstream cannot be 0 for
|
||
that reason.
|
||
|
||
When decompressing, the last byte containing the padding is the first
|
||
byte to read. The decompressor needs to skip 0-7 initial 0 bits and
|
||
the first 1 bit that occurs. Afterwards, the useful part of the
|
||
bitstream begins.
|
||
|
||
The bitstream contains Huffman-coded symbols in little-endian order,
|
||
with the codes defined by the method below.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 34]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
4.2.1. Huffman Tree Description
|
||
|
||
Prefix coding represents symbols from an a priori known alphabet by
|
||
bit sequences (codewords), one codeword for each symbol, in a manner
|
||
such that different symbols may be represented by bit sequences of
|
||
different lengths, but a parser can always parse an encoded string
|
||
unambiguously symbol by symbol.
|
||
|
||
Given an alphabet with known symbol frequencies, the Huffman
|
||
algorithm allows the construction of an optimal prefix code using the
|
||
fewest bits of any possible prefix codes for that alphabet.
|
||
|
||
The prefix code must not exceed a maximum code length. More bits
|
||
improve accuracy but yield a larger header size and require more
|
||
memory or more complex decoding operations. This specification
|
||
limits the maximum code length to 11 bits.
|
||
|
||
All literal values from zero (included) to the last present one
|
||
(excluded) are represented by Weight with values from 0 to
|
||
Max_Number_of_Bits. Transformation from Weight to Number_of_Bits
|
||
follows this pseudocode:
|
||
|
||
if Weight == 0
|
||
Number_of_Bits = 0
|
||
else
|
||
Number_of_Bits = Max_Number_of_Bits + 1 - Weight
|
||
|
||
The last symbol's Weight is deduced from previously decoded ones, by
|
||
completing to the nearest power of 2. This power of 2 gives
|
||
Max_Number_of_Bits the depth of the current tree.
|
||
|
||
For example, presume the following Huffman tree must be described:
|
||
|
||
+---------------+----------------+
|
||
| Literal Value | Number_of_Bits |
|
||
+---------------+----------------+
|
||
| 0 | 1 |
|
||
+---------------+----------------+
|
||
| 1 | 2 |
|
||
+---------------+----------------+
|
||
| 2 | 3 |
|
||
+---------------+----------------+
|
||
| 3 | 0 |
|
||
+---------------+----------------+
|
||
| 4 | 4 |
|
||
+---------------+----------------+
|
||
| 5 | 4 |
|
||
+---------------+----------------+
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 35]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
The tree depth is 4, since its longest element uses 4 bits. (The
|
||
longest elements are those with the smallest frequencies.) Value 5
|
||
will not be listed as it can be determined from the values for 0-4,
|
||
nor will values above 5 as they are all 0. Values from 0 to 4 will
|
||
be listed using Weight instead of Number_of_Bits. The pseudocode to
|
||
determine Weight is:
|
||
|
||
if Number_of_Bits == 0
|
||
Weight = 0
|
||
else
|
||
Weight = Max_Number_of_Bits + 1 - Number_of_Bits
|
||
|
||
It gives the following series of weights:
|
||
|
||
+---------------+--------+
|
||
| Literal Value | Weight |
|
||
+---------------+--------+
|
||
| 0 | 4 |
|
||
+---------------+--------+
|
||
| 1 | 3 |
|
||
+---------------+--------+
|
||
| 2 | 2 |
|
||
+---------------+--------+
|
||
| 3 | 0 |
|
||
+---------------+--------+
|
||
| 4 | 1 |
|
||
+---------------+--------+
|
||
|
||
The decoder will do the inverse operation: having collected weights
|
||
of literals from 0 to 4, it knows the last literal, 5, is present
|
||
with a non-zero Weight. The Weight of 5 can be determined by
|
||
advancing to the next power of 2. The sum of 2^(Weight-1) (excluding
|
||
0's) is 15. The nearest power of 2 is 16. Therefore,
|
||
Max_Number_of_Bits = 4 and Weight[5] = 16 - 15 = 1.
|
||
|
||
4.2.1.1. Huffman Tree Header
|
||
|
||
This is a single byte value (0-255), which describes how the series
|
||
of weights is encoded.
|
||
|
||
headerByte < 128: The series of weights is compressed using FSE (see
|
||
below). The length of the FSE-compressed series is equal to
|
||
headerByte (0-127).
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 36]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
headerByte >= 128: This is a direct representation, where each
|
||
Weight is written directly as a 4-bit field (0-15). They are
|
||
encoded forward, 2 weights to a byte with the first weight taking
|
||
the top 4 bits and the second taking the bottom 4; for example,
|
||
the following operations could be used to read the weights:
|
||
|
||
Weight[0] = (Byte[0] >> 4)
|
||
Weight[1] = (Byte[0] & 0xf),
|
||
etc.
|
||
|
||
The full representation occupies ceiling(Number_of_Symbols/2)
|
||
bytes, meaning it uses only full bytes even if Number_of_Symbols
|
||
is odd. Number_of_Symbols = headerByte - 127. Note that maximum
|
||
Number_of_Symbols is 255 - 127 = 128. If any literal has a value
|
||
over 128, raw header mode is not possible, and it is necessary to
|
||
use FSE compression.
|
||
|
||
4.2.1.2. FSE Compression of Huffman Weights
|
||
|
||
In this case, the series of Huffman weights is compressed using FSE
|
||
compression. It is a single bitstream with two interleaved states,
|
||
sharing a single distribution table.
|
||
|
||
To decode an FSE bitstream, it is necessary to know its compressed
|
||
size. Compressed size is provided by headerByte. It's also
|
||
necessary to know its maximum possible decompressed size, which is
|
||
255, since literal values span from 0 to 255, and the last symbol's
|
||
Weight is not represented.
|
||
|
||
An FSE bitstream starts by a header, describing probabilities
|
||
distribution. It will create a decoding table. For a list of
|
||
Huffman weights, the maximum accuracy log is 6 bits. For more
|
||
details, see Section 4.1.1.
|
||
|
||
The Huffman header compression uses two states, which share the same
|
||
FSE distribution table. The first state (State1) encodes the even-
|
||
numbered index symbols, and the second (State2) encodes the odd-
|
||
numbered index symbols. State1 is initialized first, and then
|
||
State2, and they take turns decoding a single symbol and updating
|
||
their state. For more details on these FSE operations, see
|
||
Section 4.1.
|
||
|
||
The number of symbols to be decoded is determined by tracking the
|
||
bitStream overflow condition: If updating state after decoding a
|
||
symbol would require more bits than remain in the stream, it is
|
||
assumed that extra bits are zero. Then, symbols for each of the
|
||
final states are decoded and the process is complete.
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 37]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
4.2.1.3. Conversion from Weights to Huffman Prefix Codes
|
||
|
||
All present symbols will now have a Weight value. It is possible to
|
||
transform weights into Number_of_Bits, using this formula:
|
||
|
||
if Weight > 0
|
||
Number_of_Bits = Max_Number_of_Bits + 1 - Weight
|
||
else
|
||
Number_of_Bits = 0
|
||
|
||
Symbols are sorted by Weight. Within the same Weight, symbols keep
|
||
natural sequential order. Symbols with a Weight of zero are removed.
|
||
Then, starting from the lowest Weight, prefix codes are distributed
|
||
in sequential order.
|
||
|
||
For example, assume the following list of weights has been decoded:
|
||
|
||
+---------+--------+
|
||
| Literal | Weight |
|
||
+---------+--------+
|
||
| 0 | 4 |
|
||
+---------+--------+
|
||
| 1 | 3 |
|
||
+---------+--------+
|
||
| 2 | 2 |
|
||
+---------+--------+
|
||
| 3 | 0 |
|
||
+---------+--------+
|
||
| 4 | 1 |
|
||
+---------+--------+
|
||
| 5 | 1 |
|
||
+---------+--------+
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 38]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Sorting by weight and then the natural sequential order yields the
|
||
following distribution:
|
||
|
||
+---------+--------+----------------+--------------+
|
||
| Literal | Weight | Number_Of_Bits | Prefix Codes |
|
||
+---------+--------+----------------|--------------+
|
||
| 3 | 0 | 0 | N/A |
|
||
+---------+--------+----------------|--------------+
|
||
| 4 | 1 | 4 | 0000 |
|
||
+---------+--------+----------------|--------------+
|
||
| 5 | 1 | 4 | 0001 |
|
||
+---------+--------+----------------|--------------+
|
||
| 2 | 2 | 3 | 001 |
|
||
+---------+--------+----------------|--------------+
|
||
| 1 | 3 | 2 | 01 |
|
||
+---------+--------+----------------|--------------+
|
||
| 0 | 4 | 1 | 1 |
|
||
+---------+--------+----------------|--------------+
|
||
|
||
4.2.2. Huffman-Coded Streams
|
||
|
||
Given a Huffman decoding table, it is possible to decode a Huffman-
|
||
coded stream.
|
||
|
||
Each bitstream must be read backward, which starts from the end and
|
||
goes up to the beginning. Therefore, it is necessary to know the
|
||
size of each bitstream.
|
||
|
||
It is also necessary to know exactly which bit is the last. This is
|
||
detected by a final bit flag: the highest bit of the last byte is a
|
||
final-bit-flag. Consequently, a last byte of 0 is not possible. And
|
||
the final-bit-flag itself is not part of the useful bitstream.
|
||
Hence, the last byte contains between 0 and 7 useful bits.
|
||
|
||
Starting from the end, it is possible to read the bitstream in a
|
||
little-endian fashion, keeping track of already used bits. Since the
|
||
bitstream is encoded in reverse order, starting from the end, read
|
||
symbols in forward order.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 39]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
For example, if the literal sequence "0145" was encoded using the
|
||
above prefix code, it would be encoded (in reverse order) as:
|
||
|
||
+---------+----------+
|
||
| Symbol | Encoding |
|
||
+---------+----------+
|
||
| 5 | 0000 |
|
||
+---------+----------+
|
||
| 4 | 0001 |
|
||
+---------+----------+
|
||
| 1 | 01 |
|
||
+---------+----------+
|
||
| 0 | 1 |
|
||
+---------+----------+
|
||
| Padding | 00001 |
|
||
+---------+----------+
|
||
|
||
This results in the following 2-byte bitstream:
|
||
|
||
00010000 00001101
|
||
|
||
Here is an alternative representation with the symbol codes separated
|
||
by underscores:
|
||
|
||
0001_0000 00001_1_01
|
||
|
||
Reading the highest Max_Number_of_Bits bits, it's possible to compare
|
||
the extracted value to the decoding table, determining the symbol to
|
||
decode and number of bits to discard.
|
||
|
||
The process continues reading up to the required number of symbols
|
||
per stream. If a bitstream is not entirely and exactly consumed,
|
||
hence reaching exactly its beginning position with all bits consumed,
|
||
the decoding process is considered faulty.
|
||
|
||
5. Dictionary Format
|
||
|
||
Zstandard is compatible with "raw content" dictionaries, free of any
|
||
format restriction, except that they must be at least 8 bytes. These
|
||
dictionaries function as if they were just the content part of a
|
||
formatted dictionary.
|
||
|
||
However, dictionaries created by "zstd --train" in the reference
|
||
implementation follow a specific format, described here.
|
||
|
||
Dictionaries are not included in the compressed content but rather
|
||
are provided out of band. That is, the Dictionary_ID identifies
|
||
which should be used, but this specification does not describe the
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 40]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
mechanism by which the dictionary is obtained prior to use during
|
||
compression or decompression.
|
||
|
||
A dictionary has a size, defined either by a buffer limit or a file
|
||
size. The general format is:
|
||
|
||
+--------------+---------------+----------------+---------+
|
||
| Magic_Number | Dictionary_ID | Entropy_Tables | Content |
|
||
+--------------+---------------+----------------+---------+
|
||
|
||
Magic_Number: 4 bytes ID, value 0xEC30A437, little-endian format.
|
||
|
||
Dictionary_ID: 4 bytes, stored in little-endian format.
|
||
Dictionary_ID can be any value, except 0 (which means no
|
||
Dictionary_ID). It is used by decoders to check if they use the
|
||
correct dictionary. If the frame is going to be distributed in a
|
||
private environment, any Dictionary_ID can be used. However, for
|
||
public distribution of compressed frames, the following ranges are
|
||
reserved and shall not be used:
|
||
|
||
low range: <= 32767
|
||
high range: >= (2^31)
|
||
|
||
Entropy_Tables: Follow the same format as the tables in compressed
|
||
blocks. See the relevant FSE and Huffman sections for how to
|
||
decode these tables. They are stored in the following order:
|
||
Huffman table for literals, FSE table for offsets, FSE table for
|
||
match lengths, and FSE table for literals lengths. These tables
|
||
populate the Repeat Stats literals mode and Repeat distribution
|
||
mode for sequence decoding. It is finally followed by 3 offset
|
||
values, populating repeat offsets (instead of using {1,4,8}),
|
||
stored in order, 4-bytes little-endian each, for a total of 12
|
||
bytes. Each repeat offset must have a value less than the
|
||
dictionary size.
|
||
|
||
Content: The rest of the dictionary is its content. The content
|
||
acts as a "past" in front of data to be compressed or
|
||
decompressed, so it can be referenced in sequence commands. As
|
||
long as the amount of data decoded from this frame is less than or
|
||
equal to Window_Size, sequence commands may specify offsets longer
|
||
than the total length of decoded output so far to reference back
|
||
to the dictionary, even parts of the dictionary with offsets
|
||
larger than Window_Size. After the total output has surpassed
|
||
Window_Size, however, this is no longer allowed, and the
|
||
dictionary is no longer accessible.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 41]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
6. IANA Considerations
|
||
|
||
IANA has made two registrations, as described below.
|
||
|
||
6.1. The 'application/zstd' Media Type
|
||
|
||
The 'application/zstd' media type identifies a block of data that is
|
||
compressed using zstd compression. The data is a stream of bytes as
|
||
described in this document. IANA has added the following to the
|
||
"Media Types" registry:
|
||
|
||
Type name: application
|
||
|
||
Subtype name: zstd
|
||
|
||
Required parameters: N/A
|
||
|
||
Optional parameters: N/A
|
||
|
||
Encoding considerations: binary
|
||
|
||
Security considerations: See Section 7 of RFC 8478
|
||
|
||
Interoperability considerations: N/A
|
||
|
||
Published specification: RFC 8478
|
||
|
||
Applications that use this media type: anywhere data size is an
|
||
issue
|
||
|
||
Additional information:
|
||
|
||
Magic number(s): 4 bytes, little-endian format.
|
||
Value: 0xFD2FB528
|
||
|
||
File extension(s): zst
|
||
|
||
Macintosh file type code(s): N/A
|
||
|
||
For further information: See [ZSTD]
|
||
|
||
Intended usage: common
|
||
|
||
Restrictions on usage: N/A
|
||
|
||
Author: Murray S. Kucherawy
|
||
|
||
Change Controller: IETF
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 42]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Provisional registration: no
|
||
|
||
6.2. Content Encoding
|
||
|
||
IANA has added the following entry to the "HTTP Content Coding
|
||
Registry" within the "Hypertext Transfer Protocol (HTTP) Parameters"
|
||
registry:
|
||
|
||
Name: zstd
|
||
|
||
Description: A stream of bytes compressed using the Zstandard
|
||
protocol
|
||
|
||
Pointer to specification text: RFC 8478
|
||
|
||
6.3. Dictionaries
|
||
|
||
Work in progress includes development of dictionaries that will
|
||
optimize compression and decompression of particular types of data.
|
||
Specification of such dictionaries for public use will necessitate
|
||
registration of a code point from the reserved range described in
|
||
Section 3.1.1.1.3 and its association with a specific dictionary.
|
||
|
||
However, there are at present no such dictionaries published for
|
||
public use, so this document makes no immediate request of IANA to
|
||
create such a registry.
|
||
|
||
7. Security Considerations
|
||
|
||
Any data compression method involves the reduction of redundancy in
|
||
the data. Zstandard is no exception, and the usual precautions
|
||
apply.
|
||
|
||
One should never compress a message whose content must remain secret
|
||
with a message generated by a third party. Such a compression can be
|
||
used to guess the content of the secret message through analysis of
|
||
entropy reduction. This was demonstrated in the Compression Ratio
|
||
Info-leak Made Easy (CRIME) attack [CRIME], for example.
|
||
|
||
A decoder has to demonstrate capabilities to detect and prevent any
|
||
kind of data tampering in the compressed frame from triggering system
|
||
faults, such as reading or writing beyond allowed memory ranges.
|
||
This can be guaranteed by either the implementation language or
|
||
careful bound checkings. Of particular note is the encoding of
|
||
Number_of_Sequences values that cause the decoder to read into the
|
||
block header (and beyond), as well as the indication of a
|
||
Frame_Content_Size that is smaller than the actual decompressed data,
|
||
in an attempt to trigger a buffer overflow. It is highly recommended
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 43]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
to fuzz-test (i.e., provide invalid, unexpected, or random input and
|
||
verify safe operation of) decoder implementations to test and harden
|
||
their capability to detect bad frames and deal with them without any
|
||
adverse system side effect.
|
||
|
||
An attacker may provide correctly formed compressed frames with
|
||
unreasonable memory requirements. A decoder must always control
|
||
memory requirements and enforce some (system-specific) limits in
|
||
order to protect memory usage from such scenarios.
|
||
|
||
Compression can be optimized by training a dictionary on a variety of
|
||
related content payloads. This dictionary must then be available at
|
||
the decoder for decompression of the payload to be possible. While
|
||
this document does not specify how to acquire a dictionary for a
|
||
given compressed payload, it is worth noting that third-party
|
||
dictionaries may interact unexpectedly with a decoder, leading to
|
||
possible memory or other resource exhaustion attacks. We expect such
|
||
topics to be discussed in further detail in the Security
|
||
Considerations section of a forthcoming RFC for dictionary
|
||
acquisition and transmission, but highlight this issue now out of an
|
||
abundance of caution.
|
||
|
||
As discussed in Section 3.1.2, it is possible to store arbitrary user
|
||
metadata in skippable frames. While such frames are ignored during
|
||
decompression of the data, they can be used as a watermark to track
|
||
the path of the compressed payload.
|
||
|
||
8. Implementation Status
|
||
|
||
Source code for a C language implementation of a Zstandard-compliant
|
||
library is available at [ZSTD-GITHUB]. This implementation is
|
||
considered to be the reference implementation and is production
|
||
ready; it implements the full range of the specification. It is
|
||
routinely tested against security hazards and widely deployed within
|
||
Facebook infrastructure.
|
||
|
||
The reference version is optimized for speed and is highly portable.
|
||
It has been proven to run safely on multiple architectures (e.g.,
|
||
x86, x64, ARM, MIPS, PowerPC, IA64) featuring 32- or 64-bit
|
||
addressing schemes, a little- or big-endian storage scheme, a number
|
||
of different operating systems (e.g., UNIX (including Linux, BSD,
|
||
OS-X, and Solaris) and Windows), and a number of compilers (e.g.,
|
||
gcc, clang, visual, and icc).
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 44]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
9. References
|
||
|
||
9.1. Normative References
|
||
|
||
[ZSTD] "Zstandard", <http://www.zstd.net>.
|
||
|
||
9.2. Informative References
|
||
|
||
[ANS] Duda, J., "Asymmetric numeral systems: entropy coding
|
||
combining speed of Huffman coding with compression rate of
|
||
arithmetic coding", January 2014,
|
||
<https://arxiv.org/pdf/1311.2540>.
|
||
|
||
[CRIME] "CRIME", June 2018, <https://en.wikipedia.org/w/
|
||
index.php?title=CRIME&oldid=844538656>.
|
||
|
||
[FSE] "FiniteStateEntropy", commit 6efa78a, June 2018,
|
||
<https://github.com/Cyan4973/FiniteStateEntropy/>.
|
||
|
||
[LZ4] "LZ4 Frame Format Description", commit d03224b, January
|
||
2018, <https://github.com/lz4/lz4/blob/master/doc/
|
||
lz4_Frame_format.md>.
|
||
|
||
[RFC1952] Deutsch, P., "GZIP file format specification version 4.3",
|
||
RFC 1952, DOI 10.17487/RFC1952, May 1996,
|
||
<https://www.rfc-editor.org/info/rfc1952>.
|
||
|
||
[XXHASH] "XXHASH Algorithm", <http://www.xxhash.org>.
|
||
|
||
[ZSTD-GITHUB]
|
||
"zstd", commit 8514bd8, August 2018,
|
||
<https://github.com/facebook/zstd>.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 45]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Appendix A. Decoding Tables for Predefined Codes
|
||
|
||
This appendix contains FSE decoding tables for the predefined literal
|
||
length, match length, and offset codes. The tables have been
|
||
constructed using the algorithm as given above in Section 4.1.1. The
|
||
tables here can be used as examples to crosscheck that an
|
||
implementation has built its decoding tables correctly.
|
||
|
||
A.1. Literal Length Code Table
|
||
|
||
+-------+--------+----------------+------+
|
||
| State | Symbol | Number_Of_Bits | Base |
|
||
+-------+--------+----------------+------+
|
||
| 0 | 0 | 0 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 0 | 0 | 4 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 1 | 0 | 4 | 16 |
|
||
+-------+--------+----------------+------+
|
||
| 2 | 1 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 3 | 3 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 4 | 4 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 5 | 6 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 6 | 7 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 7 | 9 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 8 | 10 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 9 | 12 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 10 | 14 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 11 | 16 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 12 | 18 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 13 | 19 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 14 | 21 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 15 | 22 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 16 | 24 | 5 | 0 |
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 46]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
+-------+--------+----------------+------+
|
||
| 17 | 25 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 18 | 26 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 19 | 27 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 20 | 29 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 21 | 31 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 22 | 0 | 4 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 23 | 1 | 4 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 24 | 2 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 25 | 4 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 26 | 5 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 27 | 7 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 28 | 8 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 29 | 10 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 30 | 11 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 31 | 13 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 32 | 16 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 33 | 17 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 34 | 19 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 35 | 20 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 36 | 22 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 37 | 23 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 38 | 25 | 4 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 39 | 25 | 4 | 16 |
|
||
+-------+--------+----------------+------+
|
||
| 40 | 26 | 5 | 32 |
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 47]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
+-------+--------+----------------+------+
|
||
| 41 | 28 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 42 | 30 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 43 | 0 | 4 | 48 |
|
||
+-------+--------+----------------+------+
|
||
| 44 | 1 | 4 | 16 |
|
||
+-------+--------+----------------+------+
|
||
| 45 | 2 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 46 | 3 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 47 | 5 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 48 | 6 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 49 | 8 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 50 | 9 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 51 | 11 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 52 | 12 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 53 | 15 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 54 | 17 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 55 | 18 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 56 | 20 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 57 | 21 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 58 | 23 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 59 | 24 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 60 | 35 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 61 | 34 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 62 | 33 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 63 | 32 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 48]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
A.2. Match Length Code Table
|
||
|
||
+-------+--------+----------------+------+
|
||
| State | Symbol | Number_Of_Bits | Base |
|
||
+-------+--------+----------------+------+
|
||
| 0 | 0 | 0 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 0 | 0 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 1 | 1 | 4 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 2 | 2 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 3 | 3 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 4 | 5 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 5 | 6 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 6 | 8 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 7 | 10 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 8 | 13 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 9 | 16 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 10 | 19 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 11 | 22 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 12 | 25 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 13 | 28 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 14 | 31 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 15 | 33 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 16 | 35 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 17 | 37 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 18 | 39 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 19 | 41 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 20 | 43 | 6 | 0 |
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 49]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
+-------+--------+----------------+------+
|
||
| 21 | 45 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 22 | 1 | 4 | 16 |
|
||
+-------+--------+----------------+------+
|
||
| 23 | 2 | 4 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 24 | 3 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 25 | 4 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 26 | 6 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 27 | 7 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 28 | 9 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 29 | 12 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 30 | 15 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 31 | 18 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 32 | 21 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 33 | 24 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 34 | 27 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 35 | 30 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 36 | 32 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 37 | 34 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 38 | 36 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 39 | 38 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 40 | 40 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 41 | 42 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 42 | 44 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 43 | 1 | 4 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 44 | 1 | 4 | 48 |
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 50]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
+-------+--------+----------------+------+
|
||
| 45 | 2 | 4 | 16 |
|
||
+-------+--------+----------------+------+
|
||
| 46 | 4 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 47 | 5 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 48 | 7 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 49 | 8 | 5 | 32 |
|
||
+-------+--------+----------------+------+
|
||
| 50 | 11 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 51 | 14 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 52 | 17 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 53 | 20 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 54 | 23 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 55 | 26 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 56 | 29 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 57 | 52 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 58 | 51 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 59 | 50 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 60 | 49 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 61 | 48 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 62 | 47 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 63 | 46 | 6 | 0 |
|
||
+-------+--------+----------------+------+
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 51]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
A.3. Offset Code Table
|
||
|
||
+-------+--------+----------------+------+
|
||
| State | Symbol | Number_Of_Bits | Base |
|
||
+-------+--------+----------------+------+
|
||
| 0 | 0 | 0 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 0 | 0 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 1 | 6 | 4 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 2 | 9 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 3 | 15 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 4 | 21 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 5 | 3 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 6 | 7 | 4 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 7 | 12 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 8 | 18 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 9 | 23 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 10 | 5 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 11 | 8 | 4 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 12 | 14 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 13 | 20 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 14 | 2 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 15 | 7 | 4 | 16 |
|
||
+-------+--------+----------------+------+
|
||
| 16 | 11 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 17 | 17 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 18 | 22 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 19 | 4 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 20 | 8 | 4 | 16 |
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 52]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
+-------+--------+----------------+------+
|
||
| 21 | 13 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 22 | 19 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 23 | 1 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 24 | 6 | 4 | 16 |
|
||
+-------+--------+----------------+------+
|
||
| 25 | 10 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 26 | 16 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 27 | 28 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 28 | 27 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 29 | 26 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 30 | 25 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
| 31 | 24 | 5 | 0 |
|
||
+-------+--------+----------------+------+
|
||
|
||
Acknowledgments
|
||
|
||
zstd was developed by Yann Collet.
|
||
|
||
Bobo Bose-Kolanu, Felix Handte, Kyle Nekritz, Nick Terrell, and David
|
||
Schleimer provided helpful feedback during the development of this
|
||
document.
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 53]
|
||
|
||
RFC 8478 application/zstd October 2018
|
||
|
||
|
||
Authors' Addresses
|
||
|
||
Yann Collet
|
||
Facebook
|
||
1 Hacker Way
|
||
Menlo Park, CA 94025
|
||
United States of America
|
||
|
||
Email: cyan@fb.com
|
||
|
||
|
||
Murray S. Kucherawy (editor)
|
||
Facebook
|
||
1 Hacker Way
|
||
Menlo Park, CA 94025
|
||
United States of America
|
||
|
||
Email: msk@fb.com
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
|
||
Collet & Kucherawy Informational [Page 54]
|
||
|