Wire format

This section specifies an encoding for D3S values as octet sequences.

Each D3S value may have many distinct encodings. There is a unique canonical encoding of each D3S value.

The flexibility permitted by the existence of multiple encodings permits efficient implementations. The canonical encoding is useful in specific circumstances, such as where the encodings are to be hashed, but determining the canonical encoding of a value may be significantly more computationally expensive than merely constructing an arbitrary encoding.

Wire format specification

This section describes the wire format encoding in terms of how to decode it If an encoding decodes to a particular value then that it is an encoding of that value.

Encoding are self-delimiting, i.e., it’s possible to determine where an encoding ends. This fact makes it possible to concatenate encodings without causing ambiguity.

Raw integer format

Non-negative integers are used throughout the encoding, to represent integer values, symbols, and to signify lengths. Such integers are always represented radix-256, most significant octet first.

Encoding structure

An encoding of a value consists of three logical parts:

a format code, which describes how to interpret the remaining octets of the encoding as a value;
an indicator, which is a non-negative integer; and
a payload, whose length and nature is determined by the format and indicator.

In order to save space, the format and indicator can be packed into a single octet.

In the following descriptions, d denotes the indicator value. The following formats are used.

Non-negative integer: The result shall simply be the integer d. The payload shall be empty.
Non-positive integer: The result shall be the integer -d.

It is possible that d=0.

The payload shall be empty.

String: The format and indicator shall be followed by a further d-octet payload containing a Unicode string encoded using UTF-8 [Con07, 3.9]. If the payload octets are ill-formed UTF-8, the encoding is invalid. The result shall be a string whose contents are the Unicode string encoded in the payload.
Symbol, by name: The format and indicator shall be followed by a further d-octet payload containing a Unicode string encoding using UTF-8. If the payload octets are not valid UTF-8, the encoding is invalid. The result shall be the symbol whose name is the Unicode string encoded in the payload.
Byte-block: The format and indicator shall be followed by a further d-octet payload. The result shall be a byte-block whose contents are precisely the payload.
List: The format and indicator shall be followed by a payload consisting of the concatenation of d further value encodings. The result shall be a list whose elements are the values encoded in the payload, in order.
Set: The format and indicator shall be followed by a payload consisting of the concatenation of d further value encodings. The result shall be a set whose elements are the values encoded in the payload.
Map: The format and indicator shall be followed by a payload consisting of the concatenation of 2d further value encodings. The result shall be a map whose associations are constructed by taking the values encoded in the payload in pairs, first key, then value.

Specification

An encoding cannot be empty. The first octet of the encoding describes how to interpret the remaining octets. Table 3.2.1 describes how to determine the format and indicator from this first octet. No other octet begins a valid encoding. Implementations shall not extend the wire format. An encoder shall not construct invalid encodings; a decoder shall reject invalid encodings.

Table 3.2.1: Encodings by the first octet (normative)

First octet Interpretation

First octet	Interpretation
`000d dddd₂`	Format is 'non-negative integer'; indicator is binary `d dddd`.
`0010 dddd₂`	Format is 'string'; indicator is binary `dddd`.
`0011 dddd₂`	Format is 'symbol, by name'; indicator is binary `dddd`.
`1000 dddd₂`	Format is 'byte-block'; indicator is binary `dddd`.
`1001 dddd₂`	Format is 'list'; indicator is binary `dddd`.
`1010 dddd₂`	Format is 'set'; indicator is binary `dddd`.
`1011 dddd₂`	Format is 'map'; indicator is binary `dddd`.
`1100 tttt₂`	Format is given by `tttt`, as shown in Table 3.2.2; indicator is stored in the following octet.
`1101 tttt₂`	Format is given by `tttt`, as shown in Table 3.2.2; indicator is stored in the following 2 octets.
`1111 0000₂`	A padding octet. Ignore this octet and interpret the next as being the start of an encoding.
`1111 0010₂`	Format is given by the next octet, as shown in Table 3.2.2; indicator is stored in the following 4 octets.
`1111 0011₂`	Format is given by the next octet, as shown in Table 3.2.2; indicator is stored in the following 8 octets.
`1111 0100₂`	Format is 'non-negative integer'. The following octets shall encode a byte-block value; the indicator is the integer encoded in the payload of the byte-block.
`1111 0101₂`	Format is 'non-positive integer'. The following octets shall encode a byte-block value; the indicator is the integer encoded in the contents of the byte-block.

000d dddd₂

Format is 'non-negative integer'; indicator is binary d dddd.

0010 dddd₂

Format is 'string'; indicator is binary dddd.

0011 dddd₂

Format is 'symbol, by name'; indicator is binary dddd.

1000 dddd₂

Format is 'byte-block'; indicator is binary dddd.

1001 dddd₂

Format is 'list'; indicator is binary dddd.

1010 dddd₂

Format is 'set'; indicator is binary dddd.

1011 dddd₂

Format is 'map'; indicator is binary dddd.

1100 tttt₂

Format is given by tttt, as shown in Table 3.2.2; indicator is stored in the following octet.

1101 tttt₂

Format is given by tttt, as shown in Table 3.2.2; indicator is stored in the following 2 octets.

1111 0000₂

A padding octet. Ignore this octet and interpret the next as being the start of an encoding.

1111 0010₂

Format is given by the next octet, as shown in Table 3.2.2; indicator is stored in the following 4 octets.

1111 0011₂

Format is given by the next octet, as shown in Table 3.2.2; indicator is stored in the following 8 octets.

1111 0100₂

Format is 'non-negative integer'. The following octets shall encode a byte-block value; the indicator is the integer encoded in the payload of the byte-block.

1111 0101₂

Format is 'non-positive integer'. The following octets shall encode a byte-block value; the indicator is the integer encoded in the contents of the byte-block.

Table 3.2.2: Format codes (normative)

If the code is not one of those listed below, the encoding is invalid.

Code Format

Code	Format
`0000₂`	Non-negative integer
`0001₂`	Non-positive integer
`0011₂`	String
`0100₂`	Symbol, by name
`0101₂`	Byte-block
`1000₂`	List
`1001₂`	Set
`1010₂`	Map

0000₂

Non-negative integer

0001₂

Non-positive integer

0011₂

String

0100₂

Symbol, by name

0101₂

Byte-block

1000₂

List

1001₂

Set

1010₂

Map

Canonical format

The following rules describe how to construct the canonical encoding of a value.

Ordering

Sets and maps do not impose an ordering on their contents.

Types already describes how values of particular atomic types order relative to each other; these orderings are combined to form a total ordering of all atomic values by defining an ordering on types:

integer < symbol < string < byte-block.

If x and y are two atomic values, then x shall compare less than y if and only if either x and y have the same type, and x<y according to the rules stated in Types; or the types of x and y differ and the type of x is less than the type of y as shown above.

In order to ensure uniqueness of the canonical encoding for sets and maps:

the elements of a set shall be encoded in ascending order according to the above ordering;

and

the associations of a map shall be encoded in ascending order of their keys according to the above ordering.

This is sufficient since (a) set elements and association keys are atomic, and (b) two distinct associations within a map cannot have equal keys.

Encoding choices

The following rules specify how to select among the various encoding choices . Earlier rules take precedence over later rules.

The first octet of the encoding shall not be 240=1111 0000₂ That is, padding octets shall not appear in a canonical encoding.

The first octet of the encoding shall be the numerically least first octet of any valid encoding of the value.
The encoding chosen shall be the shortest valid encoding of the value.

Some examples may help.

The canonical encoding of the integer 0 is 00 by rule 2.
The canonical encoding of the integer 65536 is f2 00 00 01 00 00, even though f4 83 01 00 00 is shorter, because f2 < f4 and rule 2 takes precedence over rule 3.