Wire format

This section specifies an encoding for D3S values as octet sequences.

Each D3S value may have many distinct encodings. There is a unique canonical encoding of each D3S value.

The flexibility permitted by the existence of multiple encodings permits efficient implementations. The canonical encoding is useful in specific circumstances, such as where the encodings are to be hashed, but determining the canonical encoding of a value may be significantly more computationally expensive than merely constructing an arbitrary encoding.

Wire format specification

This section describes the wire format encoding in terms of how to decode it If an encoding decodes to a particular value then that it is an encoding of that value.

Encoding are self-delimiting, i.e., it’s possible to determine where an encoding ends. This fact makes it possible to concatenate encodings without causing ambiguity.

Raw integer format

Non-negative integers are used throughout the encoding, to represent integer values, symbols, and to signify lengths. Such integers are always represented radix-256, most significant octet first.

Encoding structure

An encoding of a value consists of three logical parts:

  • a format code, which describes how to interpret the remaining octets of the encoding as a value;

  • an indicator, which is a non-negative integer; and

  • a payload, whose length and nature is determined by the format and indicator.

In order to save space, the format and indicator can be packed into a single octet.

In the following descriptions, d denotes the indicator value. The following formats are used.

Non-negative integer

The result shall simply be the integer d. The payload shall be empty.

Non-positive integer

The result shall be the integer -d.

It is possible that d=0.

The payload shall be empty.

String

The format and indicator shall be followed by a further d-octet payload containing a Unicode string encoded using UTF-8 [Con07, 3.9]. If the payload octets are ill-formed UTF-8, the encoding is invalid. The result shall be a string whose contents are the Unicode string encoded in the payload.

Symbol, by name

The format and indicator shall be followed by a further d-octet payload containing a Unicode string encoding using UTF-8. If the payload octets are not valid UTF-8, the encoding is invalid. The result shall be the symbol whose name is the Unicode string encoded in the payload.

Byte-block

The format and indicator shall be followed by a further d-octet payload. The result shall be a byte-block whose contents are precisely the payload.

List

The format and indicator shall be followed by a payload consisting of the concatenation of d further value encodings. The result shall be a list whose elements are the values encoded in the payload, in order.

Set

The format and indicator shall be followed by a payload consisting of the concatenation of d further value encodings. The result shall be a set whose elements are the values encoded in the payload.

Map

The format and indicator shall be followed by a payload consisting of the concatenation of 2d further value encodings. The result shall be a map whose associations are constructed by taking the values encoded in the payload in pairs, first key, then value.

Specification

An encoding cannot be empty. The first octet of the encoding describes how to interpret the remaining octets. Table 3.2.1 describes how to determine the format and indicator from this first octet. No other octet begins a valid encoding. Implementations shall not extend the wire format. An encoder shall not construct invalid encodings; a decoder shall reject invalid encodings.

Table 3.2.1: Encodings by the first octet (normative)

First octet Interpretation

000d dddd2     

Format is 'non-negative integer'; indicator is binary d dddd.

0010 dddd2

Format is 'string'; indicator is binary dddd.

0011 dddd2

Format is 'symbol, by name'; indicator is binary dddd.

1000 dddd2

Format is 'byte-block'; indicator is binary dddd.

1001 dddd2

Format is 'list'; indicator is binary dddd.

1010 dddd2

Format is 'set'; indicator is binary dddd.

1011 dddd2

Format is 'map'; indicator is binary dddd.

1100 tttt2

Format is given by tttt, as shown in Table 3.2.2; indicator is stored in the following octet.

1101 tttt2

Format is given by tttt, as shown in Table 3.2.2; indicator is stored in the following 2 octets.

1111 00002

A padding octet. Ignore this octet and interpret the next as being the start of an encoding.

1111 00102

Format is given by the next octet, as shown in Table 3.2.2; indicator is stored in the following 4 octets.

1111 00112

Format is given by the next octet, as shown in Table 3.2.2; indicator is stored in the following 8 octets.

1111 01002

Format is 'non-negative integer'. The following octets shall encode a byte-block value; the indicator is the integer encoded in the payload of the byte-block.

1111 01012

Format is 'non-positive integer'. The following octets shall encode a byte-block value; the indicator is the integer encoded in the contents of the byte-block.

Table 3.2.2: Format codes (normative)

If the code is not one of those listed below, the encoding is invalid.

Code Format

00002

Non-negative integer

00012

Non-positive integer

00102

String

01002

Symbol, by name

01012

Byte-block

10002

List

10012

Set

10102

Map

Canonical format

The following rules describe how to construct the canonical encoding of a value.

Ordering

Sets and maps do not impose an ordering on their contents.

Types already describes how values of particular atomic types order relative to each other; these orderings are combined to form a total ordering of all atomic values by defining an ordering on types:

integer < symbol < string < byte-block.

If x and y are two atomic values, then x shall compare less than y if and only if either x and y have the same type, and x<y according to the rules stated in Types; or the types of x and y differ and the type of x is less than the type of y as shown above.

In order to ensure uniqueness of the canonical encoding for sets and maps:

  • the elements of a set shall be encoded in ascending order according to the above ordering;

and

  • the associations of a map shall be encoded in ascending order of their keys according to the above ordering.

This is sufficient since (a) set elements and association keys are atomic, and (b) two distinct associations within a map cannot have equal keys.

Encoding choices

The following rules specify how to select among the various encoding choices . Earlier rules take precedence over later rules.

  1. The first octet of the encoding shall not be 240=1111 00002 That is, padding octets shall not appear in a canonical encoding.

  1. The first octet of the encoding shall be the numerically least first octet of any valid encoding of the value.

  2. The encoding chosen shall be the shortest valid encoding of the value.

Some examples may help.

  • The canonical encoding of the integer 0 is 00 by rule 2.

  • The canonical encoding of the integer 65536 is f2 00 00 01 00 00, even though f4 83 01 00 00 is shorter, because f2 < f4 and rule 2 takes precedence over rule 3.