The grumpy serialisation format


Every few years we reinvent data serialisation. CDR (CORBA) replaced EDI and was in turn replaced by XML; now JSON is clearly the solution, while ASN.1 still lurks in the shadows. And then it struck me: I’m well into my fifties and have not invented my own serialisation format. Clearly my life is not complete.

So I decided to fix this and define a new serialisation format. I want it to have the following characteristics:

  • Binary file format, with minimal overhead
  • Efficient string representation – interchange valid UTF-8 strings can be mapped directly from the file, including the final null byte
  • Efficient (1 byte) representation of null, true and false.
  • Easy conversion to and from JSON format
  • File representation is structurally valid UTF-8
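
Two of these requirements can be checked mechanically. The sketch below is only an illustration: the sample bytes are a hand-built object that uses the control bytes proposed further down, and `string_at_offset` is a hypothetical helper, not part of any library. It shows that such data decodes as UTF-8 without error, and that a string can be read in place as a C string because its terminating NUL byte is already there.

```python
import ctypes
import unicodedata

# Hand-built sample: an object with one key "colour" and the value "puce".
sample = b"\x01\x02colour\x00\x1f\x02puce\x00\x04"

# Structurally valid UTF-8: decoding raises no error, although the text
# contains control characters (Unicode general category "Cc").
text = sample.decode("utf-8")
control_count = sum(1 for ch in text if unicodedata.category(ch) == "Cc")

# Copy the bytes into a C buffer to stand in for a memory-mapped file.
buffer = ctypes.create_string_buffer(sample)


def string_at_offset(offset):
    """Read the NUL-terminated C string that starts at the given offset."""
    return ctypes.string_at(ctypes.addressof(buffer) + offset)


key = string_at_offset(2)     # just after the first string start byte
value = string_at_offset(11)  # just after the second string start byte
print(key, value, control_count)
```

No copy or unescaping step is needed to obtain `key` and `value`; a C program could hand those pointers straight to any function that expects a NUL-terminated string.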

The underlying idea is to use the unused C0 (possibly also C1) control characters to represent structure. Basically, the difference between structurally valid and interchange valid UTF-8 is used to encode the serialisation control. After a bit of fiddling around, I propose the following.

| Type | Start | End | Content |
|---|---|---|---|
| Null | 0x00 (NUL) | | |
| Boolean | 0x0E (SO, true) or 0x0F (SI, false) | | |
| String | 0x02 (STX) | 0x00 (NUL) | Interchange valid text |
| Number | 0x11 (DC1, +) or 0x13 (DC3, -) | 0x10 (DLE) | Number in integer (12345), floating point (1234.5) or exponential (1.2345e+18) notation. The sign is given by the start character. The number zero can be introduced with either a plus or a minus sign. |
| Array | 0x1C (FS) | 0x17 (ETB) | Possibly empty sequence of values, separated by 0x1E (RS) |
| Object | 0x01 (SOH) | 0x04 (EOT) | Possibly empty sequence of key/value pairs. Each pair is a key (string), 0x1F (US), then the value. Pairs are separated by 0x1E (RS) |
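
To make the rules above concrete, here is a minimal encoder sketch in Python. It is not a reference implementation: the function name `encode` is made up for this example, number text is borrowed from `json.dumps`, and the sketch assumes that strings must not contain a NUL character (it does not check the other interchange validity rules) and that NaN and infinities are rejected.

```python
import json
import math

# Control bytes taken from the proposal above.
NUL, SOH, STX, EOT = b"\x00", b"\x01", b"\x02", b"\x04"
TRUE, FALSE = b"\x0e", b"\x0f"
DLE, PLUS, MINUS = b"\x10", b"\x11", b"\x13"
ETB, FS, RS, US = b"\x17", b"\x1c", b"\x1e", b"\x1f"


def encode(value):
    """Serialise a JSON-compatible Python value to bytes."""
    if value is None:
        return NUL
    if value is True:
        return TRUE
    if value is False:
        return FALSE
    if isinstance(value, float):
        if not math.isfinite(value):
            raise ValueError("NaN and infinities have no representation")
        # copysign also detects the sign of -0.0.
        negative = math.copysign(1.0, value) < 0
    elif isinstance(value, int):
        negative = value < 0
    if isinstance(value, (int, float)):
        # The sign lives in the start byte, so only the magnitude is written.
        digits = json.dumps(abs(value)).encode("ascii")
        return (MINUS if negative else PLUS) + digits + DLE
    if isinstance(value, str):
        if "\x00" in value:
            raise ValueError("NUL would end the string early")
        return STX + value.encode("utf-8") + NUL
    if isinstance(value, (list, tuple)):
        return FS + RS.join(encode(item) for item in value) + ETB
    if isinstance(value, dict):
        pairs = (encode(str(k)) + US + encode(v) for k, v in value.items())
        return SOH + RS.join(pairs) + EOT
    raise TypeError(f"cannot serialise {type(value).__name__}")


encoded = encode({"value": 123.4, "colour": "puce"})
print(len(encoded), encoded)
```

The booleans are tested before the numbers because Python treats `bool` as a subclass of `int`; without that ordering, `True` would be written as the number 1.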

For instance, take the following JSON:

{
  "value": 123.4,
  "colour": "puce"
}

This is serialised as follows (33 bytes; control characters are shown as their Unicode control pictures).

␁␂value␀␟␑123.4␐␞␂colour␀␟␂puce␀␄
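
Reading the data back is a small recursive descent over the start bytes. The sketch below is one possible decoder, written under the same assumptions as the type table; the names `decode`, `_parse`, `_parse_sequence` and `_byte_at` are invented for this example. Because the format does not record whether a number is an integer or a float, the decoder guesses from the text: a number containing `.` or an exponent becomes a `float`, anything else an `int`.

```python
def decode(data):
    """Parse one serialised value and return it as a Python object."""
    value, pos = _parse(data, 0)
    if pos != len(data):
        raise ValueError("trailing bytes after the value")
    return value


def _byte_at(data, pos):
    if pos >= len(data):
        raise ValueError("unexpected end of data")
    return data[pos]


def _parse(data, pos):
    start = _byte_at(data, pos)
    pos += 1
    if start == 0x00:
        return None, pos
    if start == 0x0E:
        return True, pos
    if start == 0x0F:
        return False, pos
    if start == 0x02:  # string, ends at the next NUL
        end = data.index(b"\x00", pos)
        return data[pos:end].decode("utf-8"), end + 1
    if start in (0x11, 0x13):  # number, ends at DLE
        end = data.index(b"\x10", pos)
        digits = data[pos:end].decode("ascii")
        # Weak typing: the format does not say int or float, so guess.
        number = float(digits) if any(c in digits for c in ".eE") else int(digits)
        return (-number if start == 0x13 else number), end + 1
    if start == 0x1C:  # array, ends at ETB
        return _parse_sequence(data, pos, 0x17, keyed=False)
    if start == 0x01:  # object, ends at EOT
        return _parse_sequence(data, pos, 0x04, keyed=True)
    raise ValueError(f"unexpected byte 0x{start:02X} at offset {pos - 1}")


def _parse_sequence(data, pos, end_byte, keyed):
    result = {} if keyed else []
    if _byte_at(data, pos) == end_byte:  # empty array or object
        return result, pos + 1
    while True:
        item, pos = _parse(data, pos)
        if keyed:
            if not isinstance(item, str) or _byte_at(data, pos) != 0x1F:
                raise ValueError("expected a string key followed by US")
            result[item], pos = _parse(data, pos + 1)
        else:
            result.append(item)
        separator = _byte_at(data, pos)
        pos += 1
        if separator == end_byte:
            return result, pos
        if separator != 0x1E:
            raise ValueError("expected RS or the end byte")


# The example object, built byte by byte from the type table.
example = (
    b"\x01"
    b"\x02value\x00" b"\x1f" b"\x11" b"123.4" b"\x10"
    b"\x1e"
    b"\x02colour\x00" b"\x1f" b"\x02puce\x00"
    b"\x04"
)
decoded = decode(example)
print(decoded)
```

An empty array or object is recognised by looking one byte ahead, which is safe because neither end byte is a valid start byte for a value.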

The format ended up being somewhat similar to Binary JSON. The main difference is the weak typing: the encoding does not keep track of the type of numbers, and there is no type specification before the true and false values, as these are only used for booleans.

An interesting aspect of this format is that it would also work for UTF-16 data, since the same control characters exist there as 16-bit code units.
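
One way to read this claim, sketched here as an assumption rather than as a worked-out design: the control characters are ordinary code points, so the same sequence of characters can be written as UTF-16 code units instead of UTF-8 bytes. The small Python check below uses a hand-built object and little-endian UTF-16 without a byte order mark, both of which are choices made for the example.

```python
# The object {"colour": "puce"} as a sequence of code points.
document = "\x01\x02colour\x00\x1f\x02puce\x00\x04"

utf8_bytes = document.encode("utf-8")
utf16_bytes = document.encode("utf-16-le")  # little-endian, no byte order mark

# Both encodings carry exactly the same characters.
same_text = utf8_bytes.decode("utf-8") == utf16_bytes.decode("utf-16-le")

# Every character here is in the ASCII range: one byte each in UTF-8,
# one 16-bit code unit (two bytes) each in UTF-16.
sizes = (len(utf8_bytes), len(utf16_bytes))
print(same_text, sizes)
```

In the UTF-16 form a string would end with a 16-bit zero code unit instead of a single NUL byte, which matches the usual convention for NUL-terminated UTF-16 strings.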

JSON Logo – Public Domain
