BJSON

This is the Binary-JSON (BJSON) format specification, draft version 0.5.
The BJSON spec can always be found at bjson.org.

Definition

BJSON is a binary form of JSON.

Why

Textual JSON is widely used for data storage, serialization and streaming.
It is extremely easy for developers to use; however, the textual form has some disadvantages in production environments.

There is a need for a non-textual form of JSON data which:
- is more compact than JSON (may be shorter in byte representation),
- is easy to parse,
- is easy to implement in most current environments,
- is easy to implement partially when needed (for a specific use),
- can be easily traversed without parsing all data (e.g. skipping some entries),
- supports all common data types natively - primitive and structured,
- supports all features of JSON without additional data or coding,
- supports path-like addressing of data,
- can be easily transcoded to textual JSON and back (only if not extended),
- can be easily embedded into common transports: files, databases, MPEG streams, etc.,
- is easy to extend at a context-specific or even private level.

Don't reinvent the wheel

Developers of BJSON, as well as developers using BJSON, should understand existing standards, protocols and data formats, so as not to reinvent the wheel and to choose the best-fitting technology.

In particular, have some understanding of:
- BSON at http://bsonspec.org - all its pros and cons,
- ASN.1 (BER, DER, etc.) and all Abstract Syntax Notation One related work,
- Protocol Buffers from Google,
- Thrift from Apache, especially its protocols (TBinaryProtocol, TCompactProtocol, etc.),
- OGDL and its binary representation,
- XML,
- YAML,
- SmileFormat - check SmileFormatSpec.

The format

Numbers are little-endian by default.
Size fields contain a number of bytes (not a number of items).

primitive values:

There are "zero" values, each one byte in size:
0 - null
1 - numeric zero, or boolean false
2 - empty string
3 - boolean true (may also be a numeric one)

Comments:
The "numeric zero" and "boolean false" are the same thing in many languages (like C).
The "numeric one" and "boolean true" are the same thing in many languages (like C).

Therefore:
- when encoding: if possible, use "strict primitives" (data types 24..27); note that in some languages/constructs this is not an option,
- when decoding: if the language or implementation does not distinguish between the two, either interpretation is fine. But if a decision HAS to be made, integer is preferred over boolean.

positive_integer:

4, uint8
5, uint16
6, uint32
7, uint64

negative_integer:

These are stored in positive form, i.e. as the absolute value (not two's complement!), to allow easier "manual" processing.
8, uint8
9, uint16
10, uint32
11, uint64
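
The two integer tables above can be sketched in Python as follows. This is only an illustration under stated assumptions: the name encode_int is hypothetical and not part of the spec, and the sketch deliberately ignores the one-byte forms for zero and one, so it always emits a sized type.

```python
import struct

# (largest magnitude, struct format) for uint8, uint16, uint32 and uint64,
# all little-endian as the spec requires.
_WIDTHS = [
    (0xFF, "<B"),
    (0xFFFF, "<H"),
    (0xFFFFFFFF, "<I"),
    (0xFFFFFFFFFFFFFFFF, "<Q"),
]


def encode_int(value: int) -> bytes:
    """Encode value with the narrowest sized integer type (4..7 or 8..11).

    Hypothetical helper: it does not use the one-byte forms for 0 and 1.
    """
    base = 8 if value < 0 else 4   # 8..11 are negative, 4..7 are positive
    magnitude = abs(value)         # negatives store the absolute value
    for index, (limit, fmt) in enumerate(_WIDTHS):
        if magnitude <= limit:
            return bytes([base + index]) + struct.pack(fmt, magnitude)
    raise OverflowError("magnitude does not fit into uint64")
```

For example, encode_int(-300) yields the bytes 09 2C 01: type 9, followed by the magnitude 300 as a little-endian uint16.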

float:

12, 32bit float - obsolete; it was "32bit float" in version 0.4, but now it is illegal
13, 64bit float (double) - obsolete; it was "64bit float" in version 0.4, but now it is illegal
14, 32bit float
15, 64bit float (double)
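
A minimal Python sketch of the float types follows. It assumes the 32bit and 64bit floats are IEEE 754 single and double precision values, which this document does not state explicitly; the function names are hypothetical.

```python
import struct


def encode_float(value: float, double: bool = True) -> bytes:
    """Encode as type 15 (64bit) by default, or as type 14 (32bit)."""
    if double:
        return bytes([15]) + struct.pack("<d", value)
    return bytes([14]) + struct.pack("<f", value)


def decode_float(data: bytes) -> float:
    """Decode a type 14 or 15 value; the obsolete types 12 and 13 are rejected."""
    code = data[0]
    if code == 14:
        return struct.unpack_from("<f", data, 1)[0]
    if code == 15:
        return struct.unpack_from("<d", data, 1)[0]
    if code in (12, 13):
        raise ValueError("obsolete float type: %d" % code)
    raise ValueError("not a float type: %d" % code)
```

Defaulting to the 64bit form avoids silent precision loss; an encoder that knows a value is exactly representable in 32 bits could pass double=False to save four bytes.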

utf8_string:

The default encoding is UTF-8.
The string MUST NOT have a null-termination code.
The string cannot contain any "zero" bytes, so that null-terminated handling never ends the string before its real length - this is really important for the ease of low-level C implementations in embedded devices.

16, size[uint8], utf8_data[size*byte] - a short string, up to 255 bytes
17, size[uint16], utf8_data[size*byte] - a string of up to 64K bytes
18, size[uint32], utf8_data[size*byte] - a long string, 64K to 4GB
19, size[uint64], utf8_data[size*byte] - a very long string, which probably won't even be used for now
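
The string rules above can be sketched as a small Python encoder. The name encode_string is hypothetical; the sketch picks the narrowest size field, uses the one-byte "empty string" value (type 2) for empty input, and rejects zero bytes.

```python
import struct

# (largest size, struct format) for the uint8, uint16, uint32 and uint64
# size fields, all little-endian.
_SIZE_FORMATS = [
    (0xFF, "<B"),
    (0xFFFF, "<H"),
    (0xFFFFFFFF, "<I"),
    (0xFFFFFFFFFFFFFFFF, "<Q"),
]


def encode_string(text: str) -> bytes:
    """Encode text as type 2 (empty) or as the narrowest of types 16..19."""
    if text == "":
        return bytes([2])
    payload = text.encode("utf-8")
    if b"\x00" in payload:
        raise ValueError("BJSON strings must not contain zero bytes")
    for index, (limit, fmt) in enumerate(_SIZE_FORMATS):
        if len(payload) <= limit:
            header = bytes([16 + index]) + struct.pack(fmt, len(payload))
            return header + payload
    raise OverflowError("string does not fit into a uint64 size field")
```

Note that the size is the length of the UTF-8 payload in bytes, not the number of characters: encode_string("é") produces a size of 2.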

binary:

Binary data of the specified length.
This is not fully transcodable to JSON, as JSON has no native support for binary data.

20, size[uint8], binary_data[size*byte]
21, size[uint16], binary_data[size*byte]
22, size[uint32], binary_data[size*byte]
23, size[uint64], binary_data[size*byte]

strict primitives:

24 - boolean false
25 - boolean true
26 - integer zero
27 - integer one
Strict primitives should be:
- used when the implementation (language) supports them,
- always implemented by the decoder (even if decoding will lose the type),
- implemented by the encoder if possible.
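
As an illustration, the one-byte values (the "zero" values 0..3 and the strict primitives 24..27) might be handled in Python like this. Python distinguishes booleans from integers, so the encoder can always emit strict primitives; the decoder maps the ambiguous types 1 and 3 to integers, following the "integer is preferred over boolean" rule. Both function names are hypothetical.

```python
def encode_primitive(value) -> bytes:
    """Encode None, a bool, or the integers 0 and 1 as a strict one-byte value."""
    if value is None:
        return bytes([0])
    if value is False:
        return bytes([24])
    if value is True:
        return bytes([25])
    # type() check keeps bools out: in Python, bool is a subclass of int.
    if type(value) is int and value in (0, 1):
        return bytes([26 + value])
    raise ValueError("not a one-byte primitive: %r" % (value,))


# Type code -> decoded Python value for every one-byte type.
_ONE_BYTE_VALUES = {
    0: None,
    1: 0,        # numeric zero or boolean false: integer preferred
    2: "",
    3: 1,        # boolean true or numeric one: integer preferred
    24: False,
    25: True,
    26: 0,
    27: 1,
}


def decode_primitive(code: int):
    """Decode a one-byte type code into None, a bool, an int, or ""."""
    if code not in _ONE_BYTE_VALUES:
        raise ValueError("not a one-byte type: %d" % code)
    return _ONE_BYTE_VALUES[code]
```

A decoder written in a language without a separate boolean type would simply map 24 and 26 to the same value, and likewise 25 and 27.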

array:

Represented in JSON as an array [item0, item1, item2, ...]

32, size[uint8], item0, item1, item2, ...
33, size[uint16], item0, item1, item2, ...
34, size[uint32], item0, item1, item2, ...
35, size[uint64], item0, item1, item2, ...
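
A minimal sketch of array encoding in Python follows. It assumes, based on the rule that size fields contain a number of bytes, that the array size is the total byte length of all encoded items; encode_array is a hypothetical helper that takes items which are already BJSON-encoded.

```python
import struct

# (largest size, struct format) for the uint8, uint16, uint32 and uint64
# size fields, all little-endian.
_SIZE_FORMATS = [
    (0xFF, "<B"),
    (0xFFFF, "<H"),
    (0xFFFFFFFF, "<I"),
    (0xFFFFFFFFFFFFFFFF, "<Q"),
]


def encode_array(encoded_items) -> bytes:
    """Wrap already-encoded items in the narrowest array type (32..35)."""
    payload = b"".join(encoded_items)
    for index, (limit, fmt) in enumerate(_SIZE_FORMATS):
        if len(payload) <= limit:
            header = bytes([32 + index]) + struct.pack(fmt, len(payload))
            return header + payload
    raise OverflowError("array does not fit into a uint64 size field")
```

For example, the JSON array [1, 2] with items encoded as [4, 1] and [4, 2] becomes the bytes 32, 4, 4, 1, 4, 2: the size field holds 4 (bytes), not 2 (items). A byte size lets a reader skip the whole array without looking inside it.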

map of key -> value:

Represented in JSON as an object {key0:value0, key1:value1, ...}

For JSON compatibility, keys shall be utf8_string.
However, an implementation may ignore that (use any other type as keys, even mixing types) if JSON compatibility is not a requirement.

Keys should be unique.

36, size[uint8], key0, value0, key1, value1, ...
37, size[uint16], key0, value0, key1, value1, ...
38, size[uint32], key0, value0, key1, value1, ...
39, size[uint64], key0, value0, key1, value1, ...
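
Because every size field counts bytes, a reader can look up one key in a map and skip everything else without decoding it. The Python sketch below illustrates this; value_end and find_value are hypothetical names, and the map size is assumed to be the total byte length of all encoded keys and values.

```python
import struct

_SIZE_FORMATS = ("<B", "<H", "<I", "<Q")   # uint8, uint16, uint32, uint64


def value_end(buf: bytes, pos: int) -> int:
    """Return the offset just past the value starting at pos, without decoding it."""
    code = buf[pos]
    if code in (0, 1, 2, 3, 24, 25, 26, 27):      # one-byte values
        return pos + 1
    if 4 <= code <= 11:                           # integers: 1, 2, 4 or 8 bytes
        return pos + 1 + (1 << (code % 4))
    if code in (14, 15):                          # 32bit and 64bit floats
        return pos + 1 + (4 if code == 14 else 8)
    if 16 <= code <= 23 or 32 <= code <= 39:      # size-prefixed types
        fmt = _SIZE_FORMATS[code % 4]
        (size,) = struct.unpack_from(fmt, buf, pos + 1)
        return pos + 1 + struct.calcsize(fmt) + size
    raise ValueError("unknown type code: %d" % code)


def _string_payload(buf: bytes, start: int, end: int):
    """Return the UTF-8 bytes of a string value, or None for non-string types."""
    code = buf[start]
    if code == 2:                                 # one-byte empty string
        return b""
    if 16 <= code <= 19:
        header = 1 + struct.calcsize(_SIZE_FORMATS[code % 4])
        return buf[start + header:end]
    return None


def find_value(buf: bytes, key: str, pos: int = 0):
    """Return (start, end) offsets of the value stored under key, or None."""
    code = buf[pos]
    if not 36 <= code <= 39:
        raise ValueError("not a map")
    fmt = _SIZE_FORMATS[code % 4]
    (size,) = struct.unpack_from(fmt, buf, pos + 1)
    cursor = pos + 1 + struct.calcsize(fmt)
    end = cursor + size
    wanted = key.encode("utf-8")
    while cursor < end:
        key_end = value_end(buf, cursor)          # skip over the key
        value_stop = value_end(buf, key_end)      # skip over the value
        if _string_payload(buf, cursor, key_end) == wanted:
            return key_end, value_stop
        cursor = value_stop
    return None
```

Keys are compared by their decoded payload rather than by their raw bytes, so a key written with a wider size field than necessary is still found.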

Encoding and decoding

Avoid ambiguity - compact the data

Introduction:
Some values can be encoded with more than one binary representation. For example, the value 2 could be encoded as [4, 2] or [5, 2, 0].

Normative:
- the encoder SHOULD select the best (shortest) form of encoding.
Example #1: the integer 2 SHOULD be encoded as [4, 2] - the shortest possible form, but [5, 2, 0] is possible in non-perfect implementations.
Example #2: the integer 0 SHOULD be encoded as either [26] (integer zero) or [1] (numeric zero), but [4, 0] (positive zero) or [8, 0] (negative zero) are possible in non-perfect implementations.
- the decoder MUST support such ambiguities, for reliability of implementation.
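
The decoder side of this rule can be sketched in Python as follows. The hypothetical decode_int accepts every integer form defined by the type tables (types 4..7 positive, 8..11 negative, plus the one-byte zero and one values), so that non-shortest encodings decode to the same value as the shortest one. Treating type 3 as the integer 1 follows the "integer is preferred over boolean" rule.

```python
import struct

_FORMATS = ("<B", "<H", "<I", "<Q")   # uint8, uint16, uint32, uint64


def decode_int(data: bytes) -> int:
    """Decode any integer form, whether or not it is the shortest one."""
    code = data[0]
    if code in (1, 26):               # numeric zero, strict integer zero
        return 0
    if code in (3, 27):               # numeric one, strict integer one
        return 1
    if 4 <= code <= 11:
        (magnitude,) = struct.unpack_from(_FORMATS[code % 4], data, 1)
        # Types 8..11 store the absolute value of a negative number.
        return -magnitude if code >= 8 else magnitude
    raise ValueError("not an integer type: %d" % code)
```

With this sketch, [4, 2], [5, 2, 0] and [6, 2, 0, 0, 0] all decode to 2, and [26], [1], [4, 0] and [8, 0] all decode to 0.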

Legal, authors etc.

You can copy, modify, distribute and perform the work, even for commercial purposes, all without asking permission.

Authors

Feel free to contribute.
This document is maintained by Pietrzak Roman (yosh.ke.mu) and Sylwester Wysocki.