Official binary format of the ILO objects

This is the official definition of the ILO format. It is intended to be ILO Format Version 1.0.

What is the ILO binary format

ILO is a binary format for storing and interchanging general-purpose information. This document defines the basic ILO binary format, which is intended for defining general-purpose objects.

The core of the ILO format: the basic ILO format

The basic ILO format defines the binary structure of a simple ILO object. A simple ILO object is a collection of (label, value) pairs. Each label is a string, that is, a sequence of characters. A value can be a string, a binary object or another ILO object.

The general sequence of bytes in a (label, value) pair is as follows:

Field               Contents
------------------  ------------------------------------------
Byte 0              Header
                      Bits 0..3: label size
                      Bits 4,5:  value size
                      Bits 6,7:  value type
Byte 1 *            Size of Unicode label, 1..256
N1 label bytes      N2 = 2 * size of characters of the label
N2 size bytes       0: 1 byte (0..255)
                    1: 2 bytes (0..65535)
                    2: 4 bytes size
                    3: 8 bytes size
N3 value bytes *    N2 bytes of the value

Table of pair (label, value) format

The different bytes in the sequence are as follows:

  • Byte 0: Header byte. Defines the type and size of the label, and the type and size of the value.
  • Byte 1: Unicode string label size. When bits 0..3 are 0, a label size byte is needed. As Unicode characters are encoded in a two-byte sequence, the length in bytes of a Unicode string is twice its length in characters (provided you don't use the upper extended Unicode 3.1 characters).

Note the following:

  • The bytes marked with * are optional. In practice, they will be present most of the time.
  • The minimum byte sequence is three bytes long: for example, a 1-byte ASCII label with no data.
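
As an illustration, the header byte can be unpacked with a few shifts and masks. This is a minimal sketch, not part of the specification: it assumes that bit 0 is the least significant bit of the byte, and the names parse_header and SIZE_FIELD_BYTES are hypothetical. The meaning of the individual value type codes is not defined in this section, so the code is returned as a raw number.

```python
# Number of bytes used by the value size field, indexed by the
# 2-bit value size code (0: 1 byte, 1: 2 bytes, 2: 4 bytes, 3: 8 bytes).
SIZE_FIELD_BYTES = {0: 1, 1: 2, 2: 4, 3: 8}


def parse_header(header: int) -> dict:
    """Split a header byte into its three bit fields.

    Assumption: bit 0 is the least significant bit of the byte.
    """
    if not 0 <= header <= 0xFF:
        raise ValueError("header must fit in one byte")
    label_size = header & 0x0F              # bits 0..3
    value_size_code = (header >> 4) & 0x03  # bits 4,5
    value_type = (header >> 6) & 0x03       # bits 6,7
    return {
        "label_size": label_size,
        # When the label size field is 0, a label size byte follows.
        "needs_label_size_byte": label_size == 0,
        "value_size_code": value_size_code,
        "size_field_bytes": SIZE_FIELD_BYTES[value_size_code],
        "value_type": value_type,
    }
```

For example, parse_header(0x93) reports a label size of 3, a value size code of 1 (a 2-byte size field) and a value type of 2.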

In ILO objects there are two main types of strings:

  • One-byte encoded strings. These strings use an encoding based on one-byte sequences, so the bytes of the string are parsed one by one. Each character of the string can be represented by a single byte or by a multibyte sequence. Three encodings are supported:
    • ISO-Latin-1. Each byte stores one character from the ISO-Latin-1 character set, so the length of the string in bytes equals the number of characters stored in it.
    • UTF-8. The string begins with a zero byte, followed by a UTF-8 encoded string.
    • UTF-8Z. The string begins with a zero byte, followed by a UTF-8 encoded string, terminated by another zero byte. This is a C-style UTF-8 string beginning at the second byte of the sequence. This encoding is well suited to storing strings that will be loaded directly from a file on a UNIX-like machine.
  • Two-byte encoded strings. These strings are encoded as Unicode 3.1 strings. Each simple Unicode character is represented by a two-byte encoded character; some Unicode 3.1 characters use a 4-byte encoding. In most cases, the length in bytes of the Unicode string will be twice its length in characters. The byte order can be declared with a leading two-byte sequence: FE FF for big endian and FF FE for little endian. If the byte order is not declared, little endian is assumed. This is an arbitrary choice, but most Intel-based PCs and servers handle little endian natively. UTF-8 is byte-oriented and is not affected by this choice. Machines, like humans, have their own preferences...
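
The three one-byte encodings can be told apart by inspecting the first and last bytes. The following is a hedged sketch, not a normative algorithm: the function name decode_one_byte_string is hypothetical, and it assumes that a Latin-1 string never starts with a zero byte and that a zero-prefixed string that also ends with a zero byte is meant as UTF-8Z.

```python
def decode_one_byte_string(raw: bytes) -> tuple:
    """Return (encoding name, decoded text) for a one-byte encoded string.

    Assumptions (not stated by the format description):
    - a Latin-1 string never starts with a zero byte;
    - a zero-prefixed string that also ends with a zero byte is UTF-8Z.
    """
    if not raw.startswith(b"\x00"):
        # No zero prefix: every byte is one ISO-Latin-1 character.
        return "ISO-Latin-1", raw.decode("latin-1")
    body = raw[1:]  # skip the leading zero byte
    if body.endswith(b"\x00"):
        # C-style string: drop the terminating zero byte.
        return "UTF-8Z", body[:-1].decode("utf-8")
    return "UTF-8", body.decode("utf-8")
```

With these assumptions, the bytes 63 61 66 E9 decode as the Latin-1 string "café", while 00 63 61 66 C3 A9 decodes as the same text stored in UTF-8.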

Advantages of the ILO string encoding

First of all, the main goal of the ILO binary format is to avoid the problems and complexity of character encoding handling. More character sets mean more code to develop and test in any software, and data size is less important than the robustness of applications. This is achieved by simplifying the software design without loss of generality.

Our goal is to select only two or three character encodings for ILO objects, in such a way that the choice balances wide applicability of ILO objects against the cost in bytes per character.

  • To store texts in European languages, ISO-Latin-1 is the best solution (maybe extended to Windows Latin 1). This choice is well suited to most applications and text representations, such as general text and documents.
  • If the text needs some extended Unicode characters, for example if it includes short strings outside the Latin-1 range, the best choice is UTF-8 encoding.
  • If the text is outside the Latin-1 range (Middle Eastern, other European characters, Arabic, etc.) and we don't want to use special character sets, then the best choice is the standard Unicode two-byte encoding.
  • For Asian texts and any other texts based on character sets larger than the 8-bit ones, the best choice is the standard Unicode two-byte encoding.

When an ILO parser needs to select the encoding for a certain string, it is very easy to choose the final encoding. The parser scans the string in a single pass and determines the most efficient encoding.
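
The one-pass selection could look like the sketch below. It is only an illustration under stated assumptions: the function name choose_encoding is hypothetical, "most efficient" is taken to mean the fewest bytes, the zero-byte prefix of the UTF-8 form is counted, and ties go to UTF-8. An actual ILO parser may apply different rules.

```python
def choose_encoding(text: str) -> str:
    """Pick an encoding for text in a single pass over its characters.

    Assumption: "most efficient" means the smallest encoded size.
    """
    latin1_ok = True
    utf8_len = 1   # the UTF-8 form starts with a zero byte
    utf16_len = 0  # two-byte encoding, without a byte order mark
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            latin1_ok = False
        # Bytes needed by UTF-8 for this code point.
        if cp < 0x80:
            utf8_len += 1
        elif cp < 0x800:
            utf8_len += 2
        elif cp < 0x10000:
            utf8_len += 3
        else:
            utf8_len += 4
        # Two bytes per character, four for the upper extended range.
        utf16_len += 2 if cp < 0x10000 else 4
    if latin1_ok:
        return "ISO-Latin-1"
    return "UTF-8" if utf8_len <= utf16_len else "two-byte Unicode"
```

Under these assumptions, a mostly ASCII text with a single euro sign is stored as UTF-8, while a text made only of Japanese characters is stored in the two-byte encoding.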