Official binary format of the ILO objects
This is the official definition of the ILO format. It is intended to be the ILO Format Version 1.0.
What is the ILO binary format
ILO is a binary format for storing and interchanging general-purpose information. This document defines the basic ILO binary format, which is intended for defining general-purpose objects.
The core of the ILO format: the basic ILO format
The basic ILO format defines the binary structure of a simple ILO object. A simple ILO object is a collection of (label, value) pairs. Each label is a string, that is, a sequence of characters. A value can be a string, a binary object or another ILO object.
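As a rough illustration of this structure, a simple ILO object can be modelled in memory as an ordered list of (label, value) pairs. The class and type names below are hypothetical, and the sketch assumes nothing about whether labels must be unique, so it uses a list rather than a dictionary.

```python
from typing import List, Tuple, Union

# Hypothetical in-memory model: a value is a string, a binary object,
# or another (nested) ILO object, as described in the text.
Value = Union[str, bytes, "IloObject"]
Pair = Tuple[str, Value]


class IloObject:
    """A simple ILO object: a collection of (label, value) pairs."""

    def __init__(self) -> None:
        self.pairs: List[Pair] = []

    def add(self, label: str, value: Value) -> None:
        # Reject anything that is not one of the three allowed value kinds.
        if not isinstance(value, (str, bytes, IloObject)):
            raise TypeError("value must be str, bytes or IloObject")
        self.pairs.append((label, value))


# Usage: an object holding a string, a binary value and a nested object.
address = IloObject()
address.add("city", "Lima")

person = IloObject()
person.add("name", "Ana")
person.add("photo", b"\x89PNG")
person.add("address", address)
```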
The general sequence of bytes in a (label, value) pair is as follows:
| Byte 0: header | Byte 1 * | N1 label bytes | N2 size bytes | N3 value bytes * |
|---|---|---|---|---|
| Bits 0..3: label size; bits 4,5: value size; bits 6,7: value type | Size of Unicode label, 1..256 | N1 = 2 * size in characters of the label | 0: 1 byte (0..255); 1: 2 bytes (0..65535); 2: 4 bytes; 3: 8 bytes | N3 bytes of the value |
Table of pair (label, value) format
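The header byte in the table above can be unpacked with a few bit operations. The following is a minimal sketch, not a reference implementation: it assumes that bit 0 is the least significant bit of the header byte, and the names `PairHeader` and `decode_header` are hypothetical. The meaning of each value type code is not given in the table, so the sketch returns the raw two-bit number.

```python
from typing import NamedTuple

# Width in bytes of the value-size field, indexed by the 2-bit size code
# from the table: 0 -> 1 byte, 1 -> 2 bytes, 2 -> 4 bytes, 3 -> 8 bytes.
SIZE_FIELD_WIDTH = (1, 2, 4, 8)


class PairHeader(NamedTuple):  # hypothetical name
    label_size: int        # bits 0..3
    size_field_width: int  # bytes used by the size field (from bits 4,5)
    value_type: int        # bits 6,7, returned as a raw code


def decode_header(byte0: int) -> PairHeader:
    """Split header byte 0 into its three bit fields.

    ASSUMPTION: bit 0 is the least significant bit.
    """
    if not 0 <= byte0 <= 0xFF:
        raise ValueError("header must be a single byte")
    label_size = byte0 & 0x0F
    size_code = (byte0 >> 4) & 0x03
    value_type = (byte0 >> 6) & 0x03
    return PairHeader(label_size, SIZE_FIELD_WIDTH[size_code], value_type)
```

For example, under the bit-order assumption above, the header byte `0x93` (binary `10 01 0011`) carries label size 3, size code 1 (a two-byte size field) and value type 2.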
The different bytes in the sequence are as follows:
In ILO objects there are two main types of strings:
- One byte encoded strings. These strings use an encoding based on one-byte sequences. In this case the bytes of the string are parsed one by one. Each character of the string can be represented by a single byte or by a multibyte sequence. Three encodings are supported:
  - ISO-Latin-1. In this encoding, each byte stores one character from the ISO-Latin-1 character set, so the length of the string in bytes equals the number of characters stored in it.
  - UTF-8. In this case, the string begins with a zero byte, followed by a UTF-8 encoded string.
  - UTF-8Z. In this case, the string begins with a zero byte, followed by a UTF-8 encoded string, terminated by another zero byte. This is a C-style UTF-8 string beginning at the second byte of the sequence. This encoding is well suited to strings that will be loaded directly from a file on a UNIX-like machine.
- Two byte encoded strings. These strings are encoded in Unicode 3.1. Each simple Unicode character is represented by a two-byte encoded character; some Unicode 3.1 characters use a four-byte encoding. In most cases, the length of the string in bytes is twice its length in characters. The byte order can be declared with the two-byte sequence FE FF for big endian or FF FE for little endian, as in the Unicode byte order mark; if the byte order is not declared, little endian is assumed. This is in fact an arbitrary choice, but most Intel-based PCs and servers are natively little endian. Machines, like humans, have their own preferences...
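As a minimal sketch, the string encodings described above could be produced in Python as follows. The function names are hypothetical, and it is an assumption that Python's `utf-16-le` codec matches the two-byte Unicode encoding intended here. No byte order marker is written, so the little endian default applies.

```python
def encode_latin1(text: str) -> bytes:
    """ISO-Latin-1: one byte per character."""
    return text.encode("latin-1")


def encode_utf8(text: str) -> bytes:
    """UTF-8: a leading zero byte, then the UTF-8 bytes."""
    return b"\x00" + text.encode("utf-8")


def encode_utf8z(text: str) -> bytes:
    """UTF-8Z: a leading zero byte, the UTF-8 bytes, then a zero byte."""
    return b"\x00" + text.encode("utf-8") + b"\x00"


def encode_two_byte(text: str) -> bytes:
    """Two-byte Unicode, little endian, without a byte order marker.

    ASSUMPTION: Python's utf-16-le codec stands in for the format's
    two-byte encoding (two bytes per simple character, four for others).
    """
    return text.encode("utf-16-le")


def decode_two_byte_default(data: bytes) -> str:
    """Decode a two-byte string that declares no byte order."""
    return data.decode("utf-16-le")
```

With these helpers, `encode_utf8z("hi")` yields the bytes `00 68 69 00`, so the C-style string starts at the second byte, and `encode_two_byte("hi")` yields `68 00 69 00`.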
Advantages of the ILO string encoding
First of all, the main goal of the ILO binary format is to avoid problems and complexity in character encoding handling. More character sets mean more code to develop and test in any software, and data size is less important than the robustness of applications. This goal is achieved by simplifying the software design without loss of generality.
Our goal is to select only two or three character encodings for ILO objects, in such a way that the choice balances wide applicability of ILO objects against the cost in bytes per character.
- To store text in European languages, ISO-Latin-1 is the best solution (maybe extended to Windows Latin 1). This choice is well suited to most applications and text representations, such as general text and documents.
- If the text needs some extended Unicode characters, for example if it includes short strings outside the Latin-1 range, the best choice is UTF-8 encoding.
- If the text is outside the Latin-1 range (Middle Eastern, other European characters, Arabic, etc.) and we do not want to use special character sets, the best choice is the standard Unicode two-byte encoding.
- For Asian texts and other texts based on character sets larger than 8 bits, the best choice is the standard Unicode two-byte encoding.
When an ILO parser needs to select the encoding for a certain string, it is very easy to choose the final encoding needed: the parser scans the string in a single pass and determines the most efficient encoding.
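The single-pass selection could look like the sketch below. The exact rule is not specified in this document, so the one used here is an assumption: pick ISO-Latin-1 when every character fits in one byte, otherwise pick whichever of UTF-8 (including its leading zero byte) and the two-byte encoding is shorter. The function name `choose_encoding` and the returned labels are hypothetical.

```python
def choose_encoding(text: str) -> str:
    """Scan the text once and return 'latin-1', 'utf-8' or 'two-byte'.

    ASSUMED rule: Latin-1 if all characters fit in one byte, otherwise
    the shorter of UTF-8 and the two-byte Unicode encoding.
    """
    max_code_point = 0
    utf8_len = 1        # the leading zero byte of an ILO UTF-8 string
    two_byte_len = 0
    for ch in text:
        cp = ord(ch)
        max_code_point = max(max_code_point, cp)
        # Bytes needed by this character in UTF-8.
        if cp < 0x80:
            utf8_len += 1
        elif cp < 0x800:
            utf8_len += 2
        elif cp < 0x10000:
            utf8_len += 3
        else:
            utf8_len += 4
        # Bytes needed in the two-byte encoding (four beyond U+FFFF).
        two_byte_len += 2 if cp < 0x10000 else 4
    if max_code_point <= 0xFF:
        return "latin-1"
    return "utf-8" if utf8_len <= two_byte_len else "two-byte"
```

Under this rule, mostly Latin text with an occasional character outside Latin-1 ends up in UTF-8, while Arabic or Asian text ends up in the two-byte encoding, which matches the guidelines listed above.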