Old School Record Format

[Figure: Old School Record Format Structure. A header row (Header1, Header2, Header3) followed by three records of three values each.]

One very common way of passing data around in computer systems is the tabular format: multiple records that share the same fixed set of fields. Often, this kind of data is passed around in the dreaded CSV format, which more or less guarantees escaping and parsing problems.

Human-readable formats have some advantages, mostly that they are easier for humans to debug, but they also introduce numerous escaping problems, as the same set of characters has to be used for both control and content. So humans can read the data, but usually get it wrong. With the advent of Unicode text, the idea of human-readable formats has become even more problematic, as most text editors have trouble handling such text.

As I mentioned in an earlier post, the big irony is that ASCII provides all the control characters necessary for building a file with tabular records. With such a format, there is no need for escaping: the separators are dedicated control characters that are forbidden within the records.

So let me propose the old-school record format. It has the following characteristics:

  • The file is encoded as UTF-8, optionally preceded by a BOM sequence.
  • Text fields can contain any human readable character.
  • Text fields can only contain the following control characters in the 0-31 control range: Horizontal Tab (0x09), Line Feed (0x0A), Carriage Return (0x0D).
  • The structural control characters are Start of Header (0x01), Start of Text (0x02), Group Separator (0x1D) and Record Separator (0x1E).
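
The rule about which control characters a text field may contain can be checked in a few lines. This is only a sketch in Python: the name `is_valid_field` is made up for illustration, and it only looks at the 0-31 range, since the list above says nothing about DEL (0x7F) or the C1 controls.

```python
# Control characters a text field is allowed to contain:
# Horizontal Tab, Line Feed, Carriage Return.
ALLOWED_CONTROLS = {0x09, 0x0A, 0x0D}


def is_valid_field(text: str) -> bool:
    """Return True if text has no control character in the 0-31 range
    other than tab, line feed and carriage return.

    Hypothetical helper: characters above 31 are all accepted here.
    """
    return all(ord(ch) >= 0x20 or ord(ch) in ALLOWED_CONTROLS for ch in text)
```

A field such as `"a\tb\nc"` passes, while anything containing a separator like `"\x1e"` is rejected, which is exactly why no escaping is needed.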

The file itself has two parts: the first contains the headers, the second contains the actual data. The header part starts with the Start of Header character (0x01), contains C header names separated by C – 1 Group Separator (0x1D) characters, and ends with a Record Separator (0x1E).

The data part starts with the Start of Text (0x02) character and contains L records separated by L – 1 Record Separator (0x1E) characters. Each record contains C fields separated by C – 1 Group Separator (0x1D) characters.
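
To make the layout concrete, here is a minimal Python sketch of a writer and a reader. The function names `encode` and `decode` are mine, not part of any library, and I am assuming that an empty data part means zero records, which the description above leaves open.

```python
SOH = "\x01"  # Start of Header: opens the header part
STX = "\x02"  # Start of Text: opens the data part
GS = "\x1d"   # Group Separator: between fields
RS = "\x1e"   # Record Separator: between records, and after the header


def encode(headers, records):
    """Serialize header names and records (lists of strings) to UTF-8 bytes."""
    head = SOH + GS.join(headers) + RS
    body = STX + RS.join(GS.join(record) for record in records)
    return (head + body).encode("utf-8")


def decode(data):
    """Parse bytes back into (headers, records)."""
    text = data.decode("utf-8-sig")  # utf-8-sig drops an optional BOM
    if not text.startswith(SOH):
        raise ValueError("missing Start of Header")
    # The header's closing RS is immediately followed by STX.
    head, sep, body = text[1:].partition(RS + STX)
    if not sep:
        raise ValueError("missing Start of Text")
    headers = head.split(GS)
    # Assumption: an empty data part is read as zero records.
    records = [record.split(GS) for record in body.split(RS)] if body else []
    return headers, records
```

Note that the parser is nothing more than two `split` calls: commas, quotes and line breaks inside a field need no special treatment.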

The main issue with this format is that it cannot be handled by text editors, but then again, writing tools that convert between this format and others is pretty trivial. So would be adding support to a text editor…
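
As a rough illustration of how small such a conversion tool can be, here is a Python sketch that turns CSV text into this format using the standard `csv` module. The name `csv_to_osr` is hypothetical, the first CSV row is assumed to hold the header names, and the fields are not checked for forbidden control characters.

```python
import csv
import io

GS = "\x1d"  # Group Separator: between fields
RS = "\x1e"  # Record Separator: between records


def csv_to_osr(csv_text: str) -> bytes:
    """Convert CSV text to old-school record format bytes.

    Assumes the first CSV row contains the header names.
    """
    rows = list(csv.reader(io.StringIO(csv_text, newline="")))
    if not rows:
        raise ValueError("CSV input has no header row")
    header, records = rows[0], rows[1:]
    out = (
        "\x01" + GS.join(header) + RS                     # header part
        + "\x02" + RS.join(GS.join(r) for r in records)   # data part
    )
    return out.encode("utf-8")
```

All the quoting logic lives on the CSV side; once the fields are parsed, writing them out is plain string joining.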
