Text encodings, the original sin


While text is one of the simplest forms of data a computer can manipulate, it is also one of the most misunderstood: many competent computer scientists get confused by Unicode and the various encodings. One reason for this is the original sin of text processing: assuming that a character is a byte. This assumption is encoded in many languages (C, Python) and in the minds of many programmers, and it is the cause of many bugs.

At its core, a computer does not know what a character is; it just manipulates numbers, so one has to build a convention saying that this number is that character. The choice is largely arbitrary: one encoding might decide that ‘M’ should be encoded as 0x0C, 0x1C, 0x4D, or 0xD4. Alphabetical order, Baudot code, ASCII, or EBCDIC: choose your standard.
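To make this concrete, here is a small Python sketch showing the same letter mapped to different numbers by two of those conventions (cp500 is one of Python's built-in EBCDIC code pages):

```python
# The same letter 'M' maps to different numbers under different conventions.
ascii_byte = 'M'.encode('ascii')    # ASCII: 'M' is 0x4D
ebcdic_byte = 'M'.encode('cp500')   # EBCDIC (code page 500): 'M' is 0xD4

print(hex(ascii_byte[0]))   # 0x4d
print(hex(ebcdic_byte[0]))  # 0xd4
```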

This is the reason why C library calls like fopen let you specify whether you want to open your file in text mode or in binary mode (using the b flag). The problem of converting between different binary encodings of text already existed at that point in time, but this has been all but forgotten, and nowadays there is no difference between binary and text mode on Unix systems (Windows still translates line endings in text mode).
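The distinction survives in higher-level languages too. As a sketch, Python's open separates text mode (which decodes bytes into characters) from binary mode (which returns raw bytes); the temporary file here is just a placeholder:

```python
import os
import tempfile

# Write a short text file, then read it back in both modes.
fd, path = tempfile.mkstemp()
os.close(fd)
with open(path, 'w', encoding='utf-8') as f:
    f.write('héllo\n')

with open(path, 'rb') as f:                   # binary mode: raw, undecoded bytes
    raw = f.read()
with open(path, 'r', encoding='utf-8') as f:  # text mode: decoded characters
    text = f.read()
os.remove(path)

print(type(raw), type(text))  # <class 'bytes'> <class 'str'>
```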

EBCDIC and Baudot code faded away, and ASCII and its variants came to dominate. ASCII assigns each character a number in the 0x00–0x7F range, which is fine for the characters commonly used in the English language, and sucks for mostly everything else. Also note that the way these 7-bit characters are stored on systems that typically work with 8-bit bytes is pretty wasteful: the 8th bit is just left blank. Technically, one could pack 8 ASCII characters into 7 bytes. Interestingly, among the first 16 code points, only 0x0D is actually used; all the others were not meant to represent text but to control a teletype, and have died out.

So Unicode was invented. It assigns each character (for some definition of character) a number, called a code point, which currently goes up to 0x10FFFF. This fixed the question of which character has which number. The problem now is how to map a sequence of numbers that can be pretty large to a bunch of bytes. Basically, this is about choosing an encoding.
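In Python, ord and chr expose the code point directly, independently of any byte encoding:

```python
# ord gives the Unicode code point of a character, chr does the reverse.
print(hex(ord('A')))   # 0x41
print(hex(ord('中')))  # 0x4e2d
print(chr(0x1F642))    # 🙂
```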

One way to solve the problem is to reproduce the original sin of ASCII: just decide that some range is good enough, and move on. For instance, Latin-1 (ISO 8859-1) maps the characters in the Unicode range 0x00–0xFF to the equivalent bytes; this range contains most Western European characters. Done. This has the advantage that all valid ASCII-encoded text is also valid Latin-1 text.
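A Python sketch of that scheme: each Latin-1 character becomes the byte with the same value as its code point, ASCII text is unchanged, and characters outside the range simply cannot be encoded:

```python
# 'é' is code point 0xE9, and Latin-1 encodes it as the single byte 0xE9.
assert ord('é') == 0xE9
assert 'é'.encode('latin-1') == b'\xe9'

# ASCII text is unchanged under Latin-1.
assert 'hello'.encode('latin-1') == 'hello'.encode('ascii')

# Characters above 0xFF are simply not representable.
try:
    '中'.encode('latin-1')
except UnicodeEncodeError:
    print('not representable in Latin-1')
```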

UCS-2 took the same approach, but at a greater scale: all code points in the 0x0000–0xFFFF range were mapped to the equivalent pairs of bytes, which of course meant that there would be two variants of UCS-2, big-endian and little-endian. A special character called the byte order mark (BOM) was added to detect which was which.
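A sketch of the BOM at work in Python: encoding the BOM character (U+FEFF) followed by text yields bytes whose order reveals the endianness, and the generic utf-16 decoder uses it to pick the right variant:

```python
# Encode 'A' preceded by the byte order mark, in both byte orders.
be = '\ufeffA'.encode('utf-16-be')  # big-endian
le = '\ufeffA'.encode('utf-16-le')  # little-endian
print(be)  # b'\xfe\xff\x00A'
print(le)  # b'\xff\xfeA\x00'

# The generic utf-16 decoder reads the BOM and picks the byte order itself.
assert be.decode('utf-16') == 'A'
assert le.decode('utf-16') == 'A'
```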

UTF-8 is probably the smartest encoding of code points into bytes. It is a variable-length encoding with nice properties: values in the ASCII range are encoded as single bytes, i.e. the encoding is compatible with ASCII, and all code points above 0x80 are encoded as multi-byte sequences. The format of each byte makes it possible to determine its position within a multi-byte sequence. Note that UTF-8 is less efficient than Latin-1 for European texts (non-ASCII characters require two bytes instead of one), and pretty bad for Asian text (3 bytes per character instead of 2). UTF-8 also makes text manipulation difficult, as the byte position of each character in the stream depends on the characters that precede it.
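These properties are easy to check in Python: ASCII characters stay one byte, other characters grow to two, three, or four bytes, and every continuation byte carries the 10xxxxxx marker that lets a decoder find sequence boundaries:

```python
# One code point, one to four bytes.
for ch in ('A', 'é', '中', '🙂'):
    print(ch, len(ch.encode('utf-8')))
# A 1, é 2, 中 3, 🙂 4

# ASCII compatibility: the UTF-8 bytes of ASCII text are the ASCII bytes.
assert 'hello'.encode('utf-8') == 'hello'.encode('ascii')

# Continuation bytes all match 0b10xxxxxx, so a decoder can tell
# where a multi-byte sequence starts from any byte in the stream.
body = '中'.encode('utf-8')
assert all(b & 0xC0 == 0x80 for b in body[1:])
```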

Because of this, many languages, including Java and JavaScript, use the UTF-16 encoding. UTF-16 is basically an extension of UCS-2: most code points are represented as 16-bit quantities, and values above 0xFFFF are represented using a special encoding over two 16-bit values, called a surrogate pair. This means that code points from 0x10000 up are encoded on 32 bits.
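A Python sketch of a surrogate pair: 🙂 (U+1F642) lies above 0xFFFF, so UTF-16 splits it into the two 16-bit values 0xD83D and 0xDE42:

```python
import struct

# U+1F642 lies above 0xFFFF, so UTF-16 needs a surrogate pair for it.
data = '🙂'.encode('utf-16-be')
high, low = struct.unpack('>HH', data)
print(hex(high), hex(low))  # 0xd83d 0xde42
assert 0xD800 <= high <= 0xDBFF  # high (leading) surrogate
assert 0xDC00 <= low <= 0xDFFF   # low (trailing) surrogate

# Reassemble the code point from the pair.
code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
assert code_point == 0x1F642
```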

The main source of confusion I have observed with computer scientists is that they mix up string objects in the memory of a computer and their byte representation. It is quite common for a program to use the language’s in-memory representation of strings and a different byte representation when doing I/O on that data. Also, having a bunch of bytes of type text does not tell you in which representation said text is: on a web page, or in an e-mail, there are headers that specify this, but not in typical operating-system files.
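The classic symptom of that mix-up is mojibake: encode a string one way, decode the bytes another way, and the in-memory string you get back is different. A Python sketch:

```python
s = 'é'                             # one in-memory string object
utf8_bytes = s.encode('utf-8')      # b'\xc3\xa9'
latin1_bytes = s.encode('latin-1')  # b'\xe9'
assert utf8_bytes != latin1_bytes   # same text, different byte representations

# Decoding with the wrong convention silently produces garbage.
print(utf8_bytes.decode('latin-1'))  # Ã© (mojibake)
assert utf8_bytes.decode('latin-1') == 'Ã©'
```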

4 thoughts on “Text encodings, the original sin”

  1. Small correction: ASCII goes up to 0x7F, not just 0x79 — lowercase ‘z’, for instance, is 0x7A. Also, a few other characters in the 0x00-0x1F range are still used, like TAB (0x09) and LF (0x0A).

  2. Corrected for the end of the ASCII range. As for tab and line feed, they are slowly dying out: tab because its meaning is basically a random number of spaces, and line feed because it has become synonymous with carriage return.

  3. I prefer tabs over “random spaces” for indenting when writing source code, but I suppose this is a matter of opinion – there are good arguments both ways.

    But using spaces instead of tabs (and proper tab stops in the paragraph format) when aligning normal text in a table is a serious error. Just change the font or font size and the table will become misaligned. (Sorry, pet peeve…)

  4. Fair enough, but for tabs to work, the values for the tab stops need to be defined properly outside of the text flow, or the system needs to properly support HTS (0x88).
