Falsehoods programmers believe about text

Text processing is an old problem in computer science; most programming tutorials do some form of text manipulation, if only to send a “Hello World” back to the user. This does not mean that it is an easy problem, and there are many misconceptions about text and its representation. Here is a list – it is certainly not exhaustive and probably not original, just stuff I noticed (and that I need to get off my chest).

English text can be transcribed using American Standard Code for Information Interchange (ASCII).
ASCII has no representation for the thorn letter (Þ), the long s (ſ), or the diaeresis (as in naïve).
Modern English text can be transcribed using ASCII.
Typographic quotes “ and ” cannot be represented in ASCII.
C programs can handle ASCII data with no problem.
An ASCII block can contain the NUL (0x00) character in its midst. Many C programs cannot handle this: they stop parsing at that point (treating it as the end of the text), which might not be the expected behaviour.
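To make this concrete, here is a minimal C sketch (standard library only, nothing project-specific) of how the usual string functions silently stop at an embedded NUL:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* Six bytes of ASCII data with a NUL in the middle. */
        const char data[6] = {'a', 'b', 'c', '\0', 'd', 'e'};

        /* strlen() stops at the first NUL and reports 3, not 6;
           anything built on the str* functions silently drops "de". */
        printf("strlen says: %zu\n", strlen(data));   /* 3 */
        printf("actual size: %zu\n", sizeof data);    /* 6 */
        return 0;
    }
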
Languages like C require ASCII
For a long time, the C language offered trigraph facilities to encode characters that were not available in EBCDIC.
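For the curious, this is roughly what trigraphs look like. Most modern compilers only honour them with a flag such as GCC's -trigraphs, and C23 removed them altogether, so treat this as a historical sketch rather than something to compile today:

    ??=include <stdio.h>                    /* ??= stands for # */

    int main(void)
    ??<                                     /* ??< and ??> stand for { and } */
        printf("trigraphs at work??/n");    /* ??/ stands for backslash, so this is \n */
        return 0;
    ??>
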
ASCII does not encode code-points above 127
ANSI escape sequences allow for encoding C1 control characters, which are in the 128 – 159 range. Even pure ASCII, which only uses 7 bits, can contain escaped C1 control characters.
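As a small illustration, the 7-bit sequence ESC [ (0x1B 0x5B) is the escaped form of the C1 control CSI (0x9B); whether anything visible happens with the sketch below depends entirely on the terminal it runs in:

    #include <stdio.h>

    int main(void) {
        /* ESC (0x1B) followed by '[' conveys the C1 control CSI using
           only 7-bit bytes; the rest selects bold and underline. */
        printf("\x1B[1mbold\x1B[0m and \x1B[4munderlined\x1B[0m text\n");
        return 0;
    }
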
ASCII data represents text
The bell character does not represent textual information, but sound (or blinking the screen).
Interpreting ASCII data is easy.
ASCII data specifies many obscure control characters in the 0 – 31 range that nearly nobody knows how to interpret.
Interpreting common ASCII data is defined
The width of the white space represented by a tab character (9) is not defined. The way carriage return (13) and line-feed (10) are used in text files is still a matter of debate.
An ASCII text is valid UTF-8
ASCII data is always structurally valid UTF-8, but it is not guaranteed to be interchange-valid UTF-8, as the meaning of many ASCII control characters, like bell (0x07), is not defined by Unicode.
A Unicode file that only contains ASCII letters, encoded in UTF-8, is a valid ASCII file.
Unicode allows for an optional Byte Order Mark (BOM) character at the start of the stream. Encoded in UTF-8, this will be the byte sequence EF BB BF, which is not valid ASCII. This causes some issues when moving data from systems that recognise the BOM (Windows) to systems that do not (Unix).
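A minimal sketch of what such a stream looks like at the byte level (plain standard C, no Windows specifics):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "Hi" preceded by the UTF-8 encoded BOM (U+FEFF -> EF BB BF). */
        const unsigned char data[] = {0xEF, 0xBB, 0xBF, 'H', 'i'};

        if (sizeof data >= 3 && memcmp(data, "\xEF\xBB\xBF", 3) == 0)
            printf("starts with a UTF-8 BOM, so it is not valid ASCII\n");

        for (size_t i = 0; i < sizeof data; i++)
            printf("%02X ", (unsigned)data[i]);   /* EF BB BF 48 69 */
        printf("\n");
        return 0;
    }
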
ISO-Latin-1 can represent Western European text
The French o-e ligature character (œ) cannot be represented in ISO-Latin-1 (ISO 8859-1), nor can the German capital sharp s (ẞ).
Windows uses ISO-Latin-1 encoding
Windows often uses the Windows-1252 encoding, which is a variation of ISO-Latin-1 that adds more Western Latin characters in the range occupied by the C1 control characters in ISO-Latin-1.
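For reference, a few of the Windows-1252 assignments in that range (the sketch below only lists a handful of code points, the full block has more):

    #include <stdio.h>

    /* Some Windows-1252 bytes in 0x80-0x9F and the Unicode characters they map
       to; ISO-Latin-1 maps the same bytes to the C1 controls U+0080-U+009F. */
    static const struct { unsigned char byte; unsigned cp; const char *name; } cp1252[] = {
        {0x80, 0x20AC, "EURO SIGN"},
        {0x85, 0x2026, "HORIZONTAL ELLIPSIS"},
        {0x91, 0x2018, "LEFT SINGLE QUOTATION MARK"},
        {0x92, 0x2019, "RIGHT SINGLE QUOTATION MARK"},
        {0x93, 0x201C, "LEFT DOUBLE QUOTATION MARK"},
        {0x94, 0x201D, "RIGHT DOUBLE QUOTATION MARK"},
        {0x99, 0x2122, "TRADE MARK SIGN"},
    };

    int main(void) {
        for (size_t i = 0; i < sizeof cp1252 / sizeof cp1252[0]; i++)
            printf("0x%02X -> U+%04X %s (ISO-Latin-1: C1 control U+%04X)\n",
                   (unsigned)cp1252[i].byte, cp1252[i].cp, cp1252[i].name,
                   (unsigned)cp1252[i].byte);
        return 0;
    }
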

ISO Latin-1 can be mapped directly into Unicode
This is true for real ISO Latin-1, but given the confusion between ISO Latin-1 and Windows-1252, HTML5 now recommends interpreting ISO Latin-1 as Windows-1252, so the code-points in the C1 range need to be re-mapped.
Text can be represented in one way in Unicode
Accented characters like ‘é’ typically have two representations: one where the accent is composed with the letter, here the code point U+00E9, and one where it is decomposed into a letter and an accent, here the sequence U+0065 U+0301.
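Because both forms render identically, the difference only shows up when you compare bytes. A minimal C sketch (no normalisation library involved, just the raw UTF-8 bytes):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "é" composed (U+00E9): UTF-8 bytes C3 A9. */
        const char composed[]   = "\xC3\xA9";
        /* "é" decomposed (U+0065 U+0301): UTF-8 bytes 65 CC 81. */
        const char decomposed[] = "\x65\xCC\x81";

        /* Both display as "é", yet a byte comparison says they differ. */
        printf("same bytes? %s\n", strcmp(composed, decomposed) == 0 ? "yes" : "no");
        printf("%zu bytes vs %zu bytes\n", strlen(composed), strlen(decomposed));
        return 0;
    }
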
Character decomposition is a Latin-script problem
Korean Hangul also has decomposed forms.
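Hangul decomposition is even fully algorithmic. A small sketch of the standard arithmetic for one syllable (한, U+D55C), using the constants from the Unicode specification:

    #include <stdio.h>

    int main(void) {
        unsigned syllable = 0xD55C;          /* U+D55C HANGUL SYLLABLE HAN */
        unsigned index = syllable - 0xAC00;  /* offset into the precomposed block */

        unsigned lead  = 0x1100 + index / (21 * 28);         /* leading consonant */
        unsigned vowel = 0x1161 + (index % (21 * 28)) / 28;  /* vowel */
        unsigned tail  = index % 28;                         /* trailing consonant, 0 if none */

        printf("U+%04X decomposes to U+%04X U+%04X", syllable, lead, vowel);
        if (tail != 0)
            printf(" U+%04X", 0x11A7 + tail);
        printf("\n");   /* U+D55C decomposes to U+1112 U+1161 U+11AB */
        return 0;
    }
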
English text can be represented in one way in Unicode
Unicode has ligature characters for the sequences ff, fi, fl, ffi, ffl. These are output by tools like LaTeX, even for English.
Each ASCII character displayed with a mono-space font has the same width
The tab character can have the width of multiple characters.
Ignoring control and combining characters, Unicode characters have the same width with a mono-space font
Asian characters (often called full-width) are typically wider than Latin letters (often by 50% or more), even in mono-space fonts. See for instance:
Cat
日本語
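On POSIX systems you can check this with wcwidth(), which reports how many terminal columns a character occupies; the result depends on the locale and the platform's Unicode tables, so take the sketch below as indicative only:

    #define _XOPEN_SOURCE 700   /* wcwidth() is POSIX, not ISO C */
    #include <locale.h>
    #include <stdio.h>
    #include <wchar.h>

    int main(void) {
        setlocale(LC_ALL, "");  /* needs a UTF-8 (or other multi-byte) locale */

        printf("columns for 'a': %d\n", wcwidth(L'a'));    /* 1 */
        printf("columns for '日': %d\n", wcwidth(L'日'));  /* 2 on most systems */
        return 0;
    }
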
Full-width characters only represent Asian characters
There is a full-width version of the printable ASCII range U+0021 to U+007E, starting at U+FF01.
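The mapping is a simple offset, so converting printable ASCII to its full-width form is just arithmetic plus UTF-8 encoding. A rough sketch (put_utf8 is my own little helper, not a library function):

    #include <stdio.h>

    /* Encode a BMP code point as UTF-8. */
    static void put_utf8(unsigned cp) {
        if (cp < 0x80) {
            putchar((int)cp);
        } else if (cp < 0x800) {
            putchar(0xC0 | (cp >> 6));
            putchar(0x80 | (cp & 0x3F));
        } else {
            putchar(0xE0 | (cp >> 12));
            putchar(0x80 | ((cp >> 6) & 0x3F));
            putchar(0x80 | (cp & 0x3F));
        }
    }

    int main(void) {
        /* Printable ASCII U+0021..U+007E maps to U+FF01..U+FF5E:
           fullwidth = 0xFF00 + (c - 0x20). */
        const char *ascii = "Hello!";
        for (const char *p = ascii; *p; p++)
            put_utf8(0xFF00 + (unsigned)(*p - 0x20));
        putchar('\n');   /* prints Ｈｅｌｌｏ！ */
        return 0;
    }
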
Non-control characters are either half-width or full-width
The character ﷽ (Arabic Ligature Bismillah ar-Rahman ar-Raheem, U+FDFD) is very wide.
A terminal can display Unicode text
Terminals are typically character-oriented, so even if a terminal is set up in UTF-8 mode and all the fonts are installed, scripts that rely on ligatures (like Arabic) will not display properly.
UTF-8 can be safely manipulated at the byte level
Any substring operation that cuts a multi-byte character will yield invalid UTF-8.
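A minimal illustration: truncating a UTF-8 string at an arbitrary byte offset can leave a dangling lead byte behind:

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        /* "naïve" in UTF-8: 6E 61 C3 AF 76 65 (the ï takes two bytes). */
        const char *word = "na\xC3\xAFve";

        /* A byte-level "substring" of the first three bytes... */
        char cut[4];
        memcpy(cut, word, 3);
        cut[3] = '\0';

        /* ...ends with the lone lead byte 0xC3, which is not valid UTF-8. */
        unsigned char last = (unsigned char)cut[2];
        if ((last & 0xC0) == 0xC0)
            printf("truncated in the middle of a character (dangling lead byte 0x%02X)\n",
                   (unsigned)last);
        return 0;
    }
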
Languages like Java have Unicode support
Java and JavaScript were designed around the now abandoned UCS-2 encoding, which assumed that all Unicode characters would fit in 16 bits, i.e. the U+0000 to U+FFFF range. They don’t. Java and JavaScript now use the UTF-16 encoding, which means their character type might represent only a fraction of a character (surrogate pairs).
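The arithmetic behind surrogate pairs fits in a few lines; the resulting pair is why a Java or JavaScript string holding a single 😀 reports a length of 2:

    #include <stdio.h>

    int main(void) {
        /* U+1F600 (GRINNING FACE) lies above U+FFFF, so UTF-16 needs two
           16-bit code units for it: a surrogate pair. */
        unsigned cp = 0x1F600;
        unsigned v  = cp - 0x10000;            /* 20 bits to split in two */
        unsigned hi = 0xD800 + (v >> 10);      /* high surrogate */
        unsigned lo = 0xDC00 + (v & 0x3FF);    /* low surrogate */

        printf("U+%X -> UTF-16 pair %04X %04X\n", cp, hi, lo);
        /* prints: U+1F600 -> UTF-16 pair D83D DE00 */
        return 0;
    }
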
Splitting at code-point boundary is safe
In general, cutting between a combining character and the character it combines with (say, an accent) will not yield the expected result. In UTF-16, you might cut surrogate pairs. Certain ranges, like the Emoji flag range, are only defined for character pairs.
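A small sketch of the combining-character case (the Emoji flag case is analogous, with two regional-indicator code points that only mean something as a pair):

    #include <stdio.h>

    int main(void) {
        /* Decomposed "é": 'e' (U+0065) followed by COMBINING ACUTE ACCENT
           (U+0301, UTF-8 bytes CC 81). */
        const char *decomposed = "e\xCC\x81";

        /* Splitting after the first code point yields structurally valid
           UTF-8, but it strips the accent from its base letter. */
        printf("first code point alone: %c\n", decomposed[0]);   /* plain "e" */

        /* Likewise, U+1F1EB U+1F1F7 together display as the French flag;
           either regional indicator on its own is just a boxed letter. */
        return 0;
    }
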
A Unicode code-point can only be represented in one way in UTF-8
In theory this is true, but some systems wrongly convert UTF-16 into UTF-8 by encoding surrogate characters directly, instead of decoding them and encoding them as UTF-8. Some UTF-8 decoders accept this mistake. They might also decode UTF-8 encoded Windows-1252 code-points into their respective characters.
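For comparison, here are the two byte sequences side by side; a strict decoder accepts only the first, while a lenient one maps both to the same character:

    #include <stdio.h>

    int main(void) {
        /* U+1F600 encoded properly as UTF-8: one four-byte sequence. */
        const unsigned char utf8[]  = {0xF0, 0x9F, 0x98, 0x80};
        /* The same character with its UTF-16 surrogates (D83D, DE00) each
           encoded as if it were an ordinary code point (so-called CESU-8). */
        const unsigned char cesu8[] = {0xED, 0xA0, 0xBD, 0xED, 0xB8, 0x80};

        printf("strict UTF-8:      ");
        for (size_t i = 0; i < sizeof utf8; i++)  printf("%02X ", (unsigned)utf8[i]);
        printf("\nsurrogate-encoded: ");
        for (size_t i = 0; i < sizeof cesu8; i++) printf("%02X ", (unsigned)cesu8[i]);
        printf("\n");
        return 0;
    }
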
Unicode does not handle formatting
Even if you ignore ANSI escape sequences (which allow underlining and bolding of text) and variation selectors (which select between the colour and black-and-white versions of Emoji), there are ranges that duplicate ASCII with various style attributes (italic, bold). I wrote markless, a tool that renders Markdown data using just such Unicode tricks.
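A rough sketch of the kind of trick involved, shifting ASCII letters into the Mathematical Alphanumeric Symbols block to get “bold” text (the put_utf8 helper is mine, and this is not claimed to be markless's actual code):

    #include <stdio.h>

    /* Encode a code point as UTF-8. */
    static void put_utf8(unsigned cp) {
        if (cp < 0x80) {
            putchar((int)cp);
        } else if (cp < 0x800) {
            putchar(0xC0 | (cp >> 6));
            putchar(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            putchar(0xE0 | (cp >> 12));
            putchar(0x80 | ((cp >> 6) & 0x3F));
            putchar(0x80 | (cp & 0x3F));
        } else {
            putchar(0xF0 | (cp >> 18));
            putchar(0x80 | ((cp >> 12) & 0x3F));
            putchar(0x80 | ((cp >> 6) & 0x3F));
            putchar(0x80 | (cp & 0x3F));
        }
    }

    int main(void) {
        /* A-Z map to U+1D400-U+1D419, a-z to U+1D41A-U+1D433 (bold). */
        const char *text = "Bold";
        for (const char *p = text; *p; p++) {
            if (*p >= 'A' && *p <= 'Z')      put_utf8(0x1D400 + (unsigned)(*p - 'A'));
            else if (*p >= 'a' && *p <= 'z') put_utf8(0x1D41A + (unsigned)(*p - 'a'));
            else                             putchar(*p);
        }
        putchar('\n');   /* prints 𝐁𝐨𝐥𝐝 */
        return 0;
    }
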
There is a standard to encode Unicode characters in ASCII
There are many: HTML has three (named entities, hex-encoded, decimal-encoded), the C programming language has one (based on UTF-8), and C99 has another. URLs and CSS use different schemes, and so do e-mails.
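To see how incompatible these schemes are, here is the same character, é (U+00E9), spelled out in several of them (each line is just printed text, nothing is parsed or validated):

    #include <stdio.h>

    int main(void) {
        printf("HTML named entity:      &eacute;\n");
        printf("HTML hex entity:        &#xE9;\n");
        printf("HTML decimal entity:    &#233;\n");
        printf("C99 universal name:     \\u00E9\n");
        printf("URL percent-encoding:   %%C3%%A9   (of the UTF-8 bytes)\n");
        printf("CSS escape:             \\E9 \n");
        printf("E-mail encoded-word:    =?UTF-8?Q?=C3=A9?=\n");
        return 0;
    }
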
ASCII and its variants can be mapped into Unicode
Some obscure variants like PETSCII contain graphical characters that are not mapped (yet).
Unicode data is meant to be represented in black and white
Emoji characters are typically in colour.
Unicode data is meant to be represented visually
The range U+2800 to U+28FF contains braille patterns.

4 thoughts on “Falsehoods programmers believe about text”

  1. Enjoyed the article — I like all the various “Falsehoods programmers believe…”

    Falsehood: ASCII and ANSI are the same. For “ASCII Does not encode code-points above 127” you say “ANSI escape sequences allow for encoding C1 control characters…”
    Well, ASCII does not equal ANSI. ASCII, or more properly now “US-ASCII”, is seven bits, encoding 128 values as 0-127 within an 8-bit byte. See RFC 20 (https://tools.ietf.org/html/rfc20). If the high bit is set, it is not ASCII, it is something else.

    By the way, I know how to interpret all 0-31 control characters, so “nobody knows how to interpret” is untrue. They are less well known, but hardly “obscure”, and are well defined in the same RFC 20 at § 4.1 “Control Characters”. These are largely meant for on-the-wire protocols, including printing and TTY. Some of them are meant for device-specific use so, strictly speaking, you can say “nobody knows how to interpret” those, but that is intentional.

    This also somewhat invalidates your rebuttal to “C programs can handle ASCII text with no problem.” You use “the NULL (0x00) character” as your counter-example (actually, it’s not “NULL”, it’s “NUL” or “the null character”). NUL is not ASCII text, it is an ASCII control character, and C uses it as a control to mark the end of a string. NUL would never appear in “text”; however, regardless of these truths, C programs do need to be careful because a char * is not necessarily a “string”.

  2. Excluding control characters, a string with more characters is physically wider or the same width?

    It seems obvious, but it’s not true. Arabic script uses different letter forms depending on position in a word, so adding a character can change the previous one and make the whole string narrower. Tom Scott theorised that this, in reverse, was the cause of an iPhone crash bug.
