Unicode code

One of the strange things about programming is that it did not really follow the trend in computing. Things have changed a lot since the Commodore 64: my smartphone is a few orders of magnitude more powerful; if I edit an image, I don’t manipulate individual pixels directly; and I certainly don’t peek and poke into memory directly. Yet when I code, it’s pretty much the same: instead of PETSCII, in 2023 I’m using ASCII.

There have been multiple attempts at using a graphical user interface to represent code, and they failed; to be honest, I cannot say how a graphical representation would help me write or understand code. What I do know is that the text I use in everyday computing has evolved over the last 40 years, while the text used for code has not. On all my devices, I can quickly enter characters from a wide repertoire, with powerful autosuggestion. But when coding, I’m magically transported back into the 80s. The thing is, I’m not coding in BASIC or Pascal anymore.

Let’s look at a Python snippet:

r = ((foo((x + 6) * 5, 4),) for x in range(1, 10))

One problem when reading this code is that the parenthesis symbol means four different things:

  • Grouping for operator precedence (like in maths)
  • Function invocation (like in many other programming languages)1
  • Tuple construction (more a Python thing)
  • Generator construction (another Python thing)
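To make the overloading concrete, here is a minimal sketch; foo is a stand-in for any two-argument function, not something from the snippet above:

```python
def foo(a, b):          # stand-in function, assumed for illustration
    return a + b

grouped = (1 + 2)       # grouping only: this is the int 3
tupled = (1 + 2,)       # one trailing comma: now the 1-tuple (3,)
assert grouped == 3 and tupled == (3,)

# Call, grouping, tuple and generator construction, all with the same bracket:
r = ((foo((x + 6) * 5, 4),) for x in range(1, 10))
assert next(r) == (39,)  # foo((1 + 6) * 5, 4) = 35 + 4
```

A single trailing comma is all that separates “precedence grouping” from “tuple”, which is exactly the kind of visual ambiguity the argument is about.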

Not that this is a Python-specific problem; here is a C++ example.

auto f = [a, b, c]() -> int { return foo<int>(a >> 2) >= b->c;};

The greater-than (>) character has five different semantic meanings:

  • Return type specification
  • Template specification
  • Right bit-shift
  • Greater-than-or-equal comparison
  • Pointer member access

The core problem is that there are only so many punctuation symbols in the ASCII table; the most under-used ones in programming languages are the circumflex (0x5E) and the grave accent (0x60)2. To get around this, many languages have multiple-character operators, like **, --, && , >= or ->. But this is a kludge, and we have been here before: not so long ago, languages like C or Pascal had digraphs and trigraphs, sequences of two or three punctuation characters standing in for punctuation symbols that were missing from national variants of the character set.

Most style guides warn against overloading the same symbol for semantically different operations, as this creates cognitive load, yet this is what pretty much every language does, bringing its lot of inconsistencies. For instance, in Python you can call a method on a string literal: 'fnord'.title() is perfectly valid, and so is 1.5.as_integer_ratio(), but 1.as_integer_ratio() gives me a syntax error.
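The inconsistency comes from the tokenizer: it greedily reads 1. as the start of a float literal, leaving the method name dangling. A quick illustration:

```python
assert 'fnord'.title() == 'Fnord'            # method call on a string literal
assert (1.5).as_integer_ratio() == (3, 2)    # ...and on a float literal

# 1.as_integer_ratio() is a SyntaxError: "1." is lexed as a float.
# Parentheses (or, oddly, a space) disambiguate:
assert (1).as_integer_ratio() == (1, 1)
assert 1 .as_integer_ratio() == (1, 1)       # legal, if ugly
```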

In C++ you get mythological creatures, like the spaceship operator (<=>) and the downto operator3, i.e.

while (x --> 0) {
    // loop body
}

The Swift language has a ..< operator and Rust has ..=, so newer languages are not really improving the situation. At the current rate, the next generation of languages will define a []== operator for element-wise comparison, or something…

You could argue that this is fine, that we are producing perfectly good code with this character set. This position has multiple problems:

Status Quo
It’s tempting to claim that the status quo is good enough. The weird thing is, coding is never good enough: every few years there is a new paradigm, a new language, a new process that is supposed to fix things. In the last 40 years, I have seen many concepts come and go, like object-oriented programming and various modes of multiprocessing, but challenging an encoding from 1968 is too disruptive? I learned to code on a Commodore 64, where lowercase letters were something you only used in shifted mode, which was only used, well, for text processing; code was, of course, written in graphics mode. The compact command to load things from the drive was something like L┏ (L + shift-O), and there were no curly braces, of course. That character set was deemed sufficient for coding too: no sane programming language would assign semantics to identifier casing, or use curly braces.
The switch to ASCII was probably the least disruptive thing in my computing life…
Code is text
The way language designers and programmers look at languages conditions how they think about text, and this shows in the crappy Unicode support most languages have. It also leads to so-called human-readable data formats, which aren’t readable, are very inefficient, and are always one escaping mistake away from a security disaster. Basically, if the text assumptions baked into your programming language are from half a century ago, it will struggle with processing the text of today.
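Even in a language with decent Unicode strings, the half-century-old assumption that one character is one unit leaks through. A small Python illustration:

```python
import unicodedata

# One user-perceived character, two code points: a flag emoji is a pair
# of regional indicator symbols, so len() does not count what the user sees.
flag = "\U0001F1E8\U0001F1ED"   # 🇨🇭
assert len(flag) == 2

# "é" composed vs decomposed: identical to a human, unequal to ==
composed = "\u00E9"             # é as a single code point
decomposed = "e\u0301"          # e + combining acute accent
assert composed != decomposed
assert unicodedata.normalize("NFC", decomposed) == composed
```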
Data is code
A large chunk of code is generated: maybe from some IDL, maybe from comments in the code, or from arbitrary data files, many of which are text files. That data is not US-only (or it is US-only and uses those nasty typographic quotes, or umlauts). Now, if you generate unit-test cases from textual data and your test data is called ソニー株式会社 but your language cannot handle this, you need a text-to-text conversion, and your test case ends up with a name like E382BDE3838BE383BCE6A0AAE5BC8FE4BC9AE7A4BE. The first name is not readable by everybody, but it can be copy-pasted into a search box; the second is readable by nobody, and is not searchable. But it won’t scare people, and that’s the most important thing. I remember when Fortran only supported six-letter identifiers, so function names were just arbitrary indices into a paper manual…
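That hex blob is just the UTF-8 bytes of the name, and the mangling is a one-liner; ironically, Python 3 would accept the readable name as an identifier just fine:

```python
name = "ソニー株式会社"

# The encode-to-hex mangling that ASCII-only tooling forces on test names
mangled = name.encode("utf-8").hex().upper()
assert mangled == "E382BDE3838BE383BCE6A0AAE5BC8FE4BC9AE7A4BE"

# Python 3 identifiers are not limited to ASCII, so the readable
# test name is perfectly legal:
def test_ソニー株式会社():
    return True

assert test_ソニー株式会社()
```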

Code is culture
There is something ironic in many projects having style guides that promote inclusivity, yet relying on programming languages based on a very narrow culture. The language of science is already pretty narrow, but here we are not even allowed to use the symbols we learned at school: the constant is called kPi or Math.PI, not, you know, π. All these languages pride themselves on supporting lambda functions – a pretty obscure term if you did not study computer science – but won’t actually let you write one with the Greek letter. I find it pretty telling that the two under-used ASCII punctuation symbols are accents, only used in foreign languages.
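Amusingly, Python already allows most of this at the identifier level; it is only the keyword itself that stays ASCII. A quick sketch:

```python
import math

# Greek letters are valid Python identifiers (they are Unicode letters)...
π = math.pi
assert abs(π - 3.141592653589793) < 1e-15

# ...and even λ can name a function, but the *keyword* must still be
# spelled "lambda": there is no way to use λ as the syntax itself.
λ = lambda x: 2 * x
assert λ(21) == 42
```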
Coding conventions are the stuff of history books: the 80-character limit comes from a steampunk technology, the punch card. The reason i and j are index variables is that in the late 50s, the Fortran language decided that variables starting with a letter between I and N were integers. Somehow, using conventions from the Nixon era might send a message…

The real argument against using a richer character set is tooling, which basically means keyboards and editors. The thing is, if you cannot configure your editor to replace the sequence dash, greater-than with an arrow, it is probably a crappy editor, and any IDE that does some form of code completion should be able to handle this. And if your argument is that complex keystrokes are user-hostile while you are using vim or emacs, please go stand in the corner.

So what could be done? Well, rant, obviously 😀. In the case of C++, quite a few two- or three-character operators could be represented by a single Unicode character with only aesthetic changes. The current form would just be a legacy encoding, handled the same way digraphs and trigraphs were in the past. This would be completely transparent to users, and automatic tooling could rewrite existing code to the new form. Here are some examples.

C++ operator   Unicode character   Codepoint
<=             ≤                   U+2264
>=             ≥                   U+2265
!=             ≠                   U+2260
->             →                   U+2192
<<             «                   U+00AB
>>             »                   U+00BB
C++ multiple character operators mapped to Unicode
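A toy transliteration pass over the mappings above could look like the sketch below. It deliberately does blind textual replacement, so it would mangle string literals, comments and compound operators like <<=; a real tool would tokenize the source first:

```python
# Mappings from the table above; a real tool would operate on tokens,
# not raw text, to avoid touching string literals and comments.
REPLACEMENTS = [
    ("<=", "≤"), (">=", "≥"), ("!=", "≠"),
    ("->", "→"), ("<<", "«"), (">>", "»"),
]

def transliterate(source: str) -> str:
    for old, new in REPLACEMENTS:
        source = source.replace(old, new)
    return source

assert transliterate("if (a >= b && p->x != 0) { cout << a; }") \
    == "if (a ≥ b && p→x ≠ 0) { cout « a; }"
```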

In fact, I’m using a special font, Cascadia Code, that does something similar: it replaces these sequences with ligatures, displaying -> as a wide (two-character) arrow. I find this makes code more readable, and it does not change the character alignment.

Is there any chance that any of this will change? Honestly, no. While the coding community will happily adopt a new revolutionary darling language every few years, and a new disruptive framework for web user interfaces every new moon, it is actually a pretty conservative culture. In fact, I suspect there is quite a bit of cultural cringe at play: once you are a coder, it’s best to suppress any non-Anglo-Saxon culture. This would explain how a community that reinvents the wheel every quarter has not touched this core aspect of coding in half a century…

1For the sake of the argument, we will assume having constructors that look like function calls is something you want.
2I know, shells are programming languages, let’s not go there.
3I know there isn’t a down-to operator, just a lack of space between the -- operator and the greater than operator, but that’s kind of the point, isn’t it?
[Image: C64 PETSCII charts, public domain]
