Unicode Gender

🏃🏾‍♀

A few months ago, I wrote a blog post about the Unicode skin colour modifiers, which let you change the skin colour of certain characters. A new version of the standard has since come out, which specifies how to select the gender of characters. Interestingly, the mechanism used this time is different: instead of a gender modifier, it is implemented by merging a character with a gender symbol using the zero width joiner character (U+200D). The gender symbol is either ♀ (female, U+2640) or ♂ (male, U+2642).

Why is there an offset of two between the female and male signs? These are actually astronomical symbols for the planets. The female symbol is also used for Venus (and, in alchemy, copper) and the male symbol for Mars (and, in alchemy, iron). Between them sits the symbol for Earth (♁). It also means there are a few spare planets left to encode other genders. Many more alchemical symbols are encoded in the range U+1F700–U+1F773, so if you need the symbol for antimoniate 🜥, or another symbol for Earth 🜨, they are there.

Using an existing character with a clear semantic meaning gives nicer degradation: the combination 🏃︎ + ♀ is kind of understandable, if you know the gender / planet symbols; I had the impression these were not taught in the USA. What is annoying is that there are now two different mechanisms to affect the appearance of a given character. For instance, the Runner character (U+1F3C3) now has 12 variations: two genders (implemented using a zero width joiner plus a gender symbol) and six skin tones (implemented using skin colour modifier characters). The table below shows all the combinations (which might or might not work in your browser).

Gender Base Type 1-2 Type 3 Type 4 Type 5 Type 6
Female 🏃‍♀ 🏃🏻‍♀ 🏃🏼‍♀ 🏃🏽‍♀ 🏃🏾‍♀ 🏃🏿‍♀
Male 🏃‍♂ 🏃🏻‍♂ 🏃🏼‍♂ 🏃🏽‍♂ 🏃🏾‍♂ 🏃🏿‍♂
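The composition rule can be sketched in a few lines of Python. This is a sketch of the sequence structure only (base character, optional skin-tone modifier, zero width joiner, gender sign); note that real-world sequences often also append the emoji variation selector U+FE0F after the gender sign, which I omit here, and the helper name is mine:

```python
# Build emoji ZWJ sequences: base + optional skin-tone modifier + ZWJ + gender sign.
ZWJ = "\u200D"          # ZERO WIDTH JOINER
FEMALE = "\u2640"       # ♀ FEMALE SIGN (U+2640)
MALE = "\u2642"         # ♂ MALE SIGN (U+2642)
RUNNER = "\U0001F3C3"   # 🏃 RUNNER (U+1F3C3)
# Fitzpatrick skin-tone modifiers U+1F3FB..U+1F3FF (types 1-2 through 6).
SKIN_TONES = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]

def gendered(base, gender, skin=None):
    """Compose a gendered (and optionally skin-toned) emoji sequence."""
    return base + (skin or "") + ZWJ + gender

# The twelve runner variants of the table: 2 genders x (neutral + 5 skin tones).
variants = [gendered(RUNNER, g, s)
            for g in (FEMALE, MALE)
            for s in [None] + SKIN_TONES]
print(len(variants))  # 12
```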



Cosmetic Symbols

℮

Regular readers of this blog know that I am fascinated by the various symbols and icons found in our everyday life. In Europe, you will often find the following three symbols on packaging: the estimated sign (℮), the period-after-opening symbol, and the green dot. The first typically follows a weight or volume indication, the second usually contains a number followed by the letter ‘M’, and the last can be found on many types of packaging. What do they mean?

The first symbol is the estimated sign (℮). It is defined by the European Union and has its own Unicode character (U+212E). It specifies the precision of the quantity indicated on the container. The way this is defined is a bit smarter than a simple tolerance: the standard also specifies that the average quantity in a batch of the product cannot be less than the quantity indicated on the packaging. This ensures that producers do not systematically fill the product at the lower end of the error tolerance. Consider a manufacturer which can fill packages with a 2 ml precision: if the tolerance is 5% and the package is 200 ml, they could systematically fill the bottles with 192 ml and always be within the 5% error margin. A very long time ago, I worked in a factory counting machine parts, and this is exactly what happened with packages of screws.
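The two-part rule can be sketched as a compliance check. The 5% tolerance and the function name are my illustration (the actual EU thresholds vary with the nominal quantity), but the logic shows why filling every bottle at 192 ml fails even though each bottle individually passes the tolerance:

```python
# Sketch of the estimated-sign (℮) rule: individual packages must stay within
# a tolerance, AND the batch average must not fall below the nominal quantity.
def batch_ok(volumes_ml, nominal_ml, tolerance=0.05):
    lower = nominal_ml * (1 - tolerance)
    each_within = all(v >= lower for v in volumes_ml)
    average_ok = sum(volumes_ml) / len(volumes_ml) >= nominal_ml
    return each_within and average_ok

# Systematic filling at ~192 ml: every bottle is above the 190 ml floor,
# but the batch average is below 200 ml, so the batch fails.
print(batch_ok([192, 193, 191, 192], 200))  # False
print(batch_ok([201, 199, 202, 200], 200))  # True
```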

period after opening symbol

The second symbol is the period-after-opening (PAO) symbol. For products that have a very long shelf life (say, shampoo), it specifies how long the product can be used after opening the container, in months. So if your shampoo bottle has a 12M mark, it can be used for one year after opening.

green dot symbol

The third symbol is the green dot. It does not mean that the packaging is recycled, or recyclable, just that the producer joined the green dot scheme, which basically means they paid some fees. Depending on the country, packaging with that symbol can be put into a separate recycling collection – this is not the case in Switzerland.

Period After Opening (PAO) symbol – Public Domain.
The “Green Dot” – Public Domain.


Back to LaTeX

LaTeX page layout for the city of Ringstadt

A long time ago, I was quite involved in roleplaying. I played a lot with a game a friend wrote, Tigres Volants, and created some material for that setting. I recently decided to gather the material related to one city, named Ringstadt, do a page layout, and publish it as a PDF.

Maybe because this was old material and I was feeling geeky, I decided to use LaTeX again. I used that tool a lot during my academic years, and I felt it would be convenient for making a simple layout with some old text files. It would also let me use a version control system for the source. The source text is in RTF format, recovered from Word for Macintosh 5 format by running the whole thing in an emulator.

Converting from RTF to LaTeX sounded simple in theory; in practice, the first two command-line tools I tried crashed with a segfault. The third (unrtf) worked, but converted all non-ASCII characters to LaTeX escape sequences, and I had to fiddle with the configuration to get something readable. The text is in French, which means many accented characters, so I really wanted source text I could proofread.

The good news is, LaTeX did not change much in the ten years since I left academia; the bad news is, LaTeX did not change much. There are many things I stopped worrying about when using computers: text encodings, font management, image formats. LaTeX is pretty much stuck in the 90s: just to handle an input file in UTF-8 format, you need the following packages:

\usepackage[utf8]{inputenc}
\usepackage[T1]{fontenc}

Do you know what the T1 code means? It is the encoding of the font. It also means that while you can input your text in UTF-8 format, LaTeX will not support Unicode: if your input contains a character that is not part of the T1 table, say a double-ended arrow, compilation will fail with an obscure error message. If I want to use Cyrillic characters, I'll need to load another codepage. I don't think there is a way to tell LaTeX to just handle Unicode.

Error handling is another aspect where LaTeX stayed in the 90s. I remember the error messages from GCC at that time; they were not helpful either. Nowadays there is Clang, which gives you helpful error messages in colour, with hints.

I just wanted to do a page layout with the Helvetica font, French text, images, and floating boxes. I ended up with a header that includes 20 packages. Things kind of work (see the image), but floats are basically broken in LaTeX: you need to do everything manually, and they break at the first difficulty: page breaks, footnotes. LaTeX manages the impressive feat of having floats which are more broken than HTML's (I use the wrapfigure package).

We are not talking about an exotic feature: just boxes with text flowing around them. You see that in any magazine, and on many web pages. I was able to do this in Word 5.1 more than 20 years ago, and it worked more reliably than what I get with LaTeX. Apple's Pages, which I usually use for light word processing, can even handle non-rectangular floats, using the alpha channel of the image as a guide. You can also overlay floats on top of each other.

The main argument in favour of using LaTeX is that it does the right thing by default, but for French, even with the babel package, this is not really true. LaTeX will insert a space before double punctuation marks, but this is ugly: the proper thing to do would be to add a half-space. I could probably hack one package or another to get that result.

What struck me is how isolated LaTeX is from other systems: it does not use any operating system services for text processing, font management, image processing, or rendering, so you end up with a very big system (a 2 GB install) that is its own, old thing. Most of the things I complain about in this post were already mentioned in a wish-list post ten years ago.

I basically did one chapter of the document, and I'm faced with a simple choice: go on fighting with LaTeX, knowing the final layout will be pretty mediocre, or swallow my pride and just redo everything in Apple Pages…


Markless

Markless Screen Capture

Expressing rich text has been a thing since ASCII emerged as the standard for text representation. RTF, HTML and LaTeX all use the same idea of text with some form of escape sequences to express formatting. One format that is getting traction is Markdown, which is used, among other things, for documentation on GitHub.

Character encodings have always had some partial support for formatting, typically in the form of control codes. Unicode deprecated some of them (most code points in the C0 and C1 pages), but other forms of control remain: variation selectors, graphical characters, and character compositing.

So I wondered if it would be possible to render Markdown using only Unicode characters. Rendering Markdown using ANSI escape sequences is easy; some tools do it. What I wanted was pure text, i.e. something that could be copy-pasted into interfaces that only support text, or embedded into code comments.

The result is Markless, a small Python script that converts Markdown into Unicode text. Headings and code blocks are rendered using boxes built out of graphical characters, emphasis using compositing underlines. I must stress that this is a hack; it probably has more value as a Unicode stress test than as an actual tool. Here is a sample output:

╔══════════╗
║ Markless ║
╚══════════╝
Markless is a small tool (a h⃨a⃨c⃨k⃨ really) that renders mark-down as plain text, 
using Unicode modifiers characters.
• Emphasis is rendered using underline modifiers.
• Lists is rendered using pretty bullets.
  Continuation is supported.
• Headers and code are rendered in boxes.
▌Blockquote is rendered using block characters
▌▌Second level
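The compositing trick behind the emphasis rendering can be sketched in Python: following each letter with a combining character such as U+0332 (COMBINING LOW LINE) yields an "underline" that survives copy-paste as plain text. The helper name is mine, not Markless's actual API:

```python
# Render emphasis as pure Unicode text by interleaving a combining mark.
COMBINING_LOW_LINE = "\u0332"  # draws a low line under the preceding character

def underline(text):
    """Attach a combining low line to every character of `text`."""
    return "".join(ch + COMBINING_LOW_LINE for ch in text)

print(underline("emphasis"))  # e̲m̲p̲h̲a̲s̲i̲s̲ (if your font supports it)
```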

The tool is far from complete and only supports a fraction of the Markdown commands. The code and an up-to-date version are available on GitHub.



Emoji Skin Color

A pale skinned, dancer with black hair and a red dress

Skin colour is not something you traditionally associate with typography, yet Unicode has control characters for skin colour. More precisely, the Fitzpatrick modifiers (U+1F3FB to U+1F3FF) change the skin colour of the previous character. With no modifier, the emoji should display people with a Lego-yellow skin. Since many emoji characters also support the variation selector control characters, many characters now have 7 variants: a text variant, a neutral (yellow) emoji, and five skin-coloured variants. The text and emoji variants are not very consistent: the runner changes direction, and the dancer changes gender – interestingly, a new version of Unicode will allow the specification of gender in emoji.

The table below shows the different variants for some characters. In some cases the skin selection works, in others it does not: you can change the skin colour of the princess but not of the Japanese ogre. The DOS-era smiley face has no race. All these features kind of work on OS X, but there are some quirks: the Fitzpatrick modifiers seem to implicitly trigger the emoji variant, even when they cannot apply – and only as long as there is no line break between the character and the skin selector…

Text Emoji Type 1-2 Type 3 Type 4 Type 5 Type 6
💃︎ 💃️ 💃🏻 💃🏼 💃🏽 💃🏾 💃🏿
🏃︎ 🏃️ 🏃🏻 🏃🏼 🏃🏽 🏃🏾 🏃🏿
👸︎ 👸️ 👸🏻 👸🏼 👸🏽 👸🏾 👸🏿
👹︎ 👹️ 👹🏻 👹🏼 👹🏽 👹🏾 👹🏿
☺︎ ☺️ ☺🏻 ☺🏼 ☺🏽 ☺🏾 ☺🏿
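The seven columns of each row can be generated mechanically. A sketch (as the table shows, not every combination is meaningful; the function name is mine):

```python
# Seven presentation variants of a base emoji: text, emoji, and five skin tones.
VS_TEXT = "\uFE0E"   # VARIATION SELECTOR-15: text presentation
VS_EMOJI = "\uFE0F"  # VARIATION SELECTOR-16: emoji (colour) presentation
FITZPATRICK = [chr(cp) for cp in range(0x1F3FB, 0x1F400)]  # types 1-2 .. 6

def variants(base):
    """All seven presentation variants of a single base character."""
    return [base + VS_TEXT, base + VS_EMOJI] + [base + m for m in FITZPATRICK]

dancer = variants("\U0001F483")  # 💃 DANCER (U+1F483)
print(len(dancer))  # 7
```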



Double Escape

Bolt character with both ANSI color and Unicode variation selector

ANSI escape sequences are a mechanism to control the display of text in command-line tools. While this mechanism is quite old – it originated in the 80s – it is still somehow used nowadays, mostly to colour the text in terminals.

The use of control codes to format text has mostly died out, and so has the use of the ASCII control range (escape in particular) for such escapes. Nowadays people expect text formatting like colour, underlines and such not to be expressed in the text itself, but in another language like HTML.

It turns out the idea has not died out, but merely came back, as Unicode has the notion of escape sequences to control the appearance of characters. Some characters, like the ⚡ bolt (U+26A1), can be displayed in two modes:

  • ⚡︎ Text Style
  • ⚡️ Emoji Style

If you look at the source code of this page, you will notice that there is no formatting tag around these characters; instead, each is followed by a variation selector: U+FE0E selects the text variant, and U+FE0F selects the coloured, emoji variant. If you see only one type of bolt, your browser or operating system does not support variation selectors – if you see nothing, it is missing a font for that particular character.

Unicode variation selectors apply only to the single character they follow, whereas ANSI escape sequences mark a range, with a start and an end. Now the question is, how do they interact? To check this, I generated the bolt character in the seven simplest ANSI colours with both variation selectors. As you can see in the image, ANSI controls the font colour, which is honoured in the text variation and ignored in the emoji (colour) variation. This means that in a modern terminal, for certain characters, you can get 257 colour variations: 256 from ANSI and one from the emoji…
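The grid in the image can be regenerated with a few lines of Python; the SGR escapes (`ESC [ 31m` … `ESC [ 37m`) set the foreground colour, while the variation selector picks the glyph style. A sketch (run it in a colour-capable terminal to see the effect):

```python
# Combine ANSI colour escapes with Unicode variation selectors on U+26A1.
BOLT = "\u26A1"                 # ⚡ HIGH VOLTAGE SIGN
VS_TEXT, VS_EMOJI = "\uFE0E", "\uFE0F"
RESET = "\x1b[0m"               # SGR reset

lines = []
for color in range(31, 38):     # SGR foreground colours red..white
    sgr = f"\x1b[{color}m"
    lines.append(f"{sgr}{BOLT}{VS_TEXT} {BOLT}{VS_EMOJI}{RESET}")
print("\n".join(lines))
```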

Of course, you can get the same behaviour in a web browser:

⚡︎ ⚡️ ⚡︎ ⚡️
⚡︎ ⚡️ ⚡︎ ⚡️
⚡︎ ⚡️ ⚡︎ ⚡️



Word of the year 2015

😂

Once upon a time, I was an avid reader of Slashdot, with its slogan News for Nerds. Stuff that Matters. Over the years, the quality of the content has gone down significantly, and so has the discussion: it used to be that you could have a good conversation there, but that time is long gone. Nowadays I get my technical news from Ars Technica, but I still have my Slashdot account, and I still see the headlines in my RSS reader.

It is on Slashdot that I found the link to an entry of the Oxford Dictionaries blog claiming that the word of the year was, in fact, an emoji. You can argue whether an emoji is even a word; whatever they are, they are on the rise. What I found most interesting was the fact that Slashdot could not display that emoji.

There is a certain irony in seeing a site catering to geeks have such a technical limitation, but I think the problem runs deeper: many computer geeks are scared of Unicode. Unicode is hard, but so are many other things in computer science. Some aspects of computer geek culture are very normative, and one key aspect of that is language: English über alles. That notion is so strong it leaks into other geek domains; I once ended up with a group of French-speaking role-players being very astonished at the idea that roleplaying could actually be done in French.

I also feel there is a generation gap: the Slashdot readership grew up with ASCII as the text encoding, forgetting that there were different encodings before it, and that characters now used for coding were at some point exotic. The typical example is the curly braces used by the C language: they did not exist in every encoding – not, for instance, in PETSCII, the encoding of the Commodore 64. This is why many programming languages support digraphs and trigraphs.

Things change: the so-called hyper-operators in Perl 6 use non-ASCII characters, the French quotes « ». They can be replaced with the digraphs << and >>.



Control Codes

The mad programmer strikes again

I have a fascination for old computing standards, in particular the ones that are not completely dead, but still present in today's computers, yet largely forgotten. One such standard is ANSI escape sequences, which enable rich features in terminals, like coloured text. One aspect of these standards I did not understand was the C1 control codes, i.e. control characters above code 0x7F – that is, 8-bit control codes.

Even in today's Unicode standard, character codes 0 to 0x20 (C0) and 0x7F to 0x9F (C1) are reserved for control codes. Except for four code points – space, carriage return, line feed and form feed – the usage of the C0 control codes is not recommended by RFC 5198, and the usage of the C1 codes is explicitly forbidden. If you consider how UTF-8 works, this is a shame: 22% of the code points that can be expressed using a single byte are unused, and so are 1.6% of the code points that can be expressed using two bytes.
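The percentages can be checked with a quick computation. Counting the C0 block as 32 codes, minus the four sanctioned exceptions (the figures match the article's to rounding):

```python
# How much of the compact UTF-8 space do the banned control codes waste?
one_byte = 128            # U+0000..U+007F encode as a single byte in UTF-8
two_byte = 0x800 - 0x80   # U+0080..U+07FF encode as two bytes: 1920 code points

c0_unused = 32 - 4        # C0 block, minus the four sanctioned code points
c1_unused = 32            # C1 block U+0080..U+009F, all forbidden

print(f"{c0_unused / one_byte:.1%} of one-byte code points")  # ~22%
print(f"{c1_unused / two_byte:.1%} of two-byte code points")  # ~1.7%
```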

While most of the C0 codes are related to controlling devices and line parameters, the C1 ones seem more esoteric: they are mostly related to specifications that have died out. The only one that feels vaguely relevant is CSI (Control Sequence Introducer), which sounds like what ANSI would use to send control sequences. Then I realised that there is a thing called the 7-bit code extension technique, which basically lets a C1 code be encoded using 7-bit characters. This is done with the sequence ␛ followed by the C1 code minus 0x40. So CSI becomes ␛[, which is basically the preamble of most ANSI sequences.
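The extension rule is a one-liner; a sketch (the function name is mine) mapping each C1 code to ESC followed by the code minus 0x40:

```python
# Encode a C1 control code (0x80..0x9F) as its 7-bit ESC equivalent.
ESC = "\x1b"

def c1_to_7bit(code):
    """7-bit code extension: C1 code -> ESC + (code - 0x40)."""
    assert 0x80 <= code <= 0x9F
    return ESC + chr(code - 0x40)

CSI = 0x9B  # Control Sequence Introducer
print(repr(c1_to_7bit(CSI)))  # '\x1b[' -- the preamble of most ANSI sequences
```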

On Mac OS X, with Terminal.app set up in Unicode/UTF-8 mode, C1 codes are not active. This makes sense, as the only advantage of using C1 codes over their 7-bit extension equivalents is that they save a byte – except in the UTF-8 encoding, where both use two bytes.

Still, I was curious to see which ANSI features are supported by the terminal in Mac OS. One good way to do this is to download vttest and run the various tests. When I was at university, I used some real VT-220 terminals, so I expected the graphical features of these devices to mostly work. I found that the OS X terminal supports the following features:

  • Basic terminal text styling: bold, blink, underline, reverse video.
  • 256 colours.
  • Double sized text (first time I realised this feature was present).
  • dtterm window manipulation: move, resize, change title.

The way the double-size text is implemented is pretty neat: ␛#3 starts a mode where the upper half of the double-size text is printed, and ␛#4 starts a mode where the lower half is printed. This means that if the code is not supported, you get the text twice instead of a single big line of text. There is also a double-height, single-width mode with codes 1 and 2, but this is supported neither by OS X's terminal nor by xterm.
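The graceful degradation is easy to see by emitting the same line twice with the two escape codes; in an unsupported terminal you simply see the text twice at normal size. A sketch (the helper name is mine):

```python
# Double-size text: print the line twice, once as the top half (ESC # 3)
# and once as the bottom half (ESC # 4). Unsupported terminals show it twice.
def double_size(text):
    return f"\x1b#3{text}\n\x1b#4{text}"

print(double_size("BIG"))
```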


Waning Crescent Moon (🌘)

Cultural Moon

One of the strangest features of Unicode 8.0 is the ability to change the appearance of human faces by selecting a skin colour, as some perceived the default faces as having a cultural bias. It is pretty hard to design things without any bias; consider the characters for the moon phases: the shadow moves from side to side, which is what you perceive when you are far from the equator. The closer you are to the equator, the more the illuminated part of the moon is at the bottom, i.e. the more the crescent moon looks like this: 🌙. Latitude has a big influence on what you see in the sky…


Screen Capture of the Unicode Symbol Quiz

Do you Symbol?

A new version of Unicode is coming, and as often happens, people mostly talk about the new emoji. As usual, many will complain that these symbols are not proper characters, but if you look at the Unicode repertoire, there are many symbols that are not letters, and most of them are old – some older than computers, or than the United States: with a classical education you would understand most of them. So I built a small quiz for you to check your knowledge of old symbols. None of these are emoji or even related to Asia; in fact, people outside of Europe probably don't understand many of these symbols.

Note that some symbols have multiple meanings, i.e. multiple correct answers. These symbols were commonly used in printed matter: newspapers, timetables, maps. There are, of course, many other symbols; I restricted myself to signs I have seen in use and whose meaning I understand. There are some traps, i.e. symbols which are graphically very close.

What I find interesting is that this set of symbols is pretty diverse and has very different origins: mythology, astronomy, electrical engineering, games, etc.

You can leave your score in the comments…

Edit: the quiz now shows the correct answers when you calculate the score.
