Back to LaTeX

LaTeX page layout for the city of Ringstadt

A long time ago, I was quite involved in roleplaying. I played a lot of one game a friend wrote, Tigres Volants, and created some material for that setting. I recently decided to gather the material related to one city, named Ringstadt, lay it out, and publish it as a PDF.

Maybe because this was old material and I was feeling geeky, I decided to use LaTeX again. I used that tool a lot during my academic years, and I felt it would be a convenient way to make a simple layout out of some old text files. It would also let me use a version control system for the source. The source text is in RTF format, which I recovered from Word for Macintosh 5 files by running the whole thing in an emulator.

Converting from RTF to LaTeX sounded simple in theory. In practice, the first two command-line tools I tried crashed with a segfault. The third (unrtf) worked, but converted all non-ASCII characters to LaTeX escape sequences, so I had to fiddle with the configuration to get something readable. The text is in French, which means many accents, and I really wanted source code I could proofread.

The good news is that LaTeX did not change much in the ten years since I left academia. The bad news is that LaTeX did not change much. There are many things I stopped worrying about when using computers: text encodings, font management, image formats. LaTeX is pretty much stuck in the 90’s: just to handle an input file in UTF-8 format, you need the following packages:


Do you know what the T1 code means? It is the encoding of the font. It also means that while you can input your text in UTF-8 format, LaTeX will not support Unicode: if your input contains a character that is not part of the T1 table, say a double-ended arrow, compilation fails with an obscure error message. If I want to use Cyrillic characters, I’ll need to load another code page. I don’t think there is a way to tell LaTeX to just handle Unicode.

Error handling is another aspect where LaTeX stayed in the 90’s. I remember the error messages from GCC at that time; they were not helpful either. Nowadays there is clang, which gives you helpful error messages in colour, with hints.

I just wanted to do a page layout with the Helvetica font, French text, images, and floating boxes. I ended up with a header that includes 20 packages. Things kind of work (see the image), but floats are basically broken in LaTeX: you need to do everything manually, and they break at the first difficulty, such as page breaks or footnotes. LaTeX manages the impressive feat of having floats that are more broken than in HTML (I use the wrapfig package).

We are not talking about an exotic feature: just boxes with text flowing around them. You see that in any magazine and on many web pages. I was able to implement this in Word 5.1, more than 20 years ago, and it worked more reliably than what I get with LaTeX. Apple’s Pages software, which I usually use for light word processing, can even handle non-rectangular floats, using the alpha channel of the image as a guide. You can also overlay floats on top of each other.

The main argument in favour of using LaTeX is that it does the right thing by default, but for French, even with the babel package, this is not really true. LaTeX will insert a space before double punctuation marks, which is ugly; the proper thing to do would be to insert a half-space. I could probably hack one package or another to get that result.

What struck me is how isolated LaTeX is from other systems: it does not use any operating system services for text processing, font management, image processing, or rendering, so you end up with a very big system (2GB install) that is its own, old thing. Most of the things I complain about in this post were already mentioned in a wish-list post, 10 years ago.

I basically did one chapter of the document, and I’m faced with a simple choice: go on fighting with LaTeX, knowing that the final layout will be pretty mediocre, or swallow my pride and just redo everything in Apple Pages…


The web-site of Matthias Wiesmann. Welcome to the old-school part of my web-site, entirely written in vi, just like 20 years ago. You can visit: my blog (Thias no Blog); About me; the image format test page; the page describing the RPG scenarios I wrote for a French game, Rêve de Dragon.

Formatting systems


While many devices run some form of graphical user interface or another, the command-line tool is far from dead: it is the main interface both for configuration and for hacking systems together. What I find fascinating is that ANSI control sequences reached legacy status a long time ago. Look at the set of tools you invoke in the terminal: only a few ever got support for colour output. At the same time, another formatting language with escape codes has become prevalent: HTML.

While in theory you can implement a command-line web browser, like lynx, the web nowadays is more pixel-centric than character-centric: there are images, proportional fonts, compositing, and shadow effects. Web navigation is very mouse-centric: you can add keyboard shortcuts to a web page, but this feature is rarely used. The only concrete text feature present both in ANSI codes and in the original HTML spec is underlined text, a feature that is rarely used in either system because it hurts legibility. Tim Berners-Lee’s choice to build a new language instead of reusing ANSI codes made sense: the system was complex, with varying levels of support and weird features. A lot like what HTML is today.

While you can implement the web equivalent of most desktop applications, the complexity increases as you move away from the paper form. While in theory web applications are cross-platform, more often than not they are unusable on devices that are not computers. I have many devices that can run a web browser: laptop, tablet, phone, gaming console. Many web applications are only usable on the first. Note that the problem is not only computing power, but also the assumption that there is a pointing device and a keyboard.

Given the complexity of web development, it is hardly surprising that attention is moving away from the browser to mobile apps. It is probably too early to say whether this is a fad or the next evolution, but what is clear is that the basic assumptions of the web (a pointing device with a keyboard, and a large screen containing text and static images) are less and less appropriate.



HTML5 Logo by World Wide Web Consortium

After a long stagnation, HTML, the standard underlying most of the web, has finally started moving again, with a version 5 that is gradually taking hold. Where HTML4 tried, without success, to impose a formal system with a tidy grammar, HTML5 instead tries to add the capabilities the system was missing. One of the interesting additions is microdata (µ-data). The idea is to be able to add machine-readable information to a web page.

Why is this useful? Most web pages are, one way or another, read by algorithms, be they search engines or social networks. Whether the goal is to include the page in an index or to share a link and display a thumbnail, it helps the system to know what a web page is about. Is it an article? Who is the author, what is the title, which are the images? Is it a review? What is being reviewed? What rating does the review give? Even though great progress has been made in automatic text processing, no system can currently extract this information reliably.

The solution is simple: add this information, the metadata, to the web page in a structured form that can be analysed automatically. This is not a new idea: a variety of systems have been proposed for this purpose over the years, starting with the venerable meta tag, which was abandoned because unscrupulous web sites abused it so much.

The RDF format, proposed in 1999, is a system for describing metadata, but a way of including this information in a web page was only standardised in 2008. This encoding is named RDFa; it partly relies on existing HTML attributes, but also introduces its own.

Imagine that I run a web site presenting role-playing games, which I will call the Confrérie des Rôlistes Interstellaires (CRI). Here is how I would present the role-playing game Tigres Volants in RDFa format. First I need to choose a vocabulary, i.e. a set of attributes that carry meaning for the subject I am going to talk about. Here we are talking about a book, so I choose the Dublin Core vocabulary, abbreviated dc.

<html xmlns="" xmlns:dc="" xmlns:biblio="">

<div xmlns:dc="" about="urn:ISBN:9782847890525" typeof="biblio:book">
Title: <span property="dc:title">Tigres Volants</span><br/>
Author: <span property="dc:creator">Stéphane Gallay</span><br/>
Publisher: <span property="dc:publisher">2D Sans Faces</span><br/>
</div>

Before I can say anything, I must therefore import the vocabularies I use. In the header I declare that the prefix dc corresponds to the Dublin Core vocabulary, and biblio to the vocabulary for bibliographies (for the type book). The HTML excerpt makes explicit that we are talking about a book with ISBN 9782847890525, with the title Tigres Volants and Stéphane Gallay as author, published by 2D Sans Faces. It is possible to specify other attributes, but this quickly gets complicated, because each domain requires importing a new vocabulary, and there are many of them, for example the one from Facebook.

The idea of microdata is to simplify this system by giving up the genericity of XML and integrating better with HTML. As before, you have to choose a vocabulary, but this time there is a general-purpose vocabulary that describes most things fairly well. The same excerpt would then look like this:

<div itemscope itemtype="">
Title: <span itemprop="name">Tigres Volants</span><br/>
Author: <span itemprop="author">Stéphane Gallay</span><br/>
Publisher: <span itemprop="publisher">2D Sans Faces</span><br/>
ISBN: <span itemprop="isbn">9782847890525</span><br/>
</div>

The big advantage of the microdata solution is that the format is simpler and the vocabulary unified and better documented: to find out which attributes a type supports, you just visit the URL that defines the type, here the book type.
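To give an idea of how little machinery a consumer of such markup needs, here is a minimal sketch in Python, using only the standard library's html.parser module. The class name ItemPropParser and the helper extract_properties are made up for this illustration, the itemtype URL is left empty as in the excerpt, and the parser only copes with flat markup like the excerpt. Real microdata processing (nested items and so on) is more involved.

```python
from html.parser import HTMLParser


class ItemPropParser(HTMLParser):
    """Collects the text of every element that carries an itemprop attribute."""

    def __init__(self):
        super().__init__()
        self.properties = {}
        self._current = None  # itemprop name of the element being read, if any

    def handle_starttag(self, tag, attrs):
        name = dict(attrs).get("itemprop")
        if name is not None:
            self._current = name
            self.properties[name] = ""

    def handle_data(self, data):
        if self._current is not None:
            self.properties[self._current] += data

    def handle_endtag(self, tag):
        # Assumption: flat markup only, so any closing tag ends the property.
        self._current = None


# Same structure as the excerpt; the itemtype URL is left empty on purpose.
SNIPPET = """
<div itemscope itemtype="">
Title: <span itemprop="name">Tigres Volants</span><br/>
Author: <span itemprop="author">Stéphane Gallay</span><br/>
Publisher: <span itemprop="publisher">2D Sans Faces</span><br/>
ISBN: <span itemprop="isbn">9782847890525</span><br/>
</div>
"""


def extract_properties(html):
    """Return a dict mapping each itemprop name to its text content."""
    parser = ItemPropParser()
    parser.feed(html)
    parser.close()
    return parser.properties


if __name__ == "__main__":
    for key, value in extract_properties(SNIPPET).items():
        print(f"{key}: {value}")
```

The labels in front of each span (Title, Author and so on) are ignored: only the annotated values end up in the dictionary, which is the whole point of the annotation.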

There is another, parallel system called microformats (nothing is simple), which relies solely on the class attributes used by CSS. It has the advantage, minor in my opinion, of not introducing any new attributes, and the drawback of being more complex and potentially interfering with the style sheet.

All right, but what does it do in practice? Mainly, it lets search engines display results in a more relevant way: if a page contains an event, they can show the dates; if it is a review, they can show the score, for instance as stars. More generally, it lets search engines display information about a subject independently of any web page, for example knowing that Tigres Volants is a role-playing game written by Stéphane Gallay.

A Few Tools


Screen Capture from the document HTML predefined icon-like symbols

Escape Sequences

1994 proposed named escape sequences vs. Unicode characters

Original          Unicode   Symbol
&audio;           1F509     🔉
&cd.rom;          1F4BF     💿
&clock;           1F557     🕗
&diskette;        1F4BE     💾
&display;         1F4BB     💻
&document;        1F4C4     📄
&fax;             1F4E0     📠
&folder;          1F4C1     📁
&home;            1F3E0     🏠
&index;           1F4C7     📇
&keyboard;        2328      ⌨
&mail;            1F4E7     📧
&;                1F4E5     📥
&mail.out;        1F4E4     📤
&next;            2398      ⎘
&notebook;        1F4D3     📓
&previous;        2397      ⎗
&printer;         2399      ⎙
&sadsmiley;       1F61E     😞
&smiley;          1F604     😄
&telephone;       1F4DE     📞
&text.document;   1F4DD     📝
&trash;           267B      ♻

One early proposal for the HTML standard, by Bert Bos, was to have named entities for often-used icons (folders and such). The list of proposed escape sequences was closely modelled on the icon sets of the day: HyperCard, Gopher, etc.

Some time ago there was discussion on this list about defining a set of standard icons for things like Gopher types, “home” buttons, etc.
The discussion didn’t reach a conclusion. Below is a proposal.
Reactions please! The text and two sets of example icons are also available at:

The idea never took off, but the irony is that, 18 years later, many of these symbols are now part of various blocks of Unicode, including emoji. So the only difference between now and the 1994 proposal is the type of escape sequence: Unicode numbers vs. named entities. It is one of the rare cases where something ends up being implemented one level down in the abstraction stack: not in the browser, but in the text rendering system of the operating system (schematically at least; Chrome seems to be doing strange things with emoji).
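As a rough illustration of how close the two notations are, the sketch below maps a handful of the proposed entity names to the code points listed in the table and rewrites them as numeric character references. The dictionary and function names are made up for this example; only the entity names and code points come from the table.

```python
# A few of the 1994 entity names with the code points from the table above.
PROPOSED_ICONS = {
    "audio": 0x1F509,
    "folder": 0x1F4C1,
    "home": 0x1F3E0,
    "keyboard": 0x2328,
    "mail": 0x1F4E7,
    "telephone": 0x1F4DE,
}


def to_numeric_reference(name):
    """Turn a proposed entity name into an HTML numeric character reference."""
    return f"&#x{PROPOSED_ICONS[name]:X};"


def to_character(name):
    """Return the Unicode character that now stands for the proposed icon."""
    return chr(PROPOSED_ICONS[name])


if __name__ == "__main__":
    for name in PROPOSED_ICONS:
        print(f"&{name}; -> {to_numeric_reference(name)} -> {to_character(name)}")
```

Whether the character actually shows up as an icon still depends on the fonts installed on the machine doing the rendering.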

Edit: here is the W3C proposal


A Beginner’s Guide to HTML

NCSA Mosaic Logo

One of the basic ideas of the web is that you do not copy data to your machine; instead you keep pointers (URLs) to stuff. The underlying assumption is that said pointer will remain valid in the future. When I first learned to do HTML, I clearly remember printing out the page A Beginner’s Guide to HTML written by Marc Andreessen, so I could work offline, at home. I had dialup access, but loading a web page over a modem was just too slow. As the internet became faster and more prevalent, I stopped keeping paper versions of documentation, and moved on to more modern HTML features, like the ones supported by Netscape. I eventually lost the paper version.

In response to assorted requests and queries, I have written a simple “Beginner’s Guide” to writing documents in HTML. It’s up for grabs at at the moment; comments are welcome (but no complaints about my coverage or use of the IMG tag that Mosaic supports; it’s important internally).
The guide also points to a rudimentary primer on URL’s that might be of interest to Web beginners (certainly the number of people who have sent me Mosaic bug reports saying “URL ‘’ doesn’t connect to the ftp server”, etc., would seem to indicate that basic knowledge of URL’s is not yet a given on the net).

Finding that particular guide again was not completely trivial, so I’m now putting another mirror online here. This document confirms what I outlined in my post about image formats: what was considered a reasonable image file in those days no longer is. We are still struggling with the format of video data, although that subject was already touched upon in 1993. In a sense, HTML5 is much closer to the spirit of HTML 1, with various teams trying to get something done instead of having a nice formalism.


Shifting vs. Scaling


With the advent of the web, colours expressed in numerical form have become much more prevalent. While CSS supports named colours, the most common way is to use the hexadecimal RGB (Red Green Blue) format. A pure red colour would be represented as #FF0000 or #F00. The first format represents the colour as three 8-bit values, the second as three 4-bit values. People often use the 4-bit format because it is shorter and the differences are very minor.
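In CSS the short form is expanded by repeating each hex digit, so #F00 stands for #FF0000. A minimal sketch in Python of that expansion (the function names are mine, purely for illustration):

```python
def expand_short_hex(colour):
    """Expand a '#RGB' colour to '#RRGGBB' by doubling each hex digit."""
    digits = colour.lstrip("#")
    if len(digits) != 3:
        raise ValueError(f"expected three hex digits, got {colour!r}")
    return "#" + "".join(digit * 2 for digit in digits).upper()


def parse_hex(colour):
    """Return the (red, green, blue) channels of a colour as 8-bit integers."""
    digits = colour.lstrip("#")
    if len(digits) == 3:
        digits = expand_short_hex(colour).lstrip("#")
    if len(digits) != 6:
        raise ValueError(f"expected three or six hex digits, got {colour!r}")
    # Each channel is two hex digits: positions 0-1, 2-3 and 4-5.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


if __name__ == "__main__":
    print(expand_short_hex("#F00"))  # the long form of the short colour
    print(parse_hex("#F00"))         # the three 8-bit channel values
```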

If you look at the box on the side, it contains sixteen red blocks, each specified with a slightly different shade of red, the darkest at the top-left corner, the lightest at the bottom-right. Many mobile screens cannot display 24-bit (3×8) colour, but handle a more conservative 15 bits (3×5). The old web-safe colour table further restricted the set of colours to six hex values per channel: 0, 3, 6, 9, C and F, for a total of 216 colours.
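The figure of 216 is simply six levels for each of three channels (6 × 6 × 6). As a quick check, here is a small Python sketch that enumerates the web-safe table; the names WEB_SAFE_LEVELS and web_safe_palette are invented for this example.

```python
from itertools import product

# The six hex digits allowed per channel in the web-safe table.
WEB_SAFE_LEVELS = "0369CF"


def web_safe_palette():
    """List every web-safe colour in '#RRGGBB' form."""
    return [
        "#" + "".join(digit * 2 for digit in channels)
        for channels in product(WEB_SAFE_LEVELS, repeat=3)
    ]


if __name__ == "__main__":
    palette = web_safe_palette()
    print(len(palette), palette[0], palette[-1])
```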

Converting from 8-bit data to 4 bits is very easy: you just shift right and drop the superfluous bits, so the binary 11110000 becomes 1111. How do you do the reverse conversion? The naive version is to simply shift to the left, so 1111 becomes 11110000. This is a very bad solution. Imagine you are converting from a 2-bit-per-channel model to an 8-bit-per-channel model: you convert 11 to 11000000 (192 out of 255), so a block at full red would come out as a darker red instead of bright red.

So instead of shifting, the proper way is to scale. This can of course be done mathematically: if c is your 4-bit colour, then your 8-bit colour is 255 × (c ÷ 15). The quick way to do this with bit manipulation is to replicate the short bit sequence across the long format. In the 2-bit example, the new binary value would be 11111111: first the shifted bits 11, then the same bits repeated for each remaining sequence of two bits.
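Here is a small Python sketch comparing the three approaches: the naive shift, bit replication, and exact scaling. The function names are my own. When the narrow width divides the wide one evenly (2 or 4 bits into 8), replication gives exactly the same result as the arithmetic scaling; for other widths the sketch simply truncates the surplus bits, which is one possible choice among several.

```python
def shift_up(value, from_bits, to_bits=8):
    """Naive widening: shift left and leave the low bits at zero."""
    return value << (to_bits - from_bits)


def replicate_up(value, from_bits, to_bits=8):
    """Widen by repeating the narrow bit pattern until the field is full."""
    result = 0
    filled = 0
    while filled < to_bits:
        result = (result << from_bits) | value
        filled += from_bits
    # Drop surplus low bits when to_bits is not a multiple of from_bits.
    return result >> (filled - to_bits)


def scale_up(value, from_bits, to_bits=8):
    """Widen arithmetically: map the full narrow range onto the full wide range."""
    max_in = (1 << from_bits) - 1
    max_out = (1 << to_bits) - 1
    return round(value * max_out / max_in)


if __name__ == "__main__":
    for name, func in (("shift", shift_up), ("replicate", replicate_up), ("scale", scale_up)):
        print(f"{name:9s} 2-bit 11 -> {func(0b11, 2):08b}")
```

With the shift, the brightest 2-bit value never reaches the brightest 8-bit value; with replication or scaling it does.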


HTML5 for old-school coders

HTML has come a long way since it was just a page description language, served either from static files or from a CGI script. Still, a lot of software engineers use it in exactly that way, disregarding most of the recent changes in the language and sticking to a minimal set of features. This makes a lot of sense when building simple tools: by relying only on the basics, one is sure to have something that runs fine on every device, be it the lynx browser or a low-powered phone.

Version 5 of HTML is now becoming quite stable, and the good news is that quite a few features can be used for minimal web interfaces: features that do not require any JavaScript, and that fail gracefully in older browsers.

New form input types
Forms now support new input types: email, url, search, number and range. These fall back to input type text on older browsers, but on newer ones, in particular on mobile phones and tablets, they display different, specialised inputs. For some types (number in particular) desktop clients also perform additional checks, for instance that the number falls within a specified range.
Placeholder attribute
Input controls in forms now support a placeholder attribute, which adds greyed-out placeholder text to the field, a good way to give the user an example. Older browsers will just ignore this attribute.
Quick Example

This is boring stuff that should be hidden away by default

Details & summary
HTML5 provides tags that let you collapse the part of the page that contains details and only show the summary. Just enclose the whole block of text in the details tag, and the summary in the summary tag. Older browsers will just ignore both, while newer ones will display a collapse widget.

Meter 50%
Progress 50%
Meter and Progress
Do you want to display a gauge or a progress bar? There are tags for this. For meter you specify the min, max and value attributes (progress only takes max and value), and you put the fallback display text between the opening and closing tags. Older browsers will show the fallback text (say 50%) while newer ones display the gauge or progress bar. Note that the min and max attributes are also supported by the number and range inputs mentioned above.

Of course, there is a lot more to HTML5, but if you just do old-style HTML coding, these are the features you can use without thinking about JavaScript or complex document description.
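In the spirit of the old CGI script, here is a minimal sketch in Python of helper functions that emit the markup for these features. All function names are invented for this example, and the field names and placeholder text are arbitrary. Note that in the HTML specification meter accepts min, max and value, while progress only accepts max and value, so the two helpers differ accordingly.

```python
from html import escape


def input_field(name, input_type="text", placeholder=None, **extra):
    """Build an <input> tag; older browsers treat unknown types as text."""
    attrs = {"type": input_type, "name": name}
    if placeholder is not None:
        attrs["placeholder"] = placeholder
    attrs.update(extra)  # e.g. min=1, max=10 for number and range inputs
    rendered = " ".join(
        f'{key}="{escape(str(value))}"' for key, value in attrs.items()
    )
    return f"<input {rendered}>"


def details(summary, body):
    """Collapsible block: newer browsers show only the summary by default."""
    return f"<details><summary>{escape(summary)}</summary>{escape(body)}</details>"


def meter(value, minimum, maximum, fallback):
    """Gauge; the fallback text is what older browsers display."""
    return (
        f'<meter min="{minimum}" max="{maximum}" value="{value}">'
        f"{escape(fallback)}</meter>"
    )


def progress(value, maximum, fallback):
    """Progress bar; unlike meter, progress has no min attribute."""
    return f'<progress max="{maximum}" value="{value}">{escape(fallback)}</progress>'


if __name__ == "__main__":
    print(input_field("mail", "email", placeholder="you@example.org"))
    print(input_field("quantity", "number", min=1, max=10))
    print(details("Quick Example", "This is boring stuff that should be hidden away by default"))
    print(meter(50, 0, 100, "50%"))
    print(progress(50, 100, "50%"))
```

Since every helper degrades to plain text or a plain text field in an older browser, the generated page needs no feature detection at all.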


Image Formats

Telefunken PAL test pattern in XBM image format

One fascinating aspect of HTML is the fact that a web page written for a web browser released 19 years ago will still display fine (pages designed for later browsers are much more of a problem). Today we take the ability to have images within a web page for granted, but it was the big innovation of the Mosaic web browser. The tag for inserting images into web pages was proposed in a simple e-mail from Marc Andreessen:

I’d like to propose a new, optional HTML tag:
Required argument is SRC="url".
This names a bitmap or pixmap file for the browser to attempt to pull over the network and interpret as an image, to be embedded in the text at the point of the tag’s occurrence.
An example is:
<IMG SRC="file://">
(There is no closing tag; this is just a standalone tag.)
This tag can be embedded in an anchor like anything else; when that happens, it becomes an icon that’s sensitive to activation just like a regular text anchor.
Browsers should be afforded flexibility as to which image formats they support. Xbm and Xpm are good ones to support, for example. If a browser cannot interpret a given format, it can do whatever it wants instead (X Mosaic will pop up a default bitmap as a placeholder).

While the tag was universally adopted, the image formats he suggested were not. No browser I know of supports XPM images, and XBM support is far from universal. If you do not see an image in this post, your browser does not support XBM. This made me curious, so I created a small image test page that contains the same image (the Telefunken PAL test pattern) in various formats. The images range from esoteric formats like Targa, to old workhorses like TIFF, to the newest proposals like WebP. As far as I know, no browser manages to display all the images.


AEI Logo – Matthias Wiesmann - 1996

Ghost Pages


A long time ago, in 1996, when I was still at the University of Geneva, I was part of the student association for computer science (Association des Étudiants en Informatique) and wrote some web pages for said association. Those pages are still accessible by way of the web archive. It is interesting to consider these pages 14 years later:

  • The comments, with my signature use the old style syntax, they mention the use of standard tags and Netscape 2.0 extensions, a new thing at the time.
  • The JavaScript scroller definitely dates the whole thing.
  • The logo, created using the Canvas software, is very amateurish, but the transparent background means that it still looks ok with a white background (in those days, the default background was gray).
  • Tables were the new thing, and I used most of the features available at the time to do the courses time tables. I overdid the larger borders…
  • All the accents are done using escape sequences; at that time encodings were poorly supported and generally a mess. One of the accessory pages uses the Mac-Roman encoding, without any header information…
  • The general layout works nicely even with newer, larger displays, as everything in it is relative.
  • Everything works on my iPhone, which is way more powerful than the computers I used at the time.


Meanwhile, on the Web

NCSA Mosaic Logo

One of the phenomena I have been able to follow almost from the beginning is the development of the web. A lot has happened since the days when I used Mosaic on the machines of the University of Geneva. While I have never been a web designer, I have regularly tinkered with HTML and followed the evolution of the language.

The first version of HTML was a logical description, with little or no page layout. You could sense a strong influence from LaTeX, notably in the list structures. Subsequent versions mostly added layout tags, so the logical aspect was quickly drowned in the mass. CSS and the emergence of metadata have finally started to reverse the trend.

I myself use CSS more and more to lay out my blog. The layout of the kanji table is controlled entirely by CSS. I also try to add as much metadata as possible, notably for acronyms. I also finally took the time to see how the site looks in other browsers; in general, the result is rather good. The site’s code now also validates with iCab’s validation engine. I regret that WordPress does not yet insert the prev and next links in the header, but at least the links to the archives are already there.

There are two extensions to HTML that I would like to see more widely supported: MathML and Ruby. The first is a language for expressing equations. The advantage would be to finally avoid representing equations as images, and I think it would help a lot with the online publication of scientific articles.

The purpose of the second is more exotic: it generalises the idea of furigana. Furigana is the practice of writing a phonetic text above another text. It is typically used in Japan to explain the reading of kanji, but the idea can in fact be generalised, for example to acronyms. I find this subject interesting, first because such a structure would help me on my site, but above all because it is a way of representing text in a structured, non-linear manner. Unfortunately, this system is currently only supported by Microsoft, and the open-source projects are lagging behind.
