Associative memory

Mental Graph

One thing that makes communication very difficult is that people have very different thought processes. Any explanation or logical argument can be undermined because the trains of thought of the two people in the discussion have followed different tracks. While I would not describe myself as an illogical person, my thought processes tend to be very associative: I have a very bad memory and I remember everything by association, so my mind wanders along the connections of my memory graph.

One example of how this works is the graph around the word apple. I don’t remember many English words with Germanic roots as separate words, but rather as variants of their German counterparts, in this case Apfel. From there you get to the North German word Apfelsine, which is not an apple but an orange, and was originally an apple from China. Orange is a good word because it is used in many languages: French, German (in the south) and English. The word stays similar in Japanese: オレンジ (ORENJI).

The word for apple in French is pomme. There is a similar word in Italian, pomo, but it is not used for apples: that word is mela, which stems from malus, which is apple in Latin, but also means bad (as in malus/bonus). From there you hit the whole evil fruit thing with Adam and Eve. If you backtrack to pomo, there is a derived word which basically means apple of gold: pomodoro, a tomato.

Apple is the name of a record label which publishes the music of the Beatles: Apple Records. The drummer of the Beatles is Ringo Starr; interestingly, apple in Japanese is リンゴ (RINGO). This is probably not a coincidence, given the presence of 洋子 小野 (Yoko Ono) around the Beatles. In her name, the kanji 洋 (YO) means ocean, but is also used to designate Western-related things, like culture or food. That kanji is built up from the water radical (⺡) and the kanji for goat (羊).

Ringo is usually written in katakana; the kanji form is 林檎 (リンゴ). The first kanji is 林, which means woods, and is basically tree (木) written twice. The second one, 檎 (GO), is not used elsewhere in Japanese; it just means fruit or red apple in Chinese, where it is pronounced qín.

There is another company called Apple (the two had a lot of fun suing each other), which produces computers; the best known is probably the Macintosh. Interestingly, McIntosh is also a kind of apple. Apple also used to build a PDA called the Newton, named after a physicist who became famous for getting hit by an apple (and some stuff about gravitation and derivation). Nowadays Apple produces a lot of computers in China, but they are not called oranges.

One really weird apple-derived word is pineapple, first because these things really don’t look like apples, second because most of the planet agreed to call these things ananas. French has pomme de pin, but those are the cones of pine trees. In Swiss French these are called pives.

Evil CSV Result

CSV Parsing

Inigo,"You killed my father
Darth, I am your father"
Evil Guy,""";drop table"
"Expert", "Trust me, I'm an expert"
Balmer,"""Developers, Developers"""
Yoda,Do,do not try
Me,"Do not
quote me, please"

I recently wrote about the complexity of the CSV format. Many people think the format is well defined and well understood, so I thought I would build an example file highlighting the complexity of such data. You can download the original CSV file.

The goal of the game is simple: tell me what is the correct parsing of that data, how many lines, how many columns, and their content. Of course, you can open the file with some tool, but that’s cheating, and you will have to trust the tool to do the right thing…
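For readers who want to cheat anyway, here is a minimal sketch of what one tool makes of the data: Python's standard `csv` module with its default dialect. The sample is copied from the listing above (the downloadable file may differ, for instance in its line endings), and the result is only this parser's opinion, not the answer to the game.

```python
import csv
import io

# Sample data copied verbatim from the listing in this post.
SAMPLE = '''Inigo,"You killed my father
Darth, I am your father"
Evil Guy,""";drop table"
"Expert", "Trust me, I'm an expert"
Balmer,"""Developers, Developers"""
Yoda,Do,do not try
Me,"Do not
quote me, please"
'''

# newline='' keeps the line breaks inside quoted fields untouched.
rows = list(csv.reader(io.StringIO(SAMPLE, newline='')))

# Print the record number, the number of fields, and the fields themselves.
for number, row in enumerate(rows, start=1):
    print(number, len(row), row)
```

With the default dialect the quote after the space on the "Expert" line is not treated as an opening quote, so that line is split at the comma inside the sentence. Passing `skipinitialspace=True` changes that reading, which is exactly the kind of variation the game is about.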

A young woman in a white dress wearing a red glowing ring

Eternal Vows

While I use technology a lot, I have only just started using ebooks, mostly because my backlog of paper books is sufficient to keep me busy for some time. Pretty randomly, I discovered that Eternal Vows by Chrissy Peebles was free, so I gave it a try [spoiler alert, but hey!]. I read it on three devices: my laptop, my Nexus 7 tablet, and my iPhone 5. The Kindle application on those various devices is not very advanced, lacking even basic copy-paste, but it works: the text, including the reading position in the book, was synchronised between the devices. I had the feeling that screen pixel density has a very large impact on the reading experience, larger than the screen size. My laptop has 50 pixels per centimetre, the tablet 85, the phone 128.

Eternal Vows

ISBN: 978-1-4841-3167-1

I read quite a lot of fantasy books when I was younger, and the archetypal story is that a random young guy gets projected into some fantasy world, where he has various adventures. Eternal Vows follows the same basic structure, but the main character is a woman. Pretty early in the story she puts on a magical ring that gives her magical powers, but that she cannot get rid of, and she looks for a way back to her own universe.

The book is described as paranormal romance and fantasy adventure on Amazon. Personally, I would describe it as non-geeky fantasy: while the text has all the superficial bits of fantasy, the fantasy part is quite secondary, and neither the characters nor the author seem to actually care for it. After reading the book, I cannot really say much about the world the heroine entered: it is medieval, people speak and understand modern American, there is some kind of church, and there is some magic, reserved for an immortal elite.

Clearly I’m not the intended readership of this book, quite the opposite in fact, but it is pretty interesting to see this variant of the archetypal fantasy story. The main character, Sarah, is a scientist, but she does not work on something that would be mentioned in Scientific American; instead she is looking for Bigfoot. Science is not really a driver for her; in fact she is just looking for her lost sister. Readers can rest assured that the book is pretty devoid of sciency bits after chapter 1.

Sarah is supposed to be the leader of a team, but her leadership skills are pretty low: she basically says no a lot, rolls her eyes, and reminds her team members that she is paying them, while they are in another dimension with no way back. Besides that, she mostly agonises about the situation and does what the male characters tell her to do.

The core plot point of the book, and I suspect of the following ones, is that Sarah weds Victor, some immortal lord, and puts on a magical ring that makes her immortal and capable of magic. Her decision to do this is not of her own devising, but rather an idea of her sceptical journalist ex-boyfriend, Frank, who also got projected into this world.

While Frank is presented as a jerk, he seems to be the only character with some skills: he asks questions about the universe they are in, organises a rescue for Sarah, and generally tries to do things. The central plot is basically Sarah being torn between Frank who is, as his name indicates, the honest good guy™, deep down, and Victor, who is basically the immortal, century-old über-winner. Both guys are of course hot, but Victor more so.

Once the story is past the dimension rift and the wedding, it is just a sequence of chases with some interludes of bickering between Sarah, Frank and various members of her team; a lot of emoting is involved. They are helped by various locals, luck and the magical ring, which is a good thing, because Sarah and her team have the general skill level of lost US tourists, even though the author conveniently whisked away the whole language issue.

On the other hand, Victor, who is supposed to be a century-old tyrant, is not very good at recovering his runaway wife. Once she puts on the magical ring, he is quite smitten with her, and tells her telepathically that he admires the fact that she stood up to him (by agreeing to marry him after less than a day in a dungeon). So he basically follows her remotely in the pretty random quest Sarah sets for herself at the end of the first book.

Generally this book reminded me of the Alexia Tarabotti story I read (in French), and while I cannot claim that Gail Carriger’s writing is good (I read a translation), I found the style of Chrissy Peebles very weak. The plot, the characters and the universe are pretty shallow, and even the emotions are underwhelming: a lot of them are thrown around, but given how uninteresting the characters are and how little they actually do, it is difficult to care. By the end of the first book, I really wanted Sarah to move in with Victor and get it over with… In conclusion, an interesting read in the academic sense, but a pretty lame book.

A Kitten

The First Unified Church of Kittens

A tortoiseshell kitten

One thing that is relatively complicated to explain is what people do on social networks in general, and on Facebook in particular. The shortest explanation is that they share pictures of landscapes, babies and kittens. Put that way, it sounds like a waste of time, but one can also see it as a form of communion.

To commune, verb. To be in spiritual or emotional union with other people, to share a condition, a feeling.

These consensual images mean that everybody can, for a moment, agree. Communing and exchanging gossip: we are clearly in the market of the traditional church, with a simplified theology as a bonus: no god who is one but three, no son of god and carpenter nailed up to atone for our crimes, just kittens…

Xuxa the Kitten #1 © Ryan Poplin, Creative Commons Attribution-ShareAlike 2.0 Generic.

CSV Files

Comma-separated values (CSV) files are one of the most common ways of passing data around. While there is an RFC describing the format, it is extremely fuzzy, and there are many subtle variants of the format. Many people get them wrong, simply assuming they can join the values with comma characters and the lines with carriage returns, maybe adding quotes around each value; this is bound to fail, as textual data typically contains quotes, commas, and carriage returns. So we get numerous bugs because some program does not properly escape characters, or some other does not properly decode escaped characters.
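The failure mode is easy to reproduce. The sketch below, which uses made-up data, joins a record naively and then does the same thing through Python's standard `csv` module, which quotes fields and doubles embedded quotes.

```python
import csv
import io

# A made-up record whose second value contains quotes, a comma and a line break.
record = ['Evil Guy', 'He said "hi", then left\nfor good']

# Naive approach: join the values with commas, then read the "first line" back.
naive_line = ','.join(record)
naive_fields = naive_line.split('\n')[0].split(',')

# csv.writer wraps the value in quotes and doubles the embedded quotes.
buffer = io.StringIO(newline='')
csv.writer(buffer).writerow(record)
encoded = buffer.getvalue()

# Reading the properly encoded text back gives the original record.
decoded = next(csv.reader(io.StringIO(encoded, newline='')))

print(naive_fields)
print(decoded == record)
```

The naive version silently turns two values into three and loses the end of the text, while the round trip through the `csv` module is lossless, at least as long as both sides agree on the dialect.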

The big irony of the situation is that the venerable ASCII code contains characters designed to solve that exact problem: 0x1E (record separator) and 0x1D (group separator), that is 036 and 035 in octal. These characters are never used in textual data (and should be removed if they are), so building and decoding a file using those control characters would be much easier and more robust.
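As a minimal sketch of what such a format could look like: the functions below are hypothetical, and the choice of the unit separator (0x1F, not mentioned above) between values and the record separator (0x1E in hexadecimal, 036 in octal) between records is my assumption about how to map the ASCII hierarchy onto flat records.

```python
UNIT_SEPARATOR = '\x1f'    # between the values of one record (assumed mapping)
RECORD_SEPARATOR = '\x1e'  # between records

def encode(records):
    """Join records without any quoting or escaping."""
    for fields in records:
        for field in fields:
            if UNIT_SEPARATOR in field or RECORD_SEPARATOR in field:
                raise ValueError('separator characters are not allowed in values')
    return RECORD_SEPARATOR.join(UNIT_SEPARATOR.join(fields) for fields in records)

def decode(data):
    """Split the data back into a list of records, each a list of values."""
    return [record.split(UNIT_SEPARATOR) for record in data.split(RECORD_SEPARATOR)]

records = [
    ['Evil Guy', '";drop table'],
    ['Me', 'Do not\nquote me, please'],
]
assert decode(encode(records)) == records
```

Quotes, commas and line breaks need no special treatment here, because none of them has any meaning for the format.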

Why is this not happening? Text editors cannot properly handle those characters, and one of the legacies of Unix is the idea that it is better to have a broken, brittle format that can be manipulated with a text editor than a well-specified binary format. There is a certain irony in calling a file that uses ASCII control characters binary, but as they are not handled by text editors, they are, for all intents and purposes, binary.

Some people will argue that XML is the solution; it really is not. First, there is no standard XML format for passing around flat records; second, XML has the same escaping problem, the only difference being that the characters to escape are different…
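To illustrate the escaping point, here is a small sketch using `escape` and `unescape` from Python's standard `xml.sax.saxutils` module; the value and the `<field>` element name are made up for the example.

```python
from xml.sax.saxutils import escape, unescape

# Commas and quotes are harmless in XML text, but & < > now need escaping.
value = 'Tom & Jerry <"cat", "mouse">'

escaped = escape(value)
element = '<field>' + escaped + '</field>'
restored = unescape(escaped)

print(element)
print(restored == value)
```

A producer that forgets `escape`, or a consumer that forgets `unescape`, creates the same class of bugs as a broken CSV encoder or decoder.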

Edit: two more entries about CSV parsing: CSV parsing and More CSV evil.


Three characters in front of an ornate ring, a man with glowing fire in his palm, a red-haired woman in armour holding a spear and a black-haired woman with the same spear, but with wings and a glow on her brow.

I recently bought the first volume of the comic book Ravine by Stjepan Šejić, aka Nebezial, and Ron Marz. This purchase was a bit curious for me: my interest in fantasy has waned with the years, and I tend to be picky about the drawing style. I had been following Stjepan Šejić on DeviantArt for some time and I really love his drawings. He has a style that is both detailed and energetic; his pictures reminded me of the good covers of Dungeons & Dragons modules. Except Stjepan Šejić has this level of quality throughout each page of the whole album.

Ravine Book 1
Text: Stjepan Šejić & Ron Marz
Illustration: Stjepan Šejić
Top-Cow Production
ISBN : 978-1-60706722-1

The universe seems like a mix of classical fantasy fare: a founding drama, lots of dragons, many humanoid races, a falling empire, a growing religion, and of course a malediction. The book starts with a map and there is a glossary at the end of the book, along with character and race descriptions. While this sounds very much run of the mill, it was clearly done with both passion and skill, and the art is just gorgeous. More importantly, the story with its two chaotic main characters is what makes the whole thing tick. Usually fantasy is extremely predictable, but this story seems to have the level of chaos of an RPG session but few of the conventions, which makes it very interesting.

All in all a very interesting first book, and I’m curious to see where the second one will go; the sneak peeks on DeviantArt look promising.

Google Serve 2013

Red cross truck

Each year, Google employees can spend one day helping out some non-profit organisation. Like last year, I went to help out the Red Cross in Bern. While in the afternoon I helped in the same shop, in the morning I helped a team that emptied a house. One of the ways the Red Cross gets the stuff it sells in its second-hand shops is by operating a free disposal service, typically for people who move out of their houses.

In this case, we had to empty a gorgeous three-storey house. While we moved the stuff out, we did the primary sorting of what would have a chance of being sold: typically only 20% of the stuff makes it to the shops, the rest needs to be sorted and properly recycled. I was not surprised to see CRT television sets, heavy vacuum cleaners from the past century and 486 computers being sent to recycling, but a lot of furniture and books don’t make it to the shop either: people don’t like dark furniture from the sixties, or photo books from the eighties.

Everything that would not go to the shop was therefore dismantled for easier transport. It’s surprising how easy it is to break furniture when you kick in the right place. I felt kind of sad flattening the delicate wooden structure built for some miniature train model. Once the truck was packed full, we drove to the recycling centre, where the trash was sorted and disposed of properly: glass, metal, electronics, paper, construction material, but also plastic containers for bottles, are all handled separately.

In the afternoon, I went to help in the same shop I had gone to last year: La Trouvaille in Bümpliz. I sorted the books, redid the shop window, operated the cash register, and hung up and sorted clothes.

In the end it was a very fun day, and a very interesting one, as I ended up doing quite a few things I had never done before. I’ll probably do the same next year.

The camel has two humps


A friend of mine shared with me this interesting article named The camel has two humps which talks about teaching computer science. It echoes my own experience teaching computer science, but also interviewing candidates: you can teach a lot, but certain people just don’t get programming. With enough emphasis on mathematics and other academic branches, you get graduates that can’t code.

This was one of the most frustrating aspects of teaching basic programming: the grade distribution looked like a camel, with one hump of students who just did not get coding and another (smaller) hump of students who understood it. Writing the course for the average student meant targeting the rare students in the valley in between. The lower hump would still be lost, and the upper one bored. A lot of the assumptions of academia and teaching are based on a bell-shaped distribution, and many of them crumble when you have a camel distribution.

Text encodings, the original sin


While text is one of the simplest forms of data a computer can manipulate, it is also one of the most misunderstood: many competent computer scientists get confused by Unicode and the various encodings. One reason for this is the original sin of text processing: assuming that a character is a byte. This assumption is encoded in many languages (C, Python) and in the minds of many programmers, and is the cause of many bugs.

At the core, a computer does not know what a character is, it just manipulates numbers, so one has to build a convention that this number is that character. This is largely arbitrary: one encoding might decide that ‘M’ should be encoded as 0x0C, 0x1C, 0x4D, or 0xD4. Alphabetical order, Baudot code, ASCII, or EBCDIC: choose your standard.
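Three of these four conventions can be checked with a few lines of Python. This is only a sketch: `cp037` is one of several EBCDIC code pages shipped with Python and is my choice here, and Baudot code is left out because the standard library has no codec for it.

```python
letter = 'M'

# Alphabetical order, counting from zero: A=0, B=1, ...
alphabetical = ord(letter) - ord('A')

# ASCII: one byte per character.
ascii_code = letter.encode('ascii')[0]

# EBCDIC, here in its cp037 variant (an assumption, other code pages exist).
ebcdic_code = letter.encode('cp037')[0]

print(hex(alphabetical), hex(ascii_code), hex(ebcdic_code))
```

The same letter ends up as three different numbers, and nothing in the bytes themselves says which convention was used.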

This is the reason why C library functions like fopen let you specify whether you want to open your file in text mode or binary mode (using the b mode). The problem of converting between different binary encodings of text already existed at that point in time, but this has been all but forgotten, and nowadays on Unix systems there is no difference between the binary and the text mode.

EBCDIC and Baudot code faded away, and ASCII and its variants came to dominate. ASCII assigns each character a number in the 0x00 - 0x7F range, which is fine for the characters commonly used in the English language, and sucks for almost everything else. Also note that the way these 7-bit characters are encoded on systems that typically work with 8-bit bytes is pretty wasteful: the 8th bit is just left unused. Technically, one could encode 8 ASCII characters in 7 bytes. Interestingly, among the first 16 code points, only a few (tab 0x09, line feed 0x0A, carriage return 0x0D) are still commonly used in text; the others were not meant to represent text, but to control a teletype, and have mostly died out.
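The "8 characters in 7 bytes" remark can be made concrete with a small sketch. The `pack7` and `unpack7` helpers are hypothetical names of mine, not an existing format: they simply put the 7-bit codes back to back and pad the last byte with zero bits.

```python
def pack7(text):
    """Pack ASCII text using 7 bits per character, most significant bits first."""
    codes = text.encode('ascii')  # raises UnicodeEncodeError on non-ASCII input
    bits = 0
    for code in codes:
        bits = (bits << 7) | code
    n_bits = 7 * len(codes)
    n_bytes = (n_bits + 7) // 8
    # Pad with zero bits on the right so that the result fills whole bytes.
    bits <<= n_bytes * 8 - n_bits
    return bits.to_bytes(n_bytes, 'big')

def unpack7(data, count):
    """Recover `count` characters from data produced by pack7."""
    bits = int.from_bytes(data, 'big')
    bits >>= len(data) * 8 - 7 * count  # drop the padding bits
    return ''.join(
        chr((bits >> (7 * (count - 1 - index))) & 0x7F) for index in range(count)
    )

packed = pack7('ABCDEFGH')
print(len(packed))          # 8 characters fit in 7 bytes
print(unpack7(packed, 8))
```

The price is that characters no longer start on byte boundaries, so the character count has to be stored or known separately.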

So Unicode was invented. It assigns each character (for some definition of character) a number, which currently goes up to 0x10FFFF. This fixed the question of which character has which number. The problem now is how to map a sequence of numbers that can be pretty large to a bunch of bytes. Basically, this is about choosing an encoding.

One way to solve the problem is to reproduce the original sin of ASCII: just decide that some range is good enough, and move on. For instance, Latin-1 defines that characters in the Unicode range 0x00 - 0xFF are mapped to the equivalent bytes; this range contains most of the Western European characters. Done. This has the advantage that all valid ASCII-encoded text is also valid ISO Latin-1 text.
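Both properties are easy to observe with Python's `latin-1` codec; the sample strings are made up for the illustration.

```python
text = 'café'
latin1_bytes = text.encode('latin-1')

# Each byte is simply the Unicode code point of the character.
code_points = [ord(character) for character in text]
same_numbers = list(latin1_bytes) == code_points

# ASCII-encoded text decodes unchanged as Latin-1.
ascii_bytes = 'plain text'.encode('ascii')
ascii_compatible = ascii_bytes.decode('latin-1') == 'plain text'

# Anything above 0xFF simply cannot be represented.
try:
    '€'.encode('latin-1')
    euro_fits = True
except UnicodeEncodeError:
    euro_fits = False

print(same_numbers, ascii_compatible, euro_fits)
```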

UCS-2 took the same approach, but on a larger scale: all codes in the 0x0000 - 0xFFFF range were mapped to the equivalent pairs of bytes, which meant, of course, that there would be two variants of UCS-2: big-endian and little-endian. A special character called the byte order mark (BOM) was added to detect which is which.
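Python has no UCS-2 codec, so the sketch below uses the UTF-16 codecs instead, under the assumption that the two produce the same bytes for characters in the 0x0000 - 0xFFFF range.

```python
import codecs

# The same character in both byte orders.
big_endian = 'A'.encode('utf-16-be')
little_endian = 'A'.encode('utf-16-le')

# The byte order mark is the code point 0xFEFF, written in the chosen byte order.
bom_big = codecs.BOM_UTF16_BE
bom_little = codecs.BOM_UTF16_LE

# The generic 'utf-16' decoder reads the BOM to pick the byte order.
from_big = (bom_big + big_endian).decode('utf-16')
from_little = (bom_little + little_endian).decode('utf-16')

print(big_endian, little_endian, from_big, from_little)
```

Without the BOM, a decoder has to be told the byte order or guess it.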

UTF-8 is probably the smartest encoding of code points into bytes. It is a variable-length encoding with nice properties: values in the ASCII range are encoded as such, i.e. the encoding is compatible with ASCII, and all code points from 0x80 upwards are encoded as multi-byte sequences. The format of each byte makes it possible to determine its position in a multi-byte sequence. Note that UTF-8 is less efficient than Latin-1 for European texts (non-ASCII characters require two bytes instead of one), and pretty bad for Asian text (3 bytes per character instead of 2). UTF-8 also makes text manipulation difficult, as the byte position of each character in the stream depends on the preceding characters.
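A short sketch of these properties in Python; the helper names are mine. Continuation bytes always have the bit pattern 10xxxxxx, which is what lets a decoder find where each character starts.

```python
def utf8_length(character):
    """Number of bytes used by one character in UTF-8."""
    return len(character.encode('utf-8'))

def is_continuation(byte):
    """True for the second and later bytes of a multi-byte sequence."""
    return (byte & 0xC0) == 0x80

def count_code_points(data):
    """Count characters in UTF-8 data by counting the bytes that start one."""
    return sum(1 for byte in data if not is_continuation(byte))

lengths = [utf8_length(character) for character in ['A', 'é', '日', '😀']]
sample = 'naïve 日本語'
encoded = sample.encode('utf-8')

print(lengths)
print(len(sample), len(encoded), count_code_points(encoded))
```

Finding the n-th character still means scanning from the start, which is the manipulation cost mentioned above.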

Because of this, many languages, including Java and JavaScript, use the UTF-16 encoding. UTF-16 is basically an extension of UCS-2: most code points are represented as 16-bit quantities, and values above 0xFFFF are represented using a special encoding on two 16-bit values, called surrogate pairs. This means that the code point 0x10000 is encoded on 32 bits.
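The surrogate pair arithmetic is short enough to spell out. The `to_surrogates` function is a hypothetical helper written for this illustration; the result is checked against Python's own `utf-16-be` codec.

```python
def to_surrogates(code_point):
    """Split a code point above 0xFFFF into its high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError('only code points above 0xFFFF use surrogate pairs')
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogates(0x10000)
manual = high.to_bytes(2, 'big') + low.to_bytes(2, 'big')
builtin = chr(0x10000).encode('utf-16-be')

print(hex(high), hex(low), manual == builtin)
```

One consequence is that the length of a string reported by such languages counts 16-bit units, not characters.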

The main source of confusion I have observed among computer scientists is that they mix up string objects in the memory of a computer and their byte representation. It is quite common for a program to use the language’s representation of strings in memory and a different byte representation when doing I/O on that data. Also, having a bunch of bytes of type text does not tell you which representation said text is in: on a web page, or in an e-mail, there are headers that specify this, but not with typical operating system files.
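In Python 3 the distinction is visible in the types: `str` is the in-memory string, `bytes` is a representation. A minimal sketch with a one-character string:

```python
text = 'é'                      # one character in memory

utf8 = text.encode('utf-8')     # two bytes
latin1 = text.encode('latin-1') # one byte
utf16 = text.encode('utf-16-le')  # two bytes, different from the UTF-8 ones

# Decoding with the wrong encoding does not fail, it silently produces garbage.
garbled = utf8.decode('latin-1')

print(len(text), len(utf8), len(latin1), len(utf16))
print(garbled)
```

The bytes alone do not say which of the three encodings produced them, which is why the encoding has to travel with the data, as the headers of a web page or an e-mail do.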

High Traffic

Traffic Spike – falsehoods about geography

This blog has been in a pretty shaky state lately: the PHP virtual machine seems to run out of memory on a regular basis (so much for garbage collection), and this week saw a network problem at Deutsche Telekom that affected Swisscom and probably also the network provider where our machine resides:

The internet outage occurred due to technical problems at Deutsche Telekom. Partially, our international connections could not be transmitted over their servers. Specialists from Deutsche Telekom are working intensively to resolve the issue.

To make the situation even worse, an article titled Falsehoods programmers believe about addresses added a link to my post about Falsehoods programmers believe about geography from a year ago. While it is really nice to be referenced in such a way, it caused a significant traffic spike and made the blog temporarily unavailable. The situation has now settled, but it has become clear that the current configuration is not optimal.
