Python & Unicode

One of the main traps of Unicode is that many people assume that storing everything in UTF-8 is enough to be safe. Alas, this is not the case. Even if your language supposedly supports Unicode, you can end up in strange situations. Consider the following code:

# -*- coding: utf-8 -*-

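# Composed form: each "é" is the single code point U+00E9.
pachyderm_1 = u'\u00e9l\u00e9phant'
# Decomposed form: each "é" is "e" (U+0065) plus a combining acute accent (U+0301).
pachyderm_2 = u'e\u0301le\u0301phant'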
format = u'The length of %s is %d'

print pachyderm_1 == pachyderm_2
print format % (pachyderm_1, len(pachyderm_1))
print format % (pachyderm_2, len(pachyderm_2))

What do you think this code prints? To see for yourself, just download the source file and run it. The problem is that Python does not normalize Unicode strings. The first string is in canonical composed form, where each “é” is a single code point (U+00E9), while in the second each “é” is a sequence of a plain “e” (U+0065) followed by a combining acute accent (U+0301). So Python sees the first string as 8 characters long, while the second appears to have 10, and the comparison returns False. Of course, other operations on those strings, such as slicing, will also return different results. One funny consequence is that methods like title() give strange results: ÉLéPhant for the second string. So while Python supports Unicode, you basically can’t trust any of the basic string manipulation functions.
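One possible way around this is to normalize strings to a single form before comparing or measuring them. The sketch below is only an illustration, not part of the original source file: it assumes Python 3 (where every str is Unicode) and uses the standard library's unicodedata.normalize; the helper name nfc and the variable names are made up for this example.

```python
import unicodedata

# Composed form: each "é" is the single code point U+00E9.
composed = "\u00e9l\u00e9phant"
# Decomposed form: each "é" is "e" (U+0065) followed by
# the combining acute accent (U+0301).
decomposed = "e\u0301le\u0301phant"


def nfc(text):
    """Return text in canonical composed form (NFC). Hypothetical helper."""
    return unicodedata.normalize("NFC", text)


print(composed == decomposed)            # False: different code point sequences
print(len(composed), len(decomposed))    # 8 10
print(nfc(composed) == nfc(decomposed))  # True once both are in NFC
print(len(nfc(decomposed)))              # 8
```

Normalizing at the boundaries of a program (when text is read in) tends to be simpler than normalizing before every comparison, but which form to pick (NFC or NFD) depends on what the rest of the code expects.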

5 thoughts on “Python & Unicode”

  1. It works with 2.4 and 2.5. I haven’t tried 2.6, but I think it behaves the same. And Python 3 isn’t a consideration for now. In any case, it does the same thing.
