
One of the main traps of Unicode is that many people assume you only need to store everything in UTF-8 and you will be safe. Alas, this is not the case. Even if your language supposedly supports Unicode, you can end up in strange situations. Consider the following code:
#!/usr/bin/python
# -*- coding: utf-8 -*-
pachyderm_1 = u'éléphant'
pachyderm_2 = u'éléphant'
format = u'The length of %s is %d'
print pachyderm_1 == pachyderm_2
print format % (pachyderm_1, len(pachyderm_1))
print format % (pachyderm_2, len(pachyderm_2))
What do you think this code prints? To see for yourself, just download the source file and run it. The problem is that Python does not normalize Unicode strings. The first string is in canonical composed form, where each “é” is a single code point (U+00E9), while in the second each “é” is a sequence of a plain e (U+0065) followed by a combining acute accent (U+0301). So Python sees the first string as being 8 characters long, while the second appears to have 10. The comparison returns False. Of course, operations on those strings, like slices, will return completely different results. One funny consequence is that methods like title() give strange results: ÉLéPhant for the second string. So while Python supports Unicode, you basically can’t trust any of the basic string manipulation functions unless you know how your strings are normalized.
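One possible way around the trap is to normalize strings yourself before comparing or manipulating them. The following is only a sketch, assuming Python 2.6 or later (or Python 3.3 or later) for the print function import; the helper name nfc and the variable names are mine, not part of the original source file. The escapes spell out the two forms so that an editor or a copy and paste cannot silently change them.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import unicodedata

# Both strings render as "éléphant"; the escapes make the difference visible.
composed = u'\u00e9l\u00e9phant'      # each "é" is the single code point U+00E9
decomposed = u'e\u0301le\u0301phant'  # each "é" is "e" followed by U+0301

def nfc(text):
    """Return text in Normalization Form C (hypothetical helper name)."""
    return unicodedata.normalize('NFC', text)

print(composed == decomposed)            # False
print(len(composed), len(decomposed))    # 8 10
print(nfc(composed) == nfc(decomposed))  # True
print(len(nfc(decomposed)))              # 8
print(nfc(decomposed).title() == u'\u00c9l\u00e9phant')  # True
```

Normalizing at the boundaries of the program (when text is read in) is usually easier than normalizing before every operation, but which form to pick (NFC or NFD) depends on what the rest of your code expects.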
Good thing you pointed this out, I would never even have understood what this bug was!
I'll go to sleep a little less dumb tonight.
Which version of Python are you talking about?
It works with 2.4 and 2.5. I haven't tried 2.6, but I think it's the same. And Python 3 isn't in the picture for now. In any case, it does the same thing.
That's not wrong!
A python is a kind of snake.