String Length

ⓊⓉⒻ-⑧

One of the drawbacks of the UTF-8 character encoding is that it uses a variable number of bytes to represent a character. This can make text processing more difficult and possibly slower. The canonical blog has an interesting entry about coding the strlen function for ASCII and UTF-8 strings in 8086 assembly. The conclusions are interesting: the version that gcc inlines is less efficient than the naive code, and the UTF-8 version is not that much slower.
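To make the comparison concrete, here is a minimal C sketch of the two counting loops. It is not the assembly from the blog entry, only an illustration of the idea. The names `ascii_strlen` and `utf8_strlen` are made up for this example, and the UTF-8 version assumes the input is valid, NUL-terminated UTF-8. It relies on the fact that continuation bytes always have the bit pattern 10xxxxxx, so counting every byte that does not match that pattern gives the number of code points.

```c
#include <assert.h>
#include <stddef.h>

/* Byte length of a NUL-terminated string: the naive strlen loop.
   (Hypothetical name, used only for this illustration.) */
size_t ascii_strlen(const char *s)
{
    assert(s != NULL);
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}

/* Number of code points in a NUL-terminated string.
   Assumption: the input is valid UTF-8. Continuation bytes look like
   10xxxxxx, so every byte that does NOT match starts a new code point. */
size_t utf8_strlen(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (const unsigned char *p = (const unsigned char *)s; *p != '\0'; p++) {
        if ((*p & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

The UTF-8 loop still visits each byte exactly once and adds only a mask and a compare per byte, which is one plausible reason the two versions end up close in speed.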

The canonical blog is down, but there is a follow-up article.

2 thoughts on “String Length”

  1. Well, you can always convert everything to UTF-32, you’ll always have 4 bytes for each character. Space disk is cheap :-)

  2. Disk space might be cheap, but disk throughput is very bad, and even if your data is in memory, memory access is still very bad, so I suspect parsing a long utf-32 string will be slower than an utf-16, or even an utf-8 (assuming most of your characters are ASCII).
