ⓊⓉⒻ-⑧
One of the drawbacks of the UTF-8 character encoding is that it uses a variable number of bytes to represent a character. This can make text processing more difficult and possibly slower. The canonical blog has an interesting entry about coding the strlen function for ASCII and UTF-8 strings in 8086 assembly. The conclusions are interesting: the version that gcc inlines is less efficient than the naive code, and the UTF-8 version is not that much slower.
The canonical blog is down, but there is a follow-up article.
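To make the idea of a UTF-8 strlen concrete, here is a minimal sketch in C rather than assembly. It is not the code from the blog entry: the function name `utf8_strlen` is made up for illustration, and the sketch assumes the input is valid, NUL-terminated UTF-8. It counts characters (code points) by skipping continuation bytes, which always have the bit pattern 10xxxxxx.

```c
#include <stddef.h>

/*
 * Count the code points in a NUL-terminated UTF-8 string.
 * Hypothetical helper, assuming the input is valid UTF-8:
 * every byte that is NOT a continuation byte (10xxxxxx)
 * starts a new character, so we count only those bytes.
 */
size_t utf8_strlen(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

The loop still visits every byte exactly once, just like a plain strlen. The only extra work per byte is a mask and a compare, which is consistent with the UTF-8 version not being much slower.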
Well, you can always convert everything to UTF-32; then you’ll always have 4 bytes for each character. Disk space is cheap :-)
Disk space might be cheap, but disk throughput is very bad, and even if your data is in memory, memory access is still slow, so I suspect parsing a long UTF-32 string will be slower than parsing a UTF-16 string, or even a UTF-8 one (assuming most of your characters are ASCII).
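To put rough numbers on the size argument, here is a small C sketch that reports how many bytes the same text would occupy in each encoding. The names `encoded_sizes` and `measure_encodings` are invented for this example, and the input is assumed to be valid, NUL-terminated UTF-8. It measures size only and says nothing about actual parsing speed.

```c
#include <stddef.h>

/* Byte counts for the same text in three encodings (hypothetical type). */
struct encoded_sizes {
    size_t utf8;
    size_t utf16;
    size_t utf32;
};

/*
 * Walk a valid, NUL-terminated UTF-8 string and add up the bytes
 * it would need in UTF-8, UTF-16 and UTF-32 (terminator excluded).
 */
struct encoded_sizes measure_encodings(const char *s)
{
    struct encoded_sizes n = {0, 0, 0};

    for (; *s != '\0'; s++) {
        unsigned char b = (unsigned char)*s;

        n.utf8++;                      /* every byte counts for UTF-8 */
        if ((b & 0xC0) != 0x80) {      /* lead byte: one code point   */
            n.utf32 += 4;              /* always 4 bytes in UTF-32    */
            /* A lead byte 11110xxx starts a 4-byte sequence, i.e. a
               code point above U+FFFF, which needs a surrogate pair
               (4 bytes) in UTF-16; everything else needs 2 bytes.   */
            n.utf16 += (b >= 0xF0) ? 4 : 2;
        }
    }
    return n;
}
```

For pure ASCII text such as "hello" this gives 5, 10 and 20 bytes, so a UTF-32 parser has to pull four times as much data through memory as a UTF-8 one for the same characters.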