Substring considered harmful – an example

php_substring_bug

I have complained on languages that allow substring operations. It should come as no surprise that I found an occurrence of that bug in the theme I use in my blog.

As you can see on the image, the post preview contains some corrupted Unicode data. The thing is, the title this data is generated from is perfectly valid and contains the following text:

<!–:fr–>Bonne Année<!–:–><!–:en–>Happy New Year!<!–:–><!–:ja–>明けましておめでとう!<!–:–><!–:de–>Einen Guten Rutsch ins neue Jahr!<!–:–>

The weird comment tags are leftover of the previous plugin I used for handling multiple languages, and serve as delimiters between languages. They should be ignored by the rest of system.

So why do I end with corrupt data? The problem lies in the following PHP snippet (there are two of them in fact):

<header class="entry-header">
  
  < ?php 
    if (strlen(get_the_title()) >= 85) { ?>
      <h1 class="entry-title"><a href="<?php the_permalink(); ?>" data-title="< ?php the_title(); ?>" rel="bookmark">
  < ?php echo substr(get_the_title(), 0, 84)."...";
  }
      
    else { ?>
    <h1 class="entry-title"><a href="<?php the_permalink(); ?>" rel="bookmark">
  < ?php the_title();  
    } 
       ?>
</a></h1>
</a></h1></header>

The intent of the author of this code is pretty clear, if the entry-title is longer than 85 characters, cut the title and append an ellipsis. This is a code pattern you will find in many user-interface codes.

Problem is, this code does not do what the author think it does. In PHP substr is defined in bytes, not characters. In UTF-8, characters are thus typically 2 bytes long and Kanji (like 明けましておめでとう!) are three bytes. Here the 84th byte happens to fall in the middle of the ‘し’ character, and cutting there produces invalid UTF-8 data. The biggest irony is that because string length is computed before the invisible tags are stripped, the selected cut point is wrong anyway…

What is the fix? PHP actually has functions to get the width of a string in runes and cutting to the right number of unicode characters: mb_strwidth and mb_strimwidth.

You can fix your sixteen installation by replacing the following files:

One thought on “Substring considered harmful – an example”

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.