Most injection attacks follow the same pattern: a character or a sequence with a special meaning is not properly handled in user provided data, and ends up being interpreted the wrong way. While most software engineer have learnt to escape delimiters in various scripting languages, unicode offers its own vector for the malicious user in the form of bidi control characters. Those characters are meant to be used for embedding left-to-right text within a right-to-left text, or vice versa, they can also be misused to change the behaviour of user-interfaces. The core problem is that those characters are control characters, and should not be left unchecked in user provided text, but as their existence is pretty obscure, they are not well known.
So what can a malicious user do thanks to those characters? Basically change the behaviour of display of text after the insertion point of the text. Two characters are particularly useful for this: right-to-left-override (
0x202E) and right-to-left-embedding (
0x202B). The first inverts the display order of all characters until the end of the run. The second inverts the display orders of tokens (words) and the direction of punctuation until the next strong character. A strong character is a character whose reading direction is fully defined, typically a roman character.
|User evil (email@example.com)
||User evil (firstname.lastname@example.org)
|User evil is not trusted!
||User evil is not trusted!
|Level of evil > 66
||Level of evil < 100 > 66
The table above shows some examples of abusing those characters. In the first two line, user evil, has embedded a right-to-left-override at the end of its display name (evil). On the first line, this lets him reverse the display of his e-mail address, making it look like the domain is
good.org and not
evil.com. In the second line, he uses this to transform a warning into gibberish. Note that the user-provided field is enclosed in an <em> tag, but the formatting instructions escape the tag.
In the third, the evil users controls the name of some measurement, and uses right-to-left-embedding to reverse the display order of the tokens, but also invert the direction of a greater than sign. So by injecting the sequence <
0x202B 100, the UI now displays the reverse semantic information, i.e. that the level of evil is lower than 66, when in fact, it is higher.
Another example of abusing such characters would be in source code, by embedding them in the string constants and the comments, one can craft code that will look a given way in a web based review tool, but executes another way. Another use is to submit certain keyword reversed, so they do not match a black-list, but embedded with bidi control character to display in the right order.
There are multiple ways to mitigate such attacks. The best one is probably to remove all unicode control characters from user-provided input. Another one is to add pop directional formatting (
0x202C) at the end of the user-provided area.
Bidi injections are already used in practice, it is used by at least one Mac OS X malware.