good,⃣evil
My post on CSV parsing got quite some attention, with various systems parsing such files quite differently. One Google+ post by Kristian Köhntopp that referenced it had a nice phrase:
If it is not a state machine, it ain’t a correct parser
This got me thinking: nowadays CSV files include Unicode characters, whose parsing requires its own state machine. Is it possible to make the two state machines interact? In other words, can I construct a file that is valid Unicode text but which, if parsed as CSV, produces invalid Unicode records?
The answer is yes, thanks to Unicode's combining characters. These characters combine with the character preceding them, modifying it. One example is U+20E3 (COMBINING ENCLOSING KEYCAP), which adds a rounded box to the preceding character, so we can build a boxed A character: A⃣.
What happens when we box a comma? Either the Unicode parser has precedence and consumes the comma to build a combined boxed-comma character, which means the CSV parser will never see it. Or the CSV parser takes precedence and consumes the comma, leaving a combining character with no base character at the start of a field, which Unicode calls a defective combining character sequence. RFC 4180 says nothing of Unicode combining characters, and Unicode says nothing of CSV files. If you need something more confusing, there is also a combining comma at code point U+0326 (COMBINING COMMA BELOW). Here is a very short example of the words good and evil separated with such a boxed comma. How does your favourite library parse this data?
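As one data point, here is a minimal sketch of how the question can be put to Python's standard csv module. The helper name starts_with_combining_mark is made up for this illustration, and the check uses the Unicode general category (Mn, Mc, Me) rather than the canonical combining class, because enclosing marks such as U+20E3 have combining class 0.

```python
import csv
import io
import unicodedata

# The sample record: "good", a comma, U+20E3 COMBINING ENCLOSING KEYCAP, "evil".
data = "good,\u20e3evil\n"

# Python's csv module works on code points, so it has no notion of
# a comma that has been "consumed" by a following combining mark.
rows = list(csv.reader(io.StringIO(data)))
fields = rows[0]


def starts_with_combining_mark(text):
    """Hypothetical helper: True if text begins with a Unicode mark (Mn, Mc or Me)."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")


for field in fields:
    print(repr(field), starts_with_combining_mark(field))
```

With this module the CSV parser takes precedence: the record is split at the comma, and the second field begins with the orphaned combining character. Other libraries may well behave differently.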
Actually, RFC 4180 unfortunately says nothing about Unicode at all.
From the RFC at the end of section 2:
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E
Hence Unicode characters are not allowed according to the grammar.
Kind of sad seeing that the document is from 2005.
What is needed is an updated version which tackles Unicode properly, with the issues you mentioned and BOM issues, UTF-8 vs UTF-16, and so on …
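The TEXTDATA rule quoted above can be checked mechanically. The following is only a sketch: the regular expression is a direct transcription of that single ABNF line, and the function name is_textdata is invented for the example. It does not cover the rest of the RFC 4180 grammar (quoted fields, line breaks, and so on).

```python
import re

# TEXTDATA = %x20-21 / %x23-2B / %x2D-7E, transcribed from RFC 4180.
# The gaps exclude the double quote (0x22) and the comma (0x2C).
TEXTDATA = re.compile(r"[\x20-\x21\x23-\x2B\x2D-\x7E]*")


def is_textdata(field):
    """Hypothetical helper: True if every character of field is in TEXTDATA."""
    return TEXTDATA.fullmatch(field) is not None


print(is_textdata("good"))          # plain ASCII letters
print(is_textdata("\u20e3evil"))    # starts with U+20E3
print(is_textdata("Köhntopp"))      # contains a non-ASCII letter
```

Under this reading of the grammar, any field containing a character outside printable ASCII is rejected, combining or not.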