My post on CSV parsing got quite a bit of attention, with various systems parsing CSV files quite differently. A Google+ post by Kristian Köhntopp referencing that post had a nice phrase:
If it is not a state machine, it ain’t a correct parser
This got me thinking: nowadays CSV files include Unicode characters, whose parsing requires its own state machine. Is it possible to make the two state machines interact? In other words, can I construct a file that is valid Unicode text which, if parsed as CSV, produces invalid Unicode records?
The answer is yes, thanks to Unicode’s combining characters. These characters combine with the character preceding them, modifying it. One example of such a character is U+20E3 (COMBINING ENCLOSING KEYCAP), which adds a rounded box to the preceding character, so we can build a boxed A character: A⃣.
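To make the boxed A concrete, here is a minimal sketch in Python using only the standard `unicodedata` module. The variable name `boxed_a` is mine, purely for illustration. It shows that what renders as one boxed glyph is really two code points, the second of which is an enclosing mark.

```python
import unicodedata

# A base character followed by the combining character U+20E3.
boxed_a = "A" + "\u20e3"

# What renders as a single boxed glyph is two code points.
print(len(boxed_a))                    # 2

# The second code point is a mark (general category "Me", enclosing mark).
print(unicodedata.name(boxed_a[1]))    # COMBINING ENCLOSING KEYCAP
print(unicodedata.category(boxed_a[1]))  # Me
```

Any parser that works code point by code point will see the `A` and the box as separate items, which is exactly what the comma experiment relies on.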
What happens when we box a comma? Either the Unicode parser has precedence and consumes the comma to build a combined boxed-comma character, which means the CSV parser will never see it. Or the CSV parser takes precedence and consumes the comma, leaving a boxing character at the start of a field with nothing to combine with, which is illegal. RFC 4180 says nothing of Unicode combining characters, and Unicode says nothing of CSV files. If you need something more confusing, there is also a combining comma with code point U+0326.
Here is a very short example of the words good and evil separated with such a boxed comma. How does your favourite library parse this data?
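As one data point, here is a rough sketch of the experiment with Python's standard `csv` module. The test string and the helper `starts_with_mark` are my own illustration, not taken from any library, and other CSV parsers may well behave differently. The helper treats any code point whose general category begins with `M` as a combining mark.

```python
import csv
import unicodedata

# "good" and "evil" separated by a comma followed by U+20E3 (the boxed comma).
data = "good,\u20e3evil"

# csv.reader accepts any iterable of lines; here a single line.
fields = next(csv.reader([data]))

def starts_with_mark(text):
    """Hypothetical helper: True if text begins with a combining mark."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")

# ascii() escapes non-ASCII code points so the stray mark is visible.
print(ascii(fields))                 # ['good', '\u20e3evil']
print(starts_with_mark(fields[1]))   # True

# The combining comma U+0326 is not a comma at all to the CSV parser.
print(len(next(csv.reader(["good\u0326evil"]))))  # 1
```

In this case the CSV parser takes precedence: it splits on the comma code point, and the second field starts with an orphaned boxing character.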