CSV Files

Comma Separated Value files are one of the most common ways of passing around data, while there is an RFC describing that format, it is extremely fuzzy, and there are many subtle variants of the format, and many people get them wrong, simply assuming they can join the values with comma characters and lines with carriage returns, maybe adding quotes around each record; this is bound to fail, as textual data typically contains quotes, commas, and carriage returns. So we get numerous bugs because some program does not properly escape characters, or some other does not properly decode escape characters.

The big irony of the situation is that the venerable ASCII code contains characters design to solve that exact problem 0x36 (record separator) and 0x35 (group separator). These characters are never used in textual data (and should be removed if they are), so building and decoding a file using those control characters would be much easier and more robust.

Why is this not happening? Text editors cannot properly handle those characters, and one of the legacies of Unix is that it is better to have a broken, brittle format that can be manipulated with a text editor than a well specified binary format. There is a certain irony of calling a file using ASCII control characters binary, but as they are not handled by text-editor, they are, for all intent and purposes, binary.

Some people will argue that XML is the solution – it really is not. Because first there is no standard XML format for passing around flat records, second because XML has the same escaping problem, the only difference is that the characters to escape are different…

Edit: two more entries about CSV parsing: CSV parsing and More CSV evil.

4 thoughts on “CSV Files”

  1. OMG YES!! This annoys the crap out of me, too.

  2. Wait wait wait, a CSV is an Excel file, isn’t???

    I still have problems with my customers who do not even know what a text editor is. We ask them to a few rows of data, and we have to choice to make them fill an Excel file (that our ETL is often too stupid to load), a CSV file (with Excel) or a CSV file (with Word).

  3. I think you missed the point. CSV files wasn’t formalized because some Unix guys prefer text format. But because a lot of developers prefer text format. CSV have been formalized later, when it was clear many use it and no real definition existed.

    Why use text format? It’s easy to create, to use and to transfer. Any text editor can be used as tool.

    By contrast, your proposition are contradictory. You want use some binary characters present in ASCII, and you complain because text editor can’t handle it. If text editor could handle it, this character would be legal text character. And you would use it in your string.

    It is a vicious circle. You can’t have text format data without escape problems.

  4. I’m not complaining, I think we should use binary format, and stop hacking around with textual representation, as they are both inefficient and insecure. The point is actually that you cannot safely edit a textual file if you have not fully integrated the escaping rules, most people who think they understand the escaping rules are wrong…

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: