CSV Parsing

character,quote
Inigo,"You killed my father
Darth, I am your father"
Buddha,""
Dada,'dodo'
Evil Guy,""";drop table"
"Expert", "Trust me, I'm an expert"
Balmer,"""Developers, Developers"""
Yoda,Do,do not.do not try
Me,"Do not
quote me, please"

I recently wrote about the complexity of the CSV format. Many people think the format is well defined, and well understood, so I though I would build an example file highlighting the complexity of such data. You can download the original CSV file.

The goal of the game is simple: tell me what is the correct parsing of that data, how many lines, how many columns, and their content. Of course, you can open the file with some tool, but that’s cheating, and you will have to trust the tool to do the right thing…

10 thoughts on “CSV Parsing”

  1. Yea, csv is evil!
    https://gist.github.com/gebi/5985047

    % go run csv-test.go -lazy_quotes=0 <evil.csv
    1 – 0:"character" 1:"quote"
    2 – 0:"Inigo" 1:"You killed my father\nDarth, I am your father"
    3 – 0:"Buddha" 1:""
    4 – 0:"Dada" 1:"'dodo'"
    5 – 0:"Evil Guy" 1:"\";drop table"
    line 7, column 10: bare " in non-quoted-field
    exit status 1

    % go run csv-test.go -lazy_quotes=1 <evil.csv
    1 – 0:"character" 1:"quote"
    2 – 0:"Inigo" 1:"You killed my father\nDarth, I am your father"
    3 – 0:"Buddha" 1:""
    4 – 0:"Dada" 1:"'dodo'"
    5 – 0:"Evil Guy" 1:"\";drop table"
    6 – 0:"Expert" 1:" \"Trust me" 2:" I'm an expert\""
    7 – 0:"Balmer" 1:"\"Developers, Developers\""
    8 – 0:"Yoda,Do,do not.do not try"
    9 – 0:"Me" 1:""Do not"
    10 – 0:"quote me" 1:" please""

  2. Impressive, go gets confused in both modes, there is no reason to have three records in line 6…

  3. Yea it was actually my fault it seems.
    Though should this behaviour be the default?

    % go run csv-test.go -lazy_quotes=0 -trim_leading_space=1 <evil.csv
    1 – 0:"character" 1:"quote"
    2 – 0:"Inigo" 1:"You killed my father\nDarth, I am your father"
    3 – 0:"Buddha" 1:""
    4 – 0:"Dada" 1:"'dodo'"
    5 – 0:"Evil Guy" 1:"\";drop table"
    6 – 0:"Expert" 1:"Trust me, I'm an expert"
    7 – 0:"Balmer" 1:"\"Developers, Developers\""
    8 – 0:"Yoda,Do,do not.do not try"
    9 – 0:"Me" 1:""Do not"
    10 – 0:"quote me" 1:"please""

  4. It seems python also produces 3 columns for line 6 per default.
    6 – 0 Expert 1 “Trust me 2 I’m an expert”

  5. Decided to check my own CSV parser (which lives at https://github.com/roman-kashitsyn/text-csv)

    The output is:

    ==== Line 1
    Field 0 : Value: {character}
    Field 1 : Value: {quote}
    ==== Line 2
    Field 0 : Value: {Inigo}
    Field 1 : Value: {You killed my father
    Darth, I am your father}
    ==== Line 3
    Field 0 : Value: {Buddha}
    Field 1 : Value: {}
    ==== Line 4
    Field 0 : Value: {Dada}
    Field 1 : Value: {‘dodo’}
    ==== Line 5
    Field 0 : Value: {Evil Guy}
    Field 1 : Value: {“;drop table}
    ==== Line 6
    Field 0 : Value: {Expert}
    Field 1 : Value: { “Trust me}
    Field 2 : Value: { I’m an expert”}
    ==== Line 7
    Field 0 : Value: {Balmer}
    Field 1 : Value: {“Developers, Developers”}
    ==== Line 8
    Field 0 : Value: {Yoda,Do,do not.do not try}
    ==== Line 9
    Field 0 : Value: {Me}
    Field 1 : Value: {"Do not}
    ==== Line 10
    Field 0 : Value: {quote me}
    Field 1 : Value: { please"}

    So there are 3 fields in the 6th line. The RFC (https://tools.ietf.org/html/rfc4180) states that
    1. Spaces are considered part of a field and should not be ignored.
    2. Fields containing line breaks (CRLF), double quotes, and commas should be enclosed in double-quotes.

    The proposed grammar is:

    file = [header CRLF] record *(CRLF record) [CRLF]
    header = name *(COMMA name)
    record = field *(COMMA field)
    name = field
    field = (escaped / non-escaped)
    escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
    non-escaped = *TEXTDATA

    My parser just sees, the there’s no DQUOTE after the comma, so the field is not escaped, so we just read the field until the next comma. Since the RFC doesn’t define error recovery modes (and how to deal with Unicode, which makes this RFC pretty useless), it’s a reasonable thing to do.

    So your example is not RFC-compliant and hence different parsers can treat it’s content in different ways.

    Go parser actually does a pretty good job by checking that the non-escaped fields contain no DQUOTEs. Probably, I should add a similar option too.

Leave a Reply to gebiCancel reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.