CSV parsing with Commons CSV - quotes inside quotes throwing an IOException

I am using Commons CSV to parse TV show related CSV content. One of the shows has a display name that includes double quotes:

116,6,2,29 Sep 10,""JJ" (60 min)","http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj"

The show name is "JJ" (60 min), which is itself already wrapped in double quotes. Parsing it throws java.io.IOException: (line 1) invalid char between encapsulated token and delimiter.

    ArrayList<String> allElements = new ArrayList<String>();
    CSVFormat csvFormat = CSVFormat.DEFAULT;
    CSVParser csvFileParser = new CSVParser(new StringReader(line), csvFormat);

    List<CSVRecord> csvRecords = null;

    csvRecords = csvFileParser.getRecords();

    for (CSVRecord record : csvRecords) {
        int length = record.size();
        for (int x = 0; x < length; x++) {
            allElements.add(record.get(x));
        }
    }

    csvFileParser.close();
    return allElements;

      

CSVFormat.DEFAULT already sets withQuote('"').

I think this CSV is not properly formatted, since ""JJ" (60 min)" should be "JJ (60 min)". Is there a way to get Commons CSV to handle this, or do I need to fix this entry manually?

Additional information: other display names in the CSV contain spaces and commas, and they are enclosed in double quotes.

+3




3 answers


The problem is that the quotes are not escaped properly, and your parser can't handle that. Try univocity-parsers, as it is the only Java parser I know of that can handle unescaped quotes inside a quoted value. It is also 4x faster than Commons CSV. Try this code:

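import com.univocity.parsers.csv.CsvParser;
import com.univocity.parsers.csv.CsvParserSettings;
import com.univocity.parsers.csv.UnescapedQuoteHandling;

// configure the parser to accept unescaped quotes inside a quoted value
CsvParserSettings settings = new CsvParserSettings();
settings.setUnescapedQuoteHandling(UnescapedQuoteHandling.STOP_AT_CLOSING_QUOTE);

// create the parser
CsvParser parser = new CsvParser(settings);

// parse your line
String[] out = parser.parseLine("116,6,2,29 Sep 10,\"\"JJ\" (60 min)\",\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"");

// print one field per line
for (String e : out) {
    System.out.println(e);
}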

      

This will print:



116
6
2
29 Sep 10
"JJ" (60 min)
http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj

      

Hope it helps.

Disclosure: I am the author of this library. It is open source and free (Apache 2.0 license).
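
For readers who want to see the general idea without adding a dependency, here is a minimal stdlib-only sketch of a lenient splitter. It is not how univocity-parsers is implemented. It only illustrates one heuristic: inside a quoted field, a quote closes the field only when a comma or the end of the line follows it. The class name LenientCsvSplit and its method are hypothetical, and the sketch assumes single-line records and no RFC-style doubled quotes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class LenientCsvSplit {

    // Splits one line on commas. Inside a quoted field, a quote ends the field
    // only if it is followed by a comma or the end of the line; any other
    // quote is kept as literal text.
    static List<String> split(String line) {
        List<String> fields = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean fieldStart = true;
        for (int i = 0; i < line.length(); i++) {
            char c = line.charAt(i);
            if (inQuotes) {
                boolean last = i + 1 == line.length();
                if (c == '"' && (last || line.charAt(i + 1) == ',')) {
                    inQuotes = false;          // closing quote
                } else {
                    current.append(c);         // literal text, including stray quotes
                }
            } else if (c == ',') {
                fields.add(current.toString());
                current.setLength(0);
                fieldStart = true;
            } else if (c == '"' && fieldStart) {
                inQuotes = true;               // opening quote
                fieldStart = false;
            } else {
                current.append(c);
                fieldStart = false;
            }
        }
        fields.add(current.toString());
        return fields;
    }

    public static void main(String[] args) {
        String line = "116,6,2,29 Sep 10,\"\"JJ\" (60 min)\","
                + "\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"";
        for (String field : split(line)) {
            System.out.println(field);
        }
    }
}
```

The heuristic still misreads a value in which a literal quote is immediately followed by a comma, so it is a workaround for this kind of input, not a replacement for well-formed CSV.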

+3




Quoting mainly exists so that fields can contain separator characters. If embedded quotes in a field are not escaped, this can't work, so there is no point in using quotes at all. If your example value were "JJ", 60 min, how would a parser know that the comma is part of the field? The data format can't handle embedded commas reliably, so if you need that, your best bet is to change the source so that it generates RFC-compliant CSV.
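
To make "RFC compliant" concrete: RFC 4180 escapes a quote inside a quoted field by doubling it. The following is a minimal sketch of what the producer side could do, assuming you control the code that writes the file. The class name Rfc4180Field and its escape method are hypothetical names for illustration.

```java
// Hypothetical helper, for illustration only.
final class Rfc4180Field {

    // Quotes a field only when it contains a comma, a quote, or a line break,
    // and doubles every embedded quote, as RFC 4180 describes.
    static String escape(String value) {
        boolean needsQuotes = value.contains(",") || value.contains("\"")
                || value.contains("\n") || value.contains("\r");
        if (!needsQuotes) {
            return value;
        }
        return "\"" + value.replace("\"", "\"\"") + "\"";
    }

    public static void main(String[] args) {
        // prints """JJ"" (60 min)"
        System.out.println(escape("\"JJ\" (60 min)"));
    }
}
```

A field written this way is unambiguous, and a standard parser should read it back as "JJ" (60 min).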

Otherwise, it looks like the data source is simply surrounding non-numeric fields with quotes and separating the fields with commas, so the parser needs to do the reverse. You should probably just treat the data as comma-delimited and strip the leading and trailing quotes yourself with removeStart / removeEnd (StringUtils in Commons Lang).
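
A rough sketch of that approach follows. It uses plain JDK string methods in place of removeStart / removeEnd so that it runs without Commons Lang, and the class name NaiveQuoteStrip is hypothetical. It assumes no field contains a comma, which, as noted above, this data format cannot guarantee.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only.
final class NaiveQuoteStrip {

    // Removes one leading and one trailing quote when both are present.
    static String stripOuterQuotes(String field) {
        if (field.length() >= 2 && field.startsWith("\"") && field.endsWith("\"")) {
            return field.substring(1, field.length() - 1);
        }
        return field;
    }

    // Splits on every comma (limit -1 keeps trailing empty fields) and strips
    // the outer quotes from each piece. Commas inside fields are NOT handled.
    static List<String> parse(String line) {
        List<String> out = new ArrayList<>();
        for (String raw : line.split(",", -1)) {
            out.add(stripOuterQuotes(raw));
        }
        return out;
    }

    public static void main(String[] args) {
        String line = "116,6,2,29 Sep 10,\"\"JJ\" (60 min)\","
                + "\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"";
        System.out.println(parse(line));
    }
}
```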



You can use CSVFormat.withQuote(null), or skip the parser entirely and just use String.split(",").

+1




I think having both quotes AND spaces in the same token is what confuses the parser. Try the following:

CSVFormat csvFormat = CSVFormat.DEFAULT.withQuote('"').withQuote(' ');

      

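This should fix it. Note that the second withQuote call replaces the first, so the quote character effectively becomes a space and the double quotes are kept as literal text in each field.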


Example

For your input line:

String line = "116,6,2,29 Sep 10,\"\"JJ\" (60 min)\",\"http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj\"";

      

Output (no exception is thrown):

[116, 6, 2, 29 Sep 10, ""JJ" (60 min)", "http://www.tvmaze.com/episodes/4855/criminal-minds-6x02-jj"]

      

0

