Posted to issues@commons.apache.org by "Damjan Jovanovic (Jira)" <ji...@apache.org> on 2022/12/27 01:55:00 UTC

[jira] [Commented] (CSV-141) Handle malformed CSV files

    [ https://issues.apache.org/jira/browse/CSV-141?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17652083#comment-17652083 ] 

Damjan Jovanovic commented on CSV-141:
--------------------------------------

[~nhatminh12369]  "But if I open the file on Microsoft Excel, or using OpenCSV to parse lines 3 and 4, they work fine."

Yes, in Apache OpenOffice we found a number of CSV files that open in Excel but not in OpenOffice, and eventually I patched OpenOffice to open them too. We should get commons-csv to do the same, at least when using CSVFormat.EXCEL.

See in particular the discussion at [https://bz.apache.org/ooo/show_bug.cgi?id=126805]

What Excel allows, and what is very surprising if you don't know about it, is this: any extra text after the closing quote and before the field separator is accepted and appended verbatim to the field contents. Any quoting in this extra text is ignored.

For example:

"a" and b => a and b

"a" "b" => a "b"

"a" "b => a "b

That's the basic issue in lines 3-4 in the sample posted here.
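
To make the rule concrete, here is a minimal sketch in plain Java of how a single line could be split this way. It is not commons-csv code and not the OpenOffice patch: the class name ExcelLenientSketch and the method parseLine are hypothetical, the delimiter is assumed to be a comma, embedded newlines are not handled, and treating a missing closing quote as "field runs to the end of the line" is my assumption for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of commons-csv or OpenOffice.
class ExcelLenientSketch {

    // Splits one line on ',' using '"' as the quote character.
    // Text between a closing quote and the next delimiter is appended
    // verbatim to the field, and any quotes in that text are kept as-is.
    static List<String> parseLine(String line) {
        List<String> fields = new ArrayList<>();
        final int n = line.length();
        int i = 0;
        while (true) {
            StringBuilder field = new StringBuilder();
            if (i < n && line.charAt(i) == '"') {
                i++; // skip the opening quote
                while (i < n) {
                    char c = line.charAt(i);
                    if (c == '"' && i + 1 < n && line.charAt(i + 1) == '"') {
                        field.append('"'); // doubled quote is a literal quote
                        i += 2;
                    } else if (c == '"') {
                        i++; // closing quote
                        break;
                    } else {
                        field.append(c);
                        i++;
                    }
                }
                // Assumption: if no closing quote was found, the field simply
                // ends at the end of the line instead of raising an error.
            }
            // Unquoted text, or trailing text after a closing quote:
            // copy everything up to the next delimiter without interpretation.
            while (i < n && line.charAt(i) != ',') {
                field.append(line.charAt(i));
                i++;
            }
            fields.add(field.toString());
            if (i >= n) {
                break;
            }
            i++; // skip the delimiter
        }
        return fields;
    }

    public static void main(String[] args) {
        System.out.println(parseLine("\"a\" and b"));
        System.out.println(parseLine("\"a\" \"b\""));
        System.out.println(parseLine("\"a\" \"b"));
    }
}
```

With this sketch the three inputs above come out as the single fields shown on the right-hand side of the examples. The real parser would of course need to do this inside the lexer and only when the format opts in.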

> Handle malformed CSV files
> --------------------------
>
>                 Key: CSV-141
>                 URL: https://issues.apache.org/jira/browse/CSV-141
>             Project: Commons CSV
>          Issue Type: Wish
>          Components: Parser
>    Affects Versions: 1.0
>            Reporter: Nguyen Minh
>            Priority: Minor
>             Fix For: 1.x
>
>
> My Java application has to handle thousands of CSV files uploaded by client phones every day. Some of these CSV files are malformed, and I'm not sure why.
> Here is my sample CSV. Microsoft Excel parses it correctly, but neither Commons CSV nor OpenCSV can parse it. OpenCSV can't parse line 2 (due to the '\' character), and Commons CSV fails on lines 3 and 4:
> "1414770317901","android.widget.EditText","pass sem1 _84*|*","0","pass sem1 _8"
> "1414770318470","android.widget.EditText","pass sem1 _84:*|*","0","pass sem1 _84:\"
> "1414770318327","android.widget.EditText","pass sem1 
> "1414770318628","android.widget.EditText","pass sem1 _84*|*","0","pass sem1
> Line 3: java.io.IOException: (line 5) invalid char between encapsulated token and delimiter
> 	at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:398)
> 	at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:407)
> Line 4: java.io.IOException: (startline 5) EOF reached before encapsulated token finished
> 	at org.apache.commons.csv.CSVParser$1.getNextRecord(CSVParser.java:398)
> 	at org.apache.commons.csv.CSVParser$1.hasNext(CSVParser.java:407)


