Posted to issues@flink.apache.org by "Mikhail Lipkovich (JIRA)" <ji...@apache.org> on 2017/08/31 12:47:00 UTC

[jira] [Commented] (FLINK-6016) Newlines should be valid in quoted strings in CSV

    [ https://issues.apache.org/jira/browse/FLINK-6016?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16148931#comment-16148931 ] 

Mikhail Lipkovich commented on FLINK-6016:
------------------------------------------

Hi Luke,
I've just started diving into Flink, so my comment may be irrelevant and/or obvious. As I understand it, your request is quite challenging to implement. Imagine a situation where 
{code}
"3
{code}
appears at the end of the current split 'A', while 
{code}
4",5
{code}
appears at the beginning of the next split 'B'. 
In this situation, to process the whole record while processing split 'A', we have to request split 'B' and read it up to the end of its first line as well. At the same time, while processing split 'B', we have to make sure that we skip the first row containing any quote characters (since that row should be read during the processing of split 'A'). It seems to me that this logic requires significant changes.
One can also imagine extreme situations in which a quoted record is larger than the split size. 
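
To make the boundary handling concrete, here is a minimal sketch in plain Java. It does not use Flink's input format classes; the class name QuotedSplitSketch and the method readSplit are hypothetical, and the whole file is modelled as an in-memory String with character offsets as split boundaries. It assumes RFC 4180 style quoting (a literal quote inside a field is written as two quotes) and '\n' as the record delimiter. Note that it finds out whether a split starts inside a quoted field by counting quotes from the start of the file, which is exactly the kind of extra work that makes this expensive in a real split-based reader.

```java
import java.util.ArrayList;
import java.util.List;

class QuotedSplitSketch {

    /** Returns the records whose first character lies in [start, end). */
    static List<String> readSplit(String data, int start, int end) {
        // Assumption: quote parity before 'start' tells us whether the split
        // begins inside a quoted field (doubled quotes toggle twice, so they cancel out).
        boolean inQuotes = false;
        for (int i = 0; i < start; i++) {
            if (data.charAt(i) == '"') {
                inQuotes = !inQuotes;
            }
        }

        int pos = start;
        boolean atRecordStart =
                start == 0 || (!inQuotes && data.charAt(start - 1) == '\n');

        // A record that began in the previous split belongs to that split: skip its tail.
        if (!atRecordStart) {
            while (pos < data.length()) {
                char c = data.charAt(pos++);
                if (c == '"') {
                    inQuotes = !inQuotes;
                } else if (c == '\n' && !inQuotes) {
                    break;
                }
            }
        }

        // Read every record that starts before 'end', running past 'end' if needed.
        List<String> records = new ArrayList<>();
        while (pos < end && pos < data.length()) {
            StringBuilder record = new StringBuilder();
            while (pos < data.length()) {
                char c = data.charAt(pos++);
                if (c == '"') {
                    inQuotes = !inQuotes;
                } else if (c == '\n' && !inQuotes) {
                    break; // unquoted newline ends the record
                }
                record.append(c);
            }
            records.add(record.toString());
        }
        return records;
    }
}
```

With the input "1,2\n\"3\n4\",5\n6,7\n" cut at offset 7 (between the quoted newline and the 4), the reader for split 'A' runs past its own end to finish the quoted record, and the reader for split 'B' skips the rest of that record before emitting anything.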

As a much easier workaround, we could force users to process such CSV files in a single split. What do you think?
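
For comparison, a rough sketch of what single-split reading could look like, again in plain Java and not tied to Flink's API. The names SingleSplitCsvSketch and readRecords are hypothetical. It assumes the whole input is consumed sequentially by one reader, '\n' is the record delimiter, and quotes follow RFC 4180. Because the reader always starts at the beginning of the file, it only needs to track whether it is currently inside a quoted field.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class SingleSplitCsvSketch {

    /** Splits the input into records, keeping newlines that occur inside quotes. */
    static List<String> readRecords(Reader in) {
        List<String> records = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        try {
            int ch;
            while ((ch = in.read()) != -1) {
                char c = (char) ch;
                if (c == '"') {
                    inQuotes = !inQuotes; // doubled quotes toggle twice and cancel out
                }
                if (c == '\n' && !inQuotes) {
                    records.add(current.toString()); // unquoted newline ends the record
                    current.setLength(0);
                } else {
                    current.append(c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        // Emit a trailing record that is not terminated by a newline.
        if (current.length() > 0) {
            records.add(current.toString());
        }
        return records;
    }
}
```

The trade-off is that such a file can no longer be read in parallel, so this only seems reasonable for inputs that fit comfortably on one task.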


> Newlines should be valid in quoted strings in CSV
> -------------------------------------------------
>
>                 Key: FLINK-6016
>                 URL: https://issues.apache.org/jira/browse/FLINK-6016
>             Project: Flink
>          Issue Type: Bug
>          Components: Batch Connectors and Input/Output Formats
>    Affects Versions: 1.2.0
>            Reporter: Luke Hutchison
>
> The RFC for the CSV format specifies that newlines are valid in quoted strings in CSV:
> https://tools.ietf.org/html/rfc4180
> However, when parsing a CSV file with Flink containing a newline, such as:
> {noformat}
> "3
> 4",5
> {noformat}
> you get this exception:
> {noformat}
> Line could not be parsed: '"3'
> ParserError UNTERMINATED_QUOTED_STRING 
> Expect field types: class java.lang.String, class java.lang.String 
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)