Posted to user@flink.apache.org by Malte Schwarzer <ms...@mieo.de> on 2014/12/05 16:16:56 UTC

Quotes in fields of CsvInputFormat

Hi,

I'm trying to import a CSV file, but the parser seems to have problems with
quotes at the beginning of a field. Is there a way to set or disable
enclosures for the CSV input?

This is my code:

DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
                .fieldDelimiter('|')
                .types(String.class, String.class);

CSV:

A|ggg
B|"hhh" xx
C|xxx

As a result, I'm receiving a ParseException for line B:

org.apache.flink.api.common.io.ParseException: Line could not be parsed:
'B|"hhh" xx'


Thanks,
Malte



Re: Quotes in fields of CsvInputFormat

Posted by Fabian Hueske <fh...@apache.org>.
I think that's a fair assumption to make.

I'll open a JIRA issue to make quoted string parsing optional and the
quote character configurable.

2014-12-09 18:51 GMT+01:00 Max Michels <ma...@data-artisans.com>:

> That sounds like a good idea. Just like setDelimeter("|"), one should be
> able to do a setParseDoubleQuotes(false) to disable the special handling of
> double quotes.
>
> You're right, Fabian, the current implementation treats all String fields
> alike. Maybe we can expect the user to provide a consistently formatted
> input file (i.e. with or without the use of double quotes as identifiers)?

Re: Quotes in fields of CsvInputFormat

Posted by Max Michels <ma...@data-artisans.com>.
That sounds like a good idea. Just like setDelimeter("|"), one should be
able to do a setParseDoubleQuotes(false) to disable the special handling of
double quotes.

You're right, Fabian, the current implementation treats all String fields
alike. Maybe we can expect the user to provide a consistently formatted
input file (i.e. with or without the use of double quotes as identifiers)?
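
A minimal sketch of what such a switch could look like, written as a standalone toy splitter rather than as Flink code. The class QuoteSwitchSketch, the method parseLine, and the flag parseDoubleQuotes are hypothetical names chosen for illustration (the flag is named after the setter proposed above). The sketch assumes that delimiters never occur inside quotes and that quotes are never escaped.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; this is not Flink's CsvInputFormat.
final class QuoteSwitchSketch {

    // Splits a line on the delimiter. When parseDoubleQuotes is true, a field whose
    // first non-whitespace character is '"' must consist of exactly one quoted string.
    // Simplification: delimiters inside quotes and escaped quotes are not handled.
    static List<String> parseLine(String line, char delimiter, boolean parseDoubleQuotes) {
        List<String> fields = new ArrayList<>();
        int pos = 0;
        while (pos <= line.length()) {
            int end = line.indexOf(delimiter, pos);
            if (end < 0) {
                end = line.length();
            }
            String raw = line.substring(pos, end);
            fields.add(parseDoubleQuotes ? unquote(raw) : raw);
            pos = end + 1;
        }
        return fields;
    }

    // Strips the enclosing quotes of a quoted field, or fails on trailing data.
    private static String unquote(String raw) {
        int start = 0;
        while (start < raw.length() && Character.isWhitespace(raw.charAt(start))) {
            start++;
        }
        if (start == raw.length() || raw.charAt(start) != '"') {
            return raw; // not a quoted field, keep it unchanged
        }
        int close = raw.indexOf('"', start + 1);
        if (close < 0) {
            throw new IllegalArgumentException("Unterminated quoted string: " + raw);
        }
        if (close != raw.length() - 1) {
            throw new IllegalArgumentException("Invalid trailing data after quoted string: " + raw);
        }
        return raw.substring(start + 1, close);
    }
}
```

With the flag set to false, the line B|"hhh" xx yields the fields B and "hhh" xx. With the flag set to true, the same line is rejected because of the trailing xx.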

On Tue, Dec 9, 2014 at 2:32 PM, Fabian Hueske <fh...@apache.org> wrote:

> With the current implementation, quoted string parsing kicks in if the
> first non-whitespace character of a field is a double quote (just as in
> Malte's case). I think this behaviour can be quite unexpected for users.
> Wouldn't it be better to make the behaviour of the String parsing more
> explicit, i.e., add a switch to enable or disable quoted string parsing?
> With the current implementation, the configuration would affect all String
> fields in a file, though...
>
> Cheers, Fabian

Re: Quotes in fields of CsvInputFormat

Posted by Fabian Hueske <fh...@apache.org>.
With the current implementation, quoted string parsing kicks in if the
first non-whitespace character of a field is a double quote (just as in
Malte's case). I think this behaviour can be quite unexpected for users.
Wouldn't it be better to make the behaviour of the String parsing more
explicit, i.e., add a switch to enable or disable quoted string parsing?
With the current implementation, the configuration would affect all String
fields in a file, though...

Cheers, Fabian
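
Because such a setting would apply to every String field in a file, it may help to check beforehand which lines are affected. The following is a rough sketch under the rule described above: a field counts as problematic if its first non-whitespace character is a double quote and the closing quote is missing or is not the last character. QuoteScanSketch, wouldBeRejected, and suspiciousLines are hypothetical names, and the check is a simplified stand-in, not Flink's actual parser logic.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative diagnostic only; it approximates the rule from this thread.
final class QuoteScanSketch {

    // True if the field starts (after whitespace) with '"' but is not exactly
    // one quoted string, i.e. the closing quote is missing or followed by data.
    static boolean wouldBeRejected(String field) {
        int start = 0;
        while (start < field.length() && Character.isWhitespace(field.charAt(start))) {
            start++;
        }
        if (start == field.length() || field.charAt(start) != '"') {
            return false;
        }
        int close = field.indexOf('"', start + 1);
        return close < 0 || close != field.length() - 1;
    }

    // Returns the 1-based numbers of all lines with at least one such field.
    static List<Integer> suspiciousLines(List<String> lines, char delimiter) {
        // Pattern.quote is needed because '|' is a regex metacharacter.
        String regex = Pattern.quote(String.valueOf(delimiter));
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < lines.size(); i++) {
            for (String field : lines.get(i).split(regex, -1)) {
                if (wouldBeRejected(field)) {
                    result.add(i + 1);
                    break;
                }
            }
        }
        return result;
    }
}
```

For the three sample lines from this thread, only line 2 (B|"hhh" xx) is reported.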

2014-12-09 12:17 GMT+01:00 Max Michels <ma...@data-artisans.com>:

> Hi Malte,
>
> Typically, double quotes are used to identify strings and thus are not
> interpreted literally. Any data in a field after a double quoted string is
> regarded as invalid trailing data.
>
> You could replace double quotes with single quotes:
>
> A|ggg
> B|'hhh' xx
> C|xxx
>
> This results in the expected >'hhh' xx< for the second line.
>
> Best regards,
> Max

Re: Quotes in fields of CsvInputFormat

Posted by Max Michels <ma...@data-artisans.com>.
Hi Malte,

Typically, double quotes are used to identify strings and thus are not
interpreted literally. Any data in a field after a double quoted string is
regarded as invalid trailing data.

You could replace double quotes with single quotes:

A|ggg
B|'hhh' xx
C|xxx

This results in the expected >'hhh' xx< for the second line.

Best regards,
Max
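
The replacement suggested above could be done in a small preprocessing step before the file is handed to the CSV reader. This is only a sketch: QuoteRewriteSketch, toSingleQuotes, and rewrite are hypothetical names, UTF-8 is an assumed encoding, and the step changes the data, so it only fits when single quotes are acceptable in the result.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative preprocessing step; not part of Flink.
final class QuoteRewriteSketch {

    // Replaces every double quote in the line with a single quote.
    static String toSingleQuotes(String line) {
        return line.replace('"', '\'');
    }

    // Reads the input file, rewrites each line, and writes the result to the output file.
    // Assumes the file is UTF-8 encoded and small enough to fit in memory.
    static void rewrite(Path in, Path out) throws IOException {
        List<String> rewritten = Files.readAllLines(in, StandardCharsets.UTF_8).stream()
                .map(QuoteRewriteSketch::toSingleQuotes)
                .collect(Collectors.toList());
        Files.write(out, rewritten, StandardCharsets.UTF_8);
    }
}
```

Applied to the line B|"hhh" xx, toSingleQuotes returns B|'hhh' xx.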

On Fri, Dec 5, 2014 at 4:44 PM, Malte Schwarzer <ms...@mieo.de> wrote:

> Hi Stephan,
>
> The result should be >"hhh" xx< as the field value. Enclosures should be
> disabled, but there seems to be no method to do that.
>
>
> Malte

Re: Quotes in fields of CsvInputFormat

Posted by Malte Schwarzer <ms...@mieo.de>.
Hi Stephan,

The result should be >"hhh" xx< as the field value. Enclosures should be
disabled, but there seems to be no method to do that.


Malte
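
Until enclosures can be disabled, one possible workaround is to split each line by hand so that quotes are never interpreted. The sketch below shows only the splitting step in plain Java. PlainSplitSketch and splitTwoFields are hypothetical names, and wiring this into a job (for example by reading the file as text lines and mapping each line to a Tuple2) is an assumption that is not shown here.

```java
import java.util.regex.Pattern;

// Illustrative workaround; quotes are treated as ordinary characters.
final class PlainSplitSketch {

    // Splits a line into exactly two fields on the delimiter, without any quote handling.
    static String[] splitTwoFields(String line, char delimiter) {
        // Pattern.quote is needed because '|' is a regex metacharacter;
        // the limit -1 keeps trailing empty fields.
        String[] parts = line.split(Pattern.quote(String.valueOf(delimiter)), -1);
        if (parts.length != 2) {
            throw new IllegalArgumentException(
                    "Expected 2 fields but found " + parts.length + ": " + line);
        }
        return parts;
    }
}
```

For the line B|"hhh" xx, the second element of the returned array is "hhh" xx, including the double quotes.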

From: Stephan Ewen <se...@apache.org>
Reply-To: <us...@flink.incubator.apache.org>
Date: Friday, 5 December 2014 16:28
To: <us...@flink.incubator.apache.org>
Subject: Re: Quotes in fields of CsvInputFormat

Hi!

The parser interprets the quotes as quotes for the field. That means the
second field (the string) stops after the "hhh" and the xx is considered
invalid trailing data.

What do you expect as the result of parsing that line?

Stephan



Re: Quotes in fields of CsvInputFormat

Posted by Stephan Ewen <se...@apache.org>.
Hi!

The parser interprets the quotes as quotes for the field. That means the
second field (the string) stops after the "hhh" and the xx is considered
invalid trailing data.

What do you expect as the result of parsing that line?

Stephan


On Fri, Dec 5, 2014 at 4:16 PM, Malte Schwarzer <ms...@mieo.de> wrote:

> Hi,
>
> I'm trying to import a CSV file, but the parser seems to have problems with
> quotes at the beginning of a field. Is there a way to set or disable
> enclosures for the CSV input?
>
> This is my code:
>
> DataSet<Tuple2<String, String>> res = env.readCsvFile(inputCsvFilename)
>                 .fieldDelimiter('|')
>                 .types(String.class, String.class);
>
> CSV:
>
> A|ggg
> B|"hhh" xx
> C|xxx
>
> As a result, I'm receiving a ParseException for line B:
>
> org.apache.flink.api.common.io.ParseException: Line could not be parsed:
> 'B|"hhh" xx'
>
>
> Thanks,
> Malte
>