Posted to user@flink.apache.org by Tamara Mendt <ta...@gmail.com> on 2015/08/24 10:40:24 UTC

Read CSV Parse Quoted Strings Function

Hi all,

When using the parseQuotedStrings function of the CsvReader class, I have
noticed that parsing fails if the quote character also appears inside the
string.

For example, if there is a field of this form:

"RT @sportsguy33: New Time Warner slogan: "Time Warner, where we make you
long for the days before cable.""

I think it is not so uncommon to have a case like this and it should not
fail, but rather the string should be parsed as:

RT @sportsguy33: New Time Warner slogan: "Time Warner, where we make you
long for the days before cable."

I have found the part of the Flink code that raises this exception and can
fix it, but I wanted to check first whether others agree that this is an
issue.
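
The failure described above can be reproduced with a minimal sketch. This
is not Flink's code: parse_quoted_field and ParseError are hypothetical
names, and the rule that a closing quote must be followed by the field
delimiter or the end of the line is an assumption about how a strict
quoted-string parser typically behaves.

```python
class ParseError(ValueError):
    """Raised when a quoted field is malformed."""


def parse_quoted_field(line, start=0, quote='"', delimiter=","):
    """Parse one quoted field that begins at line[start].

    Returns (value, next_index). The first quote character after the
    opening quote is treated as the closing quote, so it must be followed
    by the field delimiter or by the end of the line.
    """
    if start >= len(line) or line[start] != quote:
        raise ParseError("field does not start with a quote character")
    end = line.find(quote, start + 1)
    if end == -1:
        raise ParseError("unterminated quoted field")
    after = end + 1
    if after < len(line) and line[after] != delimiter:
        raise ParseError("unexpected character after closing quote")
    # Skip the delimiter if there is one; otherwise stop at end of line.
    return line[start + 1:end], min(after + 1, len(line))


# The field from the example above: the inner quote is taken as the
# closing quote, and the character after it is not a delimiter.
tweet = ('"RT @sportsguy33: New Time Warner slogan: "Time Warner, where we '
         'make you long for the days before cable.""')
try:
    parse_quoted_field(tweet)
except ParseError as err:
    print("parse failed:", err)
```

Under these assumptions, a field without embedded quotes parses normally,
while the example field is rejected at the first inner quote.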

Cheers,

Tamara

Re: Read CSV Parse Quoted Strings Function

Posted by Stephan Ewen <se...@apache.org>.
Be aware that the CSV input format extends the delimited input format. The
delimited input format splits records at the line delimiter (such as \n)
without any awareness of quotes, so the line delimiter can never be part of
a quoted field.
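
To illustrate the point, here is a minimal sketch of splitting records at
the line delimiter before any field parsing takes place. It is a simplified
stand-in, not the delimited input format itself; split_records is a
hypothetical helper.

```python
def split_records(data, line_delimiter="\n"):
    """Split raw input into records at every line delimiter.

    The split is not aware of quotes, so a delimiter inside a quoted
    field still ends the record.
    """
    return [record for record in data.split(line_delimiter) if record]


# The second record contains a quoted field with an embedded newline.
data = 'id,text\n1,"first line\nsecond line"\n'
for record in split_records(data):
    print(repr(record))
```

The quoted field is torn into two records before the field parser ever
sees the quotes, so no quote handling in the CSV layer can repair it.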

On Mon, Aug 24, 2015 at 11:55 AM, Tamara Mendt <ta...@gmail.com> wrote:

> Thank you Maximilian,
>
> I agree and would be happy to fix this issue.
>
> Cheers,
>
> Tamara.

Re: Read CSV Parse Quoted Strings Function

Posted by Tamara Mendt <ta...@gmail.com>.
Thank you Maximilian,

I agree and would be happy to fix this issue.

Cheers,

Tamara.

On Mon, Aug 24, 2015 at 11:50 AM, Maximilian Michels <mx...@apache.org> wrote:

> Hi Tamara,
>
> Quoted strings should not contain the quoting character. The way to work
> around this is to escape the quote characters. However, there is currently
> no option to escape quotes, which pretty much forbids any use of quote
> characters within quoted fields. This should be fixed. I opened a JIRA for
> this issue: https://issues.apache.org/jira/browse/FLINK-2567
>
> As for your idea for parsing quoted fields, I personally prefer escaping
> the quoting characters. In quoted fields, Flink allows all characters
> except quotes. If unescaped quotes were allowed as well, we would have to
> read ahead, possibly through the entire file, to know whether we can close
> a quote. Additionally, we would need to keep track of how many quotes are
> opened and closed.
>
> While your proposal is a very convenient feature, I think we should rather
> implement explicit quoting for performance and clarity reasons.
>
> Cheers,
> Max


-- 
Tamara Mendt

Re: Read CSV Parse Quoted Strings Function

Posted by Maximilian Michels <mx...@apache.org>.
Hi Tamara,

Quoted strings should not contain the quoting character. The way to work
around this is to escape the quote characters. However, there is currently
no option to escape quotes, which pretty much forbids any use of quote
characters within quoted fields. This should be fixed. I opened a JIRA for
this issue: https://issues.apache.org/jira/browse/FLINK-2567

As for your idea for parsing quoted fields, I personally prefer escaping
the quoting characters. In quoted fields, Flink allows all characters
except quotes. If unescaped quotes were allowed as well, we would have to
read ahead, possibly through the entire file, to know whether we can close
a quote. Additionally, we would need to keep track of how many quotes are
opened and closed.

While your proposal is a very convenient feature, I think we should rather
implement explicit quoting for performance and clarity reasons.
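
A minimal sketch of the escaping approach follows. The thread does not
specify an escape syntax, so the backslash escape character and the
parse_escaped_field helper are assumptions made for illustration only.

```python
def parse_escaped_field(line, start=0, quote='"', escape="\\"):
    """Parse one quoted field in which quotes are escaped explicitly.

    Returns (value, index just past the closing quote). An escape
    character directly followed by a quote yields a literal quote; any
    other quote closes the field. The scan is a single forward pass.
    """
    if start >= len(line) or line[start] != quote:
        raise ValueError("field does not start with a quote character")
    chars = []
    i = start + 1
    while i < len(line):
        ch = line[i]
        if ch == escape and i + 1 < len(line) and line[i + 1] == quote:
            chars.append(quote)  # escaped quote becomes part of the value
            i += 2
        elif ch == quote:
            return "".join(chars), i + 1  # unescaped quote closes the field
        else:
            chars.append(ch)
            i += 1
    raise ValueError("unterminated quoted field")


line = '"New slogan: \\"Time Warner, where we make you long.\\"",next'
value, next_index = parse_escaped_field(line)
print(value)
```

With explicit escaping, each quote can be classified as soon as it is
read, so the parser needs neither look-ahead nor a count of open quotes.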

Cheers,
Max


