Posted to dev@flink.apache.org by Martin Neumann <mn...@spotify.com> on 2014/10/15 15:36:31 UTC

CsvInputFormat delimiter fields

Hej,

A lot of my inputs are CSV files, so I use the CsvInputFormat a lot. What I
find kind of odd is that the line delimiter is a String but the field
delimiter is a Character.

*see:*
new CsvInputFormat<Tuple2<String,String>>(new Path(pVecPath), "\n", '\t', String.class, String.class)

Is there a reason for this? I'm currently working with a file that has a
more complex field delimiter so I had to write a mapper to read from
StringInputFormat.

cheers Martin
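
A minimal sketch of the kind of workaround described above, assuming each line has already been read as plain text and only needs to be split on a literal multi-character delimiter. The class name, method name, and the "||" delimiter are hypothetical and not part of Flink; in a job, this logic would sit inside the mapper applied to the text lines.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Flink: splits one line on a literal
// multi-character field delimiter.
class MultiCharFieldSplitter {

    static List<String> splitFields(String line, String delimiter) {
        if (delimiter.isEmpty()) {
            throw new IllegalArgumentException("delimiter must not be empty");
        }
        List<String> fields = new ArrayList<>();
        int start = 0;
        int hit;
        // indexOf matches the delimiter literally, so no regex escaping is needed.
        while ((hit = line.indexOf(delimiter, start)) >= 0) {
            fields.add(line.substring(start, hit));
            start = hit + delimiter.length();
        }
        // Everything after the last delimiter is the final field (possibly empty).
        fields.add(line.substring(start));
        return fields;
    }

    public static void main(String[] args) {
        // "||" is an assumed example delimiter.
        System.out.println(splitFields("alice||42||berlin", "||"));
    }
}
```

The cost of this approach is that every line is first materialized as a String and then split again, which is the extra step a multi-character field delimiter in the CSV reader would avoid.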

Re: CsvInputFormat delimiter fields

Posted by Fabian Hueske <fh...@apache.org>.

I created FLINK-1168 for this feature request.

2014-10-16 11:28 GMT+02:00 Fabian Hueske <fh...@apache.org>:

> I don't think that multi-char field delimiters would cause a performance
> problem. The data needs to be parsed anyway.
> Only in cases where the delimiter has a prefix that occurs often in the
> regular data could it have a major impact.
>
> Fabian

Re: CsvInputFormat delimiter fields

Posted by Fabian Hueske <fh...@apache.org>.

I don't think that multi-char field delimiters would cause a performance
problem. The data needs to be parsed anyway.
Only in cases where the delimiter has a prefix that occurs often in the
regular data could it have a major impact.

Fabian
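
To illustrate the point about delimiter prefixes, here is a rough sketch that counts byte comparisons in a naive scan for a two-byte delimiter. It is not how Flink's parsers are implemented; the class name, the "#|" delimiter, and the sample inputs are assumptions chosen only to show that partial matches add work.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: counts comparisons made by a naive delimiter scan.
class DelimiterScanCost {

    // Scans the whole input for occurrences of delim and returns how many
    // single-byte comparisons were needed.
    static long countComparisons(byte[] data, byte[] delim) {
        long comparisons = 0;
        int i = 0;
        while (i + delim.length <= data.length) {
            int j = 0;
            while (j < delim.length) {
                comparisons++;
                if (data[i + j] != delim[j]) {
                    break; // partial match failed, work so far is wasted
                }
                j++;
            }
            // Skip past a full match, otherwise advance by one byte.
            i += (j == delim.length) ? delim.length : 1;
        }
        return comparisons;
    }

    public static void main(String[] args) {
        byte[] delim = "#|".getBytes(StandardCharsets.US_ASCII);
        // Same length, same single delimiter; the second input contains
        // the delimiter's first byte '#' in several fields.
        byte[] clean = "abcd#|efgh".getBytes(StandardCharsets.US_ASCII);
        byte[] prefixHeavy = "#b#d#|#f#h".getBytes(StandardCharsets.US_ASCII);
        System.out.println("clean input:        " + countComparisons(clean, delim));
        System.out.println("prefix-heavy input: " + countComparisons(prefixHeavy, delim));
    }
}
```

With a single-byte delimiter, every position costs exactly one comparison, so the difference only shows up once the delimiter is longer than one byte and its prefix appears in the data.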

2014-10-15 16:07 GMT+02:00 Martin Neumann <mn...@spotify.com>:

> Would changing it cost performance?
> If not, I think it would be a good change to make, since it would allow us to
> (ab)use the CSV reader to load structured text files (for example by using
> keywords as delimiters).
>
> Being able to put a regular expression there would be even nicer, but maybe
> that should end up in its own InputFormat then.
>
> cheers Martin

Re: CsvInputFormat delimiter fields

Posted by Martin Neumann <mn...@spotify.com>.

Would changing it cost performance?
If not, I think it would be a good change to make, since it would allow us to
(ab)use the CSV reader to load structured text files (for example by using
keywords as delimiters).

Being able to put a regular expression there would be even nicer, but maybe
that should end up in its own InputFormat then.

cheers Martin
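
As a sketch of the keyword and regular-expression idea, the following splits a line of structured text using java.util.regex. The class name, the keywords FROM and TO, and the sample line are made up for illustration; this is plain Java on a single String, not an existing Flink InputFormat.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: treats keywords as field delimiters via a regex.
class RegexFieldSplitter {

    // Assumed example: " FROM " and " TO " separate the fields.
    private static final Pattern KEYWORDS = Pattern.compile(" (?:FROM|TO) ");

    static List<String> splitFields(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        return Arrays.asList(KEYWORDS.split(line, -1));
    }

    public static void main(String[] args) {
        System.out.println(splitFields("flight42 FROM Stockholm TO Berlin"));
    }
}
```

A regex gives more freedom than a fixed multi-character delimiter, but it works on decoded Strings rather than raw bytes, which is one reason it may fit better in a separate InputFormat.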

On Wed, Oct 15, 2014 at 3:47 PM, Stephan Ewen <se...@apache.org> wrote:

> Hi!
>
> The reason is the way the CSV parsers currently work. They are pushed into
> the byte stream parsing and are restricted to recognizing single-char
> delimiters. It is possible to change that, but it would be a bit of work.
>
> Stephan

Re: CsvInputFormat delimiter fields

Posted by Stephan Ewen <se...@apache.org>.

Hi!

The reason is the way the CSV parsers currently work. They are pushed into
the byte stream parsing and are restricted to recognizing single-char
delimiters. It is possible to change that, but it would be a bit of work.

Stephan
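
For readers unfamiliar with this style of parsing, here is a simplified sketch of splitting fields directly on a byte buffer with a single delimiter byte. It is an illustration under assumptions, not Flink's actual parser code; the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of byte-level field splitting with a one-byte delimiter.
class SingleByteFieldScanner {

    // Splits bytes[offset, offset + length) on a single delimiter byte.
    static List<String> parseFields(byte[] bytes, int offset, int length, byte delimiter) {
        List<String> fields = new ArrayList<>();
        int fieldStart = offset;
        int end = offset + length;
        for (int i = offset; i < end; i++) {
            // One comparison per byte: this is what a single-char delimiter buys.
            if (bytes[i] == delimiter) {
                fields.add(new String(bytes, fieldStart, i - fieldStart, StandardCharsets.UTF_8));
                fieldStart = i + 1;
            }
        }
        // The remaining bytes form the last field (possibly empty).
        fields.add(new String(bytes, fieldStart, end - fieldStart, StandardCharsets.UTF_8));
        return fields;
    }

    public static void main(String[] args) {
        byte[] line = "id-1\tvalue-1".getBytes(StandardCharsets.UTF_8);
        System.out.println(parseFields(line, 0, line.length, (byte) '\t'));
    }
}
```

Supporting a multi-character delimiter in such a loop means tracking partial matches across bytes, which is the extra work mentioned above.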

On Wed, Oct 15, 2014 at 3:36 PM, Martin Neumann <mn...@spotify.com>
wrote:

> Hej,
>
> A lot of my inputs are CSV files, so I use the CsvInputFormat a lot. What I
> find kind of odd is that the line delimiter is a String but the field
> delimiter is a Character.
>
> *see:*
> new CsvInputFormat<Tuple2<String,String>>(new Path(pVecPath), "\n", '\t', String.class, String.class)
>
> Is there a reason for this? I'm currently working with a file that has a
> more complex field delimiter so I had to write a mapper to read from
> StringInputFormat.
>
> cheers Martin
>