Posted to user@flink.apache.org by madan <ma...@gmail.com> on 2018/10/31 04:50:50 UTC

CsvInputFormat - read header line first

Hi,

When a CSV file is split into multiple parts, we cannot be sure which part
is read first. Is there any way to make sure the first part, which contains
the header, is read first? I need to read the header line first to store a
column name to index mapping, and then use that mapping to process the
records that follow.

I could read the header line from the file before submitting the job to
Flink, but that way we open the file twice. Is there a better way to do
this? Please suggest.

-- 
Thank you.

Re: CsvInputFormat - read header line first

Posted by madan <ma...@gmail.com>.
Hi Ken,

Yep correct.

Thank you.

On Wed, Oct 31, 2018 at 7:24 PM Ken Krugler <kk...@transpac.com>
wrote:

> Hi Madan,
>
> If your source has a parallelism > 1, then when the CSV file is split,
> only one of the operators will get the split with the header row.
>
> So in that case, how would you communicate the column name->index
> information to the other operators?
>
> If you force a parallelism of 1 for the source, then I’m pretty sure
> you’re guaranteed that the file will be processed in order.
>
> — Ken

-- 
Thank you,
Madan.

Re: CsvInputFormat - read header line first

Posted by Ken Krugler <kk...@transpac.com>.
Hi Madan,

If your source has a parallelism > 1, then when the CSV file is split, only one of the operators will get the split with the header row.

So in that case, how would you communicate the column name->index information to the other operators?

If you force a parallelism of 1 for the source, then I’m pretty sure you’re guaranteed that the file will be processed in order.
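
As a rough sketch of the other option raised in this thread (reading only the header line on the client before submitting the job), the column name to index mapping could be built once and handed to every parallel operator as a serializable field. The `HeaderIndex` class and its methods below are hypothetical names made up for illustration, not part of Flink or CsvInputFormat, and the sketch assumes a plain delimiter with no quoted or escaped fields.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Serializable;
import java.util.HashMap;
import java.util.regex.Pattern;

// Hypothetical helper (not a Flink class): maps column names to positions.
// It is Serializable so it can be kept as a field of a user function.
final class HeaderIndex implements Serializable {
    private static final long serialVersionUID = 1L;

    private final String headerLine;
    private final String delimiterRegex;
    private final HashMap<String, Integer> positions = new HashMap<>();

    HeaderIndex(String headerLine, String delimiter) {
        this.headerLine = headerLine;
        // Quote the delimiter so characters like '|' are taken literally.
        this.delimiterRegex = Pattern.quote(delimiter);
        // Assumption: no quoted fields, so a simple split is enough.
        String[] names = headerLine.split(delimiterRegex, -1);
        for (int i = 0; i < names.length; i++) {
            positions.put(names[i].trim(), i);
        }
    }

    // Reads only the first line; the rest of the file is left untouched.
    static HeaderIndex fromReader(BufferedReader reader, String delimiter) throws IOException {
        String first = reader.readLine();
        if (first == null) {
            throw new IOException("empty input, no header line");
        }
        return new HeaderIndex(first, delimiter);
    }

    // Zero-based position of a column, or an exception if it is unknown.
    int indexOf(String column) {
        Integer index = positions.get(column);
        if (index == null) {
            throw new IllegalArgumentException("unknown column: " + column);
        }
        return index;
    }

    // True for the header row itself, so whichever operator receives
    // the split containing it can skip that line.
    boolean isHeader(String line) {
        return headerLine.equals(line);
    }

    // Extracts one field from a data record by column name.
    String field(String record, String column) {
        String[] values = record.split(delimiterRegex, -1);
        int index = indexOf(column);
        if (index >= values.length) {
            throw new IllegalArgumentException("record has no column " + index);
        }
        return values[index].trim();
    }
}
```

The idea is that the client calls `HeaderIndex.fromReader(...)` once, passes the result into the constructor of the function that processes records, and each parallel instance then skips any line for which `isHeader` is true. The file is opened twice, but the first open reads a single line only.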

— Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://www.scaleunlimited.com
Custom big data solutions & training
Flink, Solr, Hadoop, Cascading & Cassandra