Posted to user@flink.apache.org by Alex Reid <al...@gmail.com> on 2016/11/22 18:42:18 UTC

Reading files from an S3 folder

Hi, I've been playing around with Apache Flink to process some data,
and I'm starting out with the batch DataSet API.

To start, I read in some data from files in an S3 folder:

DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");


Within the folder there are 20 gzipped files, and I run the job with
20 nodes/tasks (so parallelism 20). It looks like each node is
reading in ALL the files (the whole folder), but what I really want
is for each node/task to read one file and process only the data from
that file.
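
Roughly, the job looks like this (the explicit setParallelism call
and the count() sink here are just illustrative, not necessarily how
the real job is set up):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class ReadS3Folder {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // 20 gzipped files, 20 parallel tasks: gzip is not splittable,
        // so each file should become a single input split.
        env.setParallelism(20);

        // Reads all files in the folder; .gz files are decompressed on the fly.
        DataSet<String> records = env.readTextFile("s3://my-s3-bucket/some-folder/");

        // count() triggers execution of the job.
        System.out.println("total records: " + records.count());
    }
}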

Is this expected behavior? Am I supposed to be doing something
different here to get the results I want?

Thanks.

Re: Reading files from an S3 folder

Posted by Alex Reid <al...@gmail.com>.
Each file is ~1.8G compressed (and about 15G uncompressed, so a little over
300G total for all the files).

In the Web Client UI, when I look at the plan and click on the
subtask that reads in the files, I see a line for each host, and the
Bytes Sent for each host is around 350G.

The job takes longer than I'd expect, so I'm just trying to track
down where the time is being spent and whether it's doing what I
expect.
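
One thing I could try is logging how many records each parallel
subtask actually processes, e.g. with an identity map like this (just
an illustrative sketch, not something I've run):

import org.apache.flink.api.common.functions.RichMapFunction;

// Identity map that counts records per parallel subtask and prints
// the total when the task finishes.
DataSet<String> counted = records.map(new RichMapFunction<String, String>() {
    private long count;

    @Override
    public String map(String value) {
        count++;
        return value;
    }

    @Override
    public void close() {
        System.out.println("subtask " + getRuntimeContext().getIndexOfThisSubtask()
            + " saw " + count + " records");
    }
});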


Re: Reading files from an S3 folder

Posted by Robert Metzger <rm...@apache.org>.
Hi,
This is not the expected behavior.
Each parallel instance should read only one file. The files should not be
read multiple times by the different parallel instances.
How did you check / find out that each node is reading all the data?
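
One way to see which file each parallel instance actually reads is to
log the input split when it is opened, e.g. with a small
TextInputFormat subclass (an untested sketch; the class name and log
message are just illustrative):

import java.io.IOException;

import org.apache.flink.api.java.io.TextInputFormat;
import org.apache.flink.core.fs.FileInputSplit;
import org.apache.flink.core.fs.Path;

// TextInputFormat that logs the path of every split a parallel instance opens.
public class LoggingTextInputFormat extends TextInputFormat {

    public LoggingTextInputFormat(Path filePath) {
        super(filePath);
    }

    @Override
    public void open(FileInputSplit split) throws IOException {
        System.out.println("opening split: " + split.getPath());
        super.open(split);
    }
}

// used instead of readTextFile:
// DataSet<String> records = env.readFile(
//     new LoggingTextInputFormat(new Path("s3://my-s3-bucket/some-folder/")),
//     "s3://my-s3-bucket/some-folder/");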

Regards,
Robert
