Posted to user@flink.apache.org by David Dreyfus <dd...@gmail.com> on 2017/10/27 02:12:22 UTC

Data sources and slices

Hello,

If I am on a cluster with 2 task managers with 64 CPUs each, I can configure
128 slots in accordance with the documentation. If I set parallelism to 128
and read a 64 MB file (one data source with a single file), will Flink really
create 500 KB slices, i.e. 128 of them? Or will it check the default block
size of the host it is reading from and allocate only as many slices as there
are blocks?
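
For concreteness, the kind of job I have in mind looks roughly like this
(just a sketch; the path and file name are placeholders):

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class ReadSingleFile {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();
            env.setParallelism(128);
            // one data source reading a single ~64 MB file
            DataSet<String> lines = env.readTextFile("hdfs:///data/input-64mb.txt");
            System.out.println(lines.count());
        }
    }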

If the file is on S3:
1. Does a single thread copy it to local disk and then have 128 slices
consume it?
2. Does a single thread read the file from S3 and consume it, treating it as
one slice?
3. Does Flink talk to S3, do a multi-part read to local storage, and then
read from local storage in 128 slices?

If a data source has a large number of files, does each slot read one file at
a time with a single thread, or does each slot read one part of each file
such that 128 slots consume each file one at a time?

More generally, does Flink try to allocate files to slots such that each
slot reads the same volume, with reads that are as sequential as possible?

How does it distinguish between reading from local HDFS and S3, given that
they might have vastly different performance characteristics?

Thanks,
David





Re: Data sources and slices

Posted by Till Rohrmann <tr...@apache.org>.
Hi David,

in the case of a streaming program with a degree of parallelism of 128,
Flink would create 128 splits, one for each parallel subtask. The rule is
that a split has the size of one block, unless that would not produce enough
splits for every task to receive at least one.
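
As a back-of-the-envelope illustration of that rule (this is only a sketch
of the described behaviour, not Flink's actual code, and the block size is
an assumption):

    public class SplitSizeSketch {
        public static void main(String[] args) {
            long totalLength  = 64L * 1024 * 1024;   // one 64 MB file, as in your example
            int  minNumSplits = 128;                 // degree of parallelism
            long blockSize    = 128L * 1024 * 1024;  // assumed HDFS block size

            // one split per block, unless that yields fewer splits than parallel tasks
            long maxSplitSize = totalLength / minNumSplits;          // 512 KB here
            long splitSize    = Math.min(blockSize, Math.max(1, maxSplitSize));
            long numSplits    = (totalLength + splitSize - 1) / splitSize;

            System.out.println("split size = " + splitSize + ", #splits = " + numSplits);
            // prints: split size = 524288, #splits = 128
        }
    }

So for your 64 MB file you would end up with 128 splits of roughly 500 KB
each, rather than one split per block.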

As far as I know, there is no special handling for files located in S3.
Every reading subtask will open the file via the loaded FileSystem.
Depending on how clever this FileSystem implementation is, it may or may not
save some downloading work by buffering the file.
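
To make the FileSystem point concrete, here is a minimal sketch: which
implementation is used is decided by the URI scheme. The bucket and paths
are placeholders, and for s3:// the S3 filesystem has to be on the classpath
and configured with credentials:

    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class SchemeExample {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // same call, different FileSystem picked from the URI scheme
            DataSet<String> fromHdfs = env.readTextFile("hdfs:///data/events.log");
            DataSet<String> fromS3   = env.readTextFile("s3://my-bucket/events.log");

            System.out.println(fromHdfs.count() + fromS3.count());
        }
    }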

If there are multiple files to read, then Flink tries to assign block-sized
file splits to the individual tasks. This means it is not guaranteed that a
single file is read entirely by the same reader task.

Flink does not try to assign as many splits of a single file as possible to
the same task. Instead, it spreads the work across as many tasks as
possible, with respect to the block size of the file.

I hope I could clarify some of your questions.

Cheers,
Till
