Posted to dev@flink.apache.org by Luis Alves <lm...@gmail.com> on 2017/08/13 16:59:27 UTC

Number of consumers per IntermediateResult

Hi,

Can someone validate the following regarding the ExecutionGraph:

1. Each IntermediateResult can only be consumed by a single ExecutionJobVertex, i.e. if two ExecutionJobVertices consume the same tuples (same “stream”) produced by the same ExecutionJobVertex, then the producer will have two IntermediateResults, one per consumer.

2. In other words: if an ExecutionJobVertex performs a map operation and has two consumers (different ExecutionJobVertices), it will produce two datasets/IntermediateResults (both with the same “content”, but different consumers).

3. Each ExecutionVertex will then have the same number of IntermediateResultPartitions as the number of ExecutionJobVertices that consume the datasets generated by the respective ExecutionJobVertex.

Thus, at runtime:

4. ResultPartition maps to an IntermediateResultPartition (as documented in the Javadoc). Thus, 3. is also valid for ResultPartitions.

5. ResultSubPartition maps to an ExecutionEdge (since it contains the information on how to send the partition to the actual consumer Task).
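To make the structure concrete, here is a toy sketch of the one-consumer-per-result rule described above. All names are hypothetical; this is not Flink's actual code, just a minimal model of the relationships (producer vertex, one result per consumer, one partition per result per subtask):

```java
import java.util.List;
import java.util.stream.Collectors;

// Toy model (hypothetical names, not Flink's classes) of the
// one-consumer-per-IntermediateResult rule.
class OneResultPerConsumer {

    // A producer vertex creates one result per consuming downstream vertex:
    // same content, different consumer.
    static List<String> resultsFor(String producer, List<String> consumers) {
        return consumers.stream()
                .map(c -> producer + "->" + c)
                .collect(Collectors.toList());
    }

    // Each parallel subtask (ExecutionVertex) holds one partition per result,
    // so the partition count per subtask equals the number of consumers.
    static int partitionsPerSubtask(List<String> results) {
        return results.size();
    }

    public static void main(String[] args) {
        // "map" feeds two downstream vertices, so it produces two results.
        List<String> results = resultsFor("map", List.of("filter", "sink"));
        System.out.println(results);                       // [map->filter, map->sink]
        System.out.println(partitionsPerSubtask(results)); // 2
    }
}
```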

Thanks,

Luís Alves

Re: Number of consumers per IntermediateResult

Posted by Luis Alves <lm...@gmail.com>.
Hey Ufuk,

Yes, I've seen that page (and the "// NOTE: currently we support only one consumer per result!!!" comment in IntermediateResultPartition :) ).

Thanks,

Luís Alves

On Mon, 14 Aug 2017 at 10:57 Ufuk Celebi <uc...@apache.org> wrote:

Hey Luis,

this is correct, yes. Note that these are "only" limitations of the
implementation and there is no fundamental reason to do it like this.
The different characteristics of intermediate results allow us to make
trade-offs here as seen fit.

Furthermore, the type of intermediate result describes when the
consumers can start consuming results. In streaming jobs, all results
are pipelined (consumers consume after the partition has some data
available). In batch jobs, you will find both pipelined and blocking
results (consumers consume only after the partition has all data
available).

Did you also see this Wiki page here?
https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks

– Ufuk

On Sun, Aug 13, 2017 at 6:59 PM, Luis Alves <lmtjalves@gmail.com> wrote:

> Hi,
>
> Can someone validate the following regarding the ExecutionGraph:
>
> 1. Each IntermediateResult can only be consumed by a single ExecutionJobVertex, i.e. if two ExecutionJobVertices consume the same tuples (same “stream”) produced by the same ExecutionJobVertex, then the producer will have two IntermediateResults, one per consumer.
>
> 2. In other words: if an ExecutionJobVertex performs a map operation and has two consumers (different ExecutionJobVertices), it will produce two datasets/IntermediateResults (both with the same “content”, but different consumers).
>
> 3. Each ExecutionVertex will then have the same number of IntermediateResultPartitions as the number of ExecutionJobVertices that consume the datasets generated by the respective ExecutionJobVertex.
>
> Thus, at runtime:
>
> 4. ResultPartition maps to an IntermediateResultPartition (as documented in the Javadoc). Thus, 3. is also valid for ResultPartitions.
>
> 5. ResultSubPartition maps to an ExecutionEdge (since it contains the information on how to send the partition to the actual consumer Task).
>
> Thanks,
>
> Luís Alves

Re: Number of consumers per IntermediateResult

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Luis,

this is correct, yes. Note that these are "only" limitations of the
implementation and there is no fundamental reason to do it like this.
The different characteristics of intermediate results allow us to make
trade-offs here as seen fit.

Furthermore, the type of intermediate result describes when the
consumers can start consuming results. In streaming jobs, all results
are pipelined (consumers consume after the partition has some data
available). In batch jobs, you will find both pipelined and blocking
results (consumers consume only after the partition has all data
available).
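The pipelined vs. blocking distinction can be sketched in miniature. This is a toy illustration, not Flink's runtime code; the helper names are made up. The pipelined case lets the consumer start while the producer is still running (here via a small bounded queue), while the blocking case materializes the whole partition before consumption begins:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Toy illustration (not Flink runtime code) of the two result types:
// pipelined results are consumed while the producer still runs,
// blocking results only after the producer has finished.
class ResultTypes {

    // Pipelined: producer and consumer overlap through a small bounded queue;
    // the consumer starts as soon as some data is available.
    static int consumePipelined(int n) throws InterruptedException {
        BlockingQueue<Integer> queue = new ArrayBlockingQueue<>(2);
        Thread producer = new Thread(() -> {
            for (int i = 1; i <= n; i++) {
                try { queue.put(i); } catch (InterruptedException e) { return; }
            }
        });
        producer.start();
        int sum = 0;
        for (int i = 0; i < n; i++) sum += queue.take(); // runs concurrently with producer
        producer.join();
        return sum;
    }

    // Blocking: the whole partition is materialized before consumption starts.
    static int consumeBlocking(int n) {
        List<Integer> partition = new ArrayList<>();
        for (int i = 1; i <= n; i++) partition.add(i); // producer finishes first
        return partition.stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) throws InterruptedException {
        // Both modes deliver the same data; only when consumption may start differs.
        System.out.println(consumePipelined(5)); // 15
        System.out.println(consumeBlocking(5));  // 15
    }
}
```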

Did you also see this Wiki page here?

https://cwiki.apache.org/confluence/display/FLINK/Data+exchange+between+tasks

– Ufuk

On Sun, Aug 13, 2017 at 6:59 PM, Luis Alves <lm...@gmail.com> wrote:
> Hi,
>
> Can someone validate the following regarding the ExecutionGraph:
>
> 1. Each IntermediateResult can only be consumed by a single ExecutionJobVertex, i.e. if two ExecutionJobVertices consume the same tuples (same “stream”) produced by the same ExecutionJobVertex, then the producer will have two IntermediateResults, one per consumer.
>
> 2. In other words: if an ExecutionJobVertex performs a map operation and has two consumers (different ExecutionJobVertices), it will produce two datasets/IntermediateResults (both with the same “content”, but different consumers).
>
> 3. Each ExecutionVertex will then have the same number of IntermediateResultPartitions as the number of ExecutionJobVertices that consume the datasets generated by the respective ExecutionJobVertex.
>
> Thus, at runtime:
>
> 4. ResultPartition maps to an IntermediateResultPartition (as documented in the Javadoc). Thus, 3. is also valid for ResultPartitions.
>
> 5. ResultSubPartition maps to an ExecutionEdge (since it contains the information on how to send the partition to the actual consumer Task).
>
> Thanks,
>
> Luís Alves