Posted to user@flink.apache.org by Lu Weizheng <lu...@hotmail.com> on 2019/11/24 06:35:02 UTC

Flink distributed runtime architecture

Hi all,

I have been following Flink for about half a year and have read the official documentation several times. I have a fairly comprehensive understanding of the Flink distributed runtime architecture, but I still have some questions that need to be clarified.

[image: Flink dataflow model figure from the documentation]

On the Flink documentation website, this picture shows the dataflow model of Flink. In the picture, the keyBy, window, and apply operators share the same circle. Is it because these operators are chained together?

[image: parallelized view of the same dataflow from the documentation]

In the parallelized view, the data stream is partitioned into multiple partitions. Each partition is a subset of the source data. Repartitioning happens when we use the keyBy operator. If these tasks share task slots and run like in the picture above, the repartitioning happens both inside a TaskManager and between TaskManagers. Inside a TaskManager, the data transfer may happen just in memory. Between TaskManagers, the data transfer may be inter-process or inter-container, depending on how I deploy the Flink cluster. Is my understanding right? Are these details closely related to the Actor model? I have little knowledge of the Actor model. A rough sketch of the kind of pipeline I have in mind is shown below.
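To make it concrete, here is a rough sketch in the Java DataStream API; the source and window function are just placeholders, not my real job:

	StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
	env.setParallelism(2); // two parallel subtasks per operator
	env.addSource(new MySource())                                   // placeholder source
		.keyBy(event -> event.getKey())                             // repartitions the stream by key
		.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
		.apply(new MyWindowFunction())                              // placeholder window function
		.print();
	env.execute("keyBy repartition example");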

This is my first time using the Flink mailing list. Thank you so much if anyone can explain this.

Weizheng

Re: Flink distributed runtime architecture

Posted by Piotr Nowojski <pi...@ververica.com>.
Hi,

I’m glad to hear that you are interested in Flink! :)

>  In the picture, the keyBy, window, and apply operators share the same circle. Is it because these operators are chained together?

It’s not so much about chaining: the chain of DataStream API invocations `someStream.keyBy(…).window(…).apply(…)` creates a single logical operation. Each of them on its own doesn’t make sense, but together they define how a single `WindowOperator` should behave (`keyBy` additionally defines how records should be shuffled).
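For example, the following sketch (the input stream and the window function are placeholders I’m assuming here) ends up as one keyed `WindowOperator` in the job graph, with `keyBy` only describing the shuffle in front of it:

	input
		.keyBy(record -> record.getKey())                       // how records are partitioned / shuffled
		.window(TumblingEventTimeWindows.of(Time.minutes(1)))   // when the window operator fires
		.apply(new MyWindowFunction());                          // what it computes for each window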

Chaining _usually_ happens between “shuffling” boundaries. So, for example:

someStream
	.map(…) // first chain …
	.filter(…) // … still first chain
	.keyBy(…) // operator chain boundary
	.window(…).apply(…) // beginning of 2nd chain
	.map(…) // 2nd chain
	.filter(…) // still 2nd chain …
	.keyBy(…) // operator chain boundary
	(…)
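If you want to experiment with where those boundaries end up, the DataStream API also lets you control chaining explicitly. A minimal sketch (the surrounding operators are just assumed examples; the chaining methods are the standard ones):

	// env.disableOperatorChaining();     // global switch: disables chaining for the whole job
	someStream
		.map(new MyMapFunction())
		.startNewChain()                  // begin a new chain starting with this map
		.filter(new MyFilterFunction())
		.disableChaining();               // never chain this filter to its neighbours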

>  the repartitioning happens both inside a TaskManager and between TaskManagers. Inside a TaskManager, the data transfer may happen just in memory.

Yes, exactly. Data transfer happens in-memory; however, records are still serialised and deserialised for “local input channels” (that’s what we call communication between operators inside a single TaskManager).

>  Between TaskManagers, the data transfer may be inter-process or inter-container, depending on how I deploy the Flink cluster. Is my understanding right?

Yes, between TaskManagers network sockets are always used, regardless of whether they are running on the same physical machine (localhost) or not.
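Whether a given keyBy exchange stays in-memory or goes over a socket therefore depends on where the scheduler places the communicating subtasks, which you influence through slots and parallelism. A rough sketch (the numbers are only assumptions for illustration):

	// flink-conf.yaml: taskmanager.numberOfTaskSlots: 4  -> up to 4 subtasks per TaskManager
	StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
	env.setParallelism(8); // 8 subtasks per operator; on a 2-TaskManager cluster a keyBy uses
	                       // local input channels between subtasks on the same TaskManager and
	                       // TCP connections between subtasks on different TaskManagers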

> Are these details closely related to the Actor model? I have little knowledge of the Actor model.

I’m not sure I fully understand your question. Flink does not use the Actor model for the data pipelines; Akka actors are only used internally for coordination RPC between the JobManager and TaskManagers, not for shipping records.

I hope that helps :)

Piotrek
