Posted to user@flink.apache.org by Flavio Pompermaier <po...@okkam.it> on 2017/04/05 17:43:00 UTC

Flink slots, threads, task, etc

Hi to all,
I had a very long but useful chat with Fabian and I understood a lot of
concepts that were not at all clear to me. We started from the Flink runtime
documentation page
(https://ci.apache.org/projects/flink/flink-docs-release-1.2/concepts/runtime.html),
but I discovered that the terminology is very inconsistent and misleading
throughout the page...

For example, one of the very first sentences is :
"Flink chains operator subtasks together into tasks. Each task is executed
by one thread."
What I first understood was that every operator can be executed by only a
single thread in the whole cluster... it would probably be better to say "one
thread per task slot" (at least).
Moreover, if I'm not wrong, a Task Slot can execute only 1 subtask (aka
parallel instance) of each task, and there's no limit to the number of
subtasks per slot (and this is not highlighted at all in that document).
The only constraint is that they must belong to different tasks (right?).

If there's a Google Doc version of that page I could try to rewrite it
in order to make some parts easier to understand... however, I still have
some more questions:

   1. Is it correct that a single Task Slot can execute only a single
   subtask of each task, and that this subtask is executed by a single thread
   within the slot?
   2. If it is so:
      1. why does that page say "By default, Flink allows
      subtasks to share slots even if they are subtasks of different tasks, so
      long as they are from the same job"? That wording suggests that it is
      more common to run multiple subtasks of the same task (in a slot) than
      to run subtasks of different tasks, although the latter is still
      permitted... from what I understood, a slot cannot run multiple
      subtasks of the same task at all!
      2. and why this constraint? Is there any good reason for it? A
      subtask is mapped to 1 thread in the TaskManager, so why can a TM with
      2 slots run 2 subtasks of the same task (in the same JVM) while a TM
      with 1 slot cannot (while it can execute an arbitrary number of
      subtasks of different tasks)?
   3. If it is not so, there are no images representing such a situation on
   that page...
   4. Isn't it dangerous to allow (potentially) an unlimited number of threads
   per TM slot?

Cheers,
Flavio

Re: Flink slots, threads, task, etc

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,
there are currently no built-in metrics for InputSplit consumption but I do see that this could be quite helpful. I think you can have a custom RichInputFormat that uses metrics to record stuff, though.

I think adding built-in metrics should be possible at this point in the code: https://github.com/apache/flink/blob/8f3d6d239996c83f7cbd102dc8a85ee626a56bf5/flink-runtime/src/main/java/org/apache/flink/runtime/operators/DataSourceTask.java#L144-L144

Best,
Aljoscha
> On 19. Apr 2017, at 09:25, Flavio Pompermaier <po...@okkam.it> wrote:
> 
> Hi Aljoscha,
> thanks for the reply, it was not urgent and I was aware of the FF... btw, congratulations on it, I saw many interesting talks!
> The Flink community has grown a lot since it was Stratosphere ;)
> Just one last question: in many of my use cases it could be helpful to see how many of the created splits were "consumed" by an inputFormat/source.
> Is it possible to monitor this part somewhere in the dashboards or with a custom metric?


Re: Flink slots, threads, task, etc

Posted by Flavio Pompermaier <po...@okkam.it>.
Hi Aljoscha,
thanks for the reply, it was not urgent and I was aware of the FF... btw,
congratulations on it, I saw many interesting talks!
The Flink community has grown a lot since it was Stratosphere ;)
Just one last question: in many of my use cases it could be helpful to see
how many of the created splits were "consumed" by an inputFormat/source.
Is it possible to monitor this part somewhere in the dashboards or with a
custom metric?

On Tue, Apr 18, 2017 at 5:24 PM, Aljoscha Krettek <al...@apache.org>
wrote:

> Hi,
> sorry that you didn't get any responses, but I think everyone was quite busy
> with Flink Forward SF. I’m also no expert on the topic but I’ll try to
> give some answers.
>
> Regarding a Google Doc version, I don’t think that there is any. You would
> have to modify the Markdown version we have in the doc.
>
> For the other answers I’ll reuse an example program that consists of
> Source -> Map -> Sink, with chaining disabled and parallelism 2. We’ll thus
> have three Tasks: Source, Map, and Sink, with each having two subtasks.
> Let’s denote the subtasks by a number in parentheses, so the first subtask
> for Source is Source(1) and the second one is Source(2). I’ll also refer to
> Source(1) -> Map(1) -> Sink(1) as a slice of the execution graph since
> these can be executed within one slot.
>
> Regarding 1, I think this is true. However, a single slot can execute a
> complete slice of the execution graph where each subtask (from a different
> task) would be executed by its own thread.
>
> Regarding 2.1, Yes, I think it cannot run multiple subtasks of the same
> task, while it is possible (and in fact done) to execute all the subtasks of
> a slice in the same slot.
>
> Regarding 2.2, This is so to allow executing a pipeline of parallelism 8
> using a cluster that has 8 free slots. Basically, each slice fills one slot.
>
> Regarding 3, I don’t really have an answer.
>
> Regarding 4, Yes, this can get a bit out of hand if you have very long
> pipelines.
>
> Best,
> Aljoscha

Re: Flink slots, threads, task, etc

Posted by Aljoscha Krettek <al...@apache.org>.
Hi,
sorry that you didn't get any responses, but I think everyone was quite busy with Flink Forward SF. I’m also no expert on the topic but I’ll try to give some answers.

Regarding a Google Doc version, I don’t think that there is any. You would have to modify the Markdown version we have in the doc.

For the other answers I’ll reuse an example program that consists of Source -> Map -> Sink, with chaining disabled and parallelism 2. We’ll thus have three Tasks: Source, Map, and Sink, with each having two subtasks. Let’s denote the subtasks by a number in parentheses, so the first subtask for Source is Source(1) and the second one is Source(2). I’ll also refer to Source(1) -> Map(1) -> Sink(1) as a slice of the execution graph since these can be executed within one slot.

Regarding 1, I think this is true. However, a single slot can execute a complete slice of the execution graph where each subtask (from a different task) would be executed by its own thread.

Regarding 2.1, Yes, I think it cannot run multiple subtasks of the same task, while it is possible (and in fact done) to execute all the subtasks of a slice in the same slot.

Regarding 2.2, This is so to allow executing a pipeline of parallelism 8 using a cluster that has 8 free slots. Basically, each slice fills one slot.
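
To make the "one slice per slot" idea concrete, here is a minimal sketch in plain Python. It is only a toy model of the rule described in this thread (a slot holds at most one subtask of each task), not Flink's actual scheduler; the function name assign_slots is made up for illustration, and it assumes every task has the same parallelism.

```python
from collections import defaultdict


def assign_slots(tasks, parallelism):
    """Toy model of slot sharing: subtask i of every task goes into slot i.

    Assumption: all tasks share the same parallelism, so each slot ends up
    with exactly one subtask per task (one "slice" of the execution graph).
    """
    slots = defaultdict(list)
    for task in tasks:
        for i in range(1, parallelism + 1):
            # A slot never receives a second subtask of the same task.
            slots[i].append(f"{task}({i})")
    return dict(slots)


# The example program from this thread: Source -> Map -> Sink, parallelism 2.
slots = assign_slots(["Source", "Map", "Sink"], parallelism=2)
for slot_id, subtasks in slots.items():
    print(f"slot {slot_id}: {' -> '.join(subtasks)}")
```

In this model the number of slots needed equals the parallelism, which is why a pipeline of parallelism 8 fits into 8 free slots regardless of how many tasks it has.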

Regarding 3, I don’t really have an answer.

Regarding 4, Yes, this can get a bit out of hand if you have very long pipelines.
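
As a rough back-of-the-envelope sketch of why long pipelines can get out of hand: under the description above (one thread per subtask, one subtask of each task per slot, chaining disabled), the number of subtask threads in a slot grows linearly with the number of tasks. The helper names below are hypothetical, and the count covers only subtask threads in fully used slots, not any other threads in the TaskManager JVM.

```python
def threads_per_slot(num_tasks):
    """One thread per subtask, and a slot holds one subtask of each task."""
    return num_tasks


def threads_per_task_manager(num_tasks, slots_per_tm):
    """Subtask threads in one TaskManager when all of its slots are in use."""
    return threads_per_slot(num_tasks) * slots_per_tm


# Pipelines of increasing length on a TaskManager with 4 slots (assumed value).
for num_tasks in (3, 10, 50):
    total = threads_per_task_manager(num_tasks, slots_per_tm=4)
    print(f"{num_tasks} tasks -> {threads_per_slot(num_tasks)} threads/slot, "
          f"{total} threads in the TM")
```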

Best,
Aljoscha
> On 11. Apr 2017, at 14:37, Flavio Pompermaier <po...@okkam.it> wrote:
> 
> Any feedback here..?


Re: Flink slots, threads, task, etc

Posted by Flavio Pompermaier <po...@okkam.it>.
Any feedback here..?
