Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2020/02/18 10:49:00 UTC

[jira] [Commented] (FLINK-16069) Creation of TaskDeploymentDescriptor can block main thread for long time

    [ https://issues.apache.org/jira/browse/FLINK-16069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17038976#comment-17038976 ] 

Till Rohrmann commented on FLINK-16069:
---------------------------------------

Hi [~huwh], do you know what exactly is taking so long? Is it the creation of the {{TaskDeploymentDescriptors}}? If yes, is it the iteration over the input edges?

I think it is not as easy as moving the {{TaskDeploymentDescriptor}} creation into a future because we are accessing the {{ExecutionGraph}} through the passed result partitions. This means that in case of a concurrent recovery we might have a race condition where we read state from an already reset {{Execution}}, for example.
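
To make the concern concrete, here is a minimal sketch of one way to avoid that race: copy the needed state into an immutable snapshot while still on the main thread, build the descriptor from the snapshot in a future, and check on the main thread afterwards whether the attempt is still current. This is not Flink code. {{MutableExecution}}, {{InputSnapshot}}, {{buildDescriptor}} and the executor setup are hypothetical stand-ins, and whether the real {{ExecutionGraph}} state can be snapshotted this cheaply is an assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

class DescriptorSnapshotSketch {

    /** Hypothetical stand-in for mutable per-attempt state owned by the main thread. */
    static final class MutableExecution {
        private int attemptNumber = 0;
        private String partitionLocation = "tm-1";

        // A recovery resets the execution: new attempt, location unknown again.
        void resetForNewAttempt() {
            attemptNumber++;
            partitionLocation = null;
        }
    }

    /** Immutable copy of everything the (hypothetical) descriptor needs. */
    static final class InputSnapshot {
        final int attemptNumber;
        final String partitionLocation;

        InputSnapshot(int attemptNumber, String partitionLocation) {
            this.attemptNumber = attemptNumber;
            this.partitionLocation = partitionLocation;
        }
    }

    // The expensive part; it only touches the snapshot, never the live execution.
    static String buildDescriptor(InputSnapshot snapshot) {
        return "attempt=" + snapshot.attemptNumber + ", input=" + snapshot.partitionLocation;
    }

    static String deployThenReset() throws Exception {
        ExecutorService ioExecutor = Executors.newSingleThreadExecutor();
        try {
            MutableExecution execution = new MutableExecution();

            // On the "main thread": take the snapshot before going asynchronous.
            InputSnapshot snapshot =
                    new InputSnapshot(execution.attemptNumber, execution.partitionLocation);

            CompletableFuture<String> descriptor =
                    CompletableFuture.supplyAsync(() -> buildDescriptor(snapshot), ioExecutor);

            // A concurrent recovery resets the execution while the future may still run.
            execution.resetForNewAttempt();

            // Blocking here is for the demo only; real code would continue in a callback
            // that is executed on the main thread.
            String built = descriptor.get();

            // Back on the "main thread": the descriptor is consistent but may be outdated.
            boolean stale = snapshot.attemptNumber != execution.attemptNumber;
            return built + (stale ? " (stale, discard)" : "");
        } finally {
            ioExecutor.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(deployThenReset());
    }
}
```

The future never sees the reset {{partitionLocation}}, but the result belongs to an old attempt, so the main thread still has to detect and drop it. Reading the live fields inside the lambda instead would reintroduce exactly the race described above.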

> Creation of TaskDeploymentDescriptor can block main thread for long time
> ------------------------------------------------------------------------
>
>                 Key: FLINK-16069
>                 URL: https://issues.apache.org/jira/browse/FLINK-16069
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>            Reporter: huweihua
>            Priority: Major
>
> Deploying tasks takes a long time when we submit a high-parallelism job. Because Execution#deploy runs in the main thread, it blocks the JobMaster from processing other Akka messages, such as heartbeats. Creating the TaskDeploymentDescriptor takes most of the time. We could move the creation into a future.
> For example, for a job [source(8000)->sink(8000)], moving all 16000 tasks from SCHEDULED to DEPLOYING took more than 1 minute. This caused the TaskManager heartbeat to time out, and the job never succeeded.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)