Posted to issues@flink.apache.org by "Xintong Song (Jira)" <ji...@apache.org> on 2020/01/08 08:09:00 UTC

[jira] [Comment Edited] (FLINK-13554) ResourceManager should have a timeout on starting new TaskExecutors.

    [ https://issues.apache.org/jira/browse/FLINK-13554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17010451#comment-17010451 ] 

Xintong Song edited comment on FLINK-13554 at 1/8/20 8:08 AM:
--------------------------------------------------------------

We have confirmed that the release-1.10 blocker FLINK-15456 is actually caused by the problem described in this ticket.
Since this problem was not introduced in 1.10, I believe it should not be a blocker. However, how to fix the problem and whether it needs to be fixed in 1.10 still need to be discussed.
I'm setting this ticket to release-1.10 critical for now, so that it is not overlooked before a decision is made.
cc [~gjy] [~liyu] [~zhuzh] [~chesnay] [~trohrmann] [~karmagyz]


was (Author: xintongsong):
We have confirmed that the release-1.10 blocker FLINK-15456 is actually caused by the problem described in this ticket.
Since this problem was not introduced in 1.10, I believe it should not be a blocker. However, how to fix the problem and whether it needs to be fixed in 1.10 still need to be discussed.
I'm setting this ticket to release-1.10 critical for now, so that it is not overlooked before a decision is made.

> ResourceManager should have a timeout on starting new TaskExecutors.
> --------------------------------------------------------------------
>
>                 Key: FLINK-13554
>                 URL: https://issues.apache.org/jira/browse/FLINK-13554
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Xintong Song
>            Priority: Critical
>             Fix For: 1.10.0
>
>
> Recently, we encountered a case where a TaskExecutor got stuck while launching on Yarn (without failing), so the job could not recover from continuous failovers.
> The TaskExecutor got stuck because of a problem in our environment. It got stuck at some point after the ResourceManager had started it and was waiting for it to come up and register. Later, when the slot request timed out, the job failed over and requested slots from the ResourceManager again. The ResourceManager still saw a TaskExecutor (the stuck one) being started and therefore did not request a new container from Yarn. As a result, the job could not recover from the failure.
> I think that, to avoid such an unrecoverable state, the ResourceManager needs a timeout on starting new TaskExecutors. If starting a TaskExecutor takes too long, the ResourceManager should fail that TaskExecutor and start a new one.
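
To make the proposal concrete, here is a minimal sketch of how a start timeout could be tracked. It is not Flink's actual ResourceManager code: the class PendingWorkerTracker, its methods, and the use of plain string container IDs are all hypothetical names chosen for illustration. The caller passes in the current time, so the logic can be checked without real timers.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Flink's ResourceManager API. */
class PendingWorkerTracker {
    private final long startTimeoutMillis;
    // containerId -> time (in millis) at which the start was requested
    private final Map<String, Long> pendingSince = new HashMap<>();

    PendingWorkerTracker(long startTimeoutMillis) {
        this.startTimeoutMillis = startTimeoutMillis;
    }

    /** Record that a worker start was requested at the given time. */
    void onWorkerRequested(String containerId, long nowMillis) {
        pendingSince.put(containerId, nowMillis);
    }

    /** Record that the worker registered; returns false if it was not pending. */
    boolean onWorkerRegistered(String containerId) {
        return pendingSince.remove(containerId) != null;
    }

    /** Number of workers that were requested but have not registered yet. */
    int pendingCount() {
        return pendingSince.size();
    }

    /**
     * Remove and return every pending worker whose start has taken at least
     * the timeout. The caller would then release those containers and
     * request replacements.
     */
    List<String> expire(long nowMillis) {
        List<String> timedOut = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pendingSince.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> entry = it.next();
            if (nowMillis - entry.getValue() >= startTimeoutMillis) {
                timedOut.add(entry.getKey());
                it.remove();
            }
        }
        return timedOut;
    }
}
```

With something along these lines, a stuck TaskExecutor would stop counting as "being started" once it expires, so the next slot request would trigger a new container request instead of waiting indefinitely.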


