Posted to issues@flink.apache.org by "Andrey Zagrebin (Jira)" <ji...@apache.org> on 2020/02/21 13:40:00 UTC

[jira] [Commented] (FLINK-16215) Start redundant TaskExecutor when JM failed

    [ https://issues.apache.org/jira/browse/FLINK-16215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041869#comment-17041869 ] 

Andrey Zagrebin commented on FLINK-16215:
-----------------------------------------

I assume this is about an active RM integration, e.g. Yarn.

I agree that reusing existing TMs is better, but if this happens rarely and is hard to reproduce, why is it a problem? Failover should also be a rare case, so starting a new TM should not be a big penalty, and the existing TMs will just disappear after the timeout, as already pointed out.

cc [~trohrmann]

> Start redundant TaskExecutor when JM failed
> -------------------------------------------
>
>                 Key: FLINK-16215
>                 URL: https://issues.apache.org/jira/browse/FLINK-16215
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0
>            Reporter: YufeiLiu
>            Priority: Major
>
> When the JM fails, the TaskExecutor reconnects to the new ResourceManager leader, and the JobMaster restarts and reschedules the job. If the job's slot requests arrive earlier than the TM registration, the RM starts new workers rather than reusing the existing TMs.
> It's hard to reproduce because TM registration usually comes first, and the timeout check will stop the redundant TMs.
> But I think it would be better if we made {{recoverWokerNode}} an interface method and put the recovered slots in {{pendingSlots}} to wait for TM reconnection.
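
The race and the proposed fix described in the issue can be sketched with a toy model. This is only an illustration under stated assumptions: the class RecoveredSlotSketch and its methods (recoverWorker, requestSlot, registerTaskExecutor) are hypothetical and are not Flink's ResourceManager or SlotManager API, and each newly started worker is assumed to provide exactly one slot.

```java
// Hypothetical toy model of the scenario in the issue; not Flink's actual API.
class RecoveredSlotSketch {
    private int freeSlots;        // slots of registered TMs that are not yet assigned
    private int pendingSlots;     // slots expected from workers that have not registered yet
    private int waitingRequests;  // slot requests waiting for a worker to register
    private int workersStarted;   // new workers the RM had to start

    // Proposed behaviour: on recovery, record the slots of a worker known from
    // the previous attempt as pending, before that worker reconnects.
    public void recoverWorker(int slots) {
        pendingSlots += slots;
    }

    // A slot request from the restarted JobMaster.
    public void requestSlot() {
        if (freeSlots > 0) {
            freeSlots--;                 // served by an already registered TM
        } else if (pendingSlots > 0) {
            pendingSlots--;              // reserve a recovered slot and wait
            waitingRequests++;
        } else {
            workersStarted++;            // nothing known: start a new worker (assumed: one slot)
            waitingRequests++;
        }
    }

    // A TM from the previous attempt reconnects and offers its slots.
    public void registerTaskExecutor(int slots) {
        for (int i = 0; i < slots; i++) {
            if (waitingRequests > 0) {
                waitingRequests--;       // fulfils a waiting request
            } else {
                if (pendingSlots > 0) {
                    pendingSlots--;      // a pending slot becomes a real free slot
                }
                freeSlots++;
            }
        }
    }

    public int workersStarted() {
        return workersStarted;
    }

    public static void main(String[] args) {
        // Request arrives before registration, no recovered slots recorded.
        RecoveredSlotSketch current = new RecoveredSlotSketch();
        current.requestSlot();
        current.registerTaskExecutor(1);
        System.out.println("without recovery: " + current.workersStarted());

        // Same ordering, but the recovered worker's slot was recorded as pending.
        RecoveredSlotSketch proposed = new RecoveredSlotSketch();
        proposed.recoverWorker(1);
        proposed.requestSlot();
        proposed.registerTaskExecutor(1);
        System.out.println("with recovery: " + proposed.workersStarted());
    }
}
```

In this model, the only difference between the two runs is whether the recovered worker's slot is known as pending when the request arrives, which is the ordering problem the issue describes.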


