You are viewing a plain text version of this content. The canonical link for it is here.

Posted to issues@flink.apache.org by "Zhu Zhu (Jira)" <ji...@apache.org> on 2020/04/15 06:34:00 UTC

[jira] [Commented] (FLINK-17075) Add task status reconciliation between TM and JM

    [ https://issues.apache.org/jira/browse/FLINK-17075?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17083845#comment-17083845 ] 

Zhu Zhu commented on FLINK-17075:
---------------------------------

From the current implementation of {{TaskExecutor#updateTaskExecutionState}}, TM will fail the task if the {{updateTaskExecutionState(...)}} RPC fails. However, if it is a final state notification, the task should have been unregistered from the TM right before the RPC, and the {{failTask(...)}} handling will not take any effect.

So how about adding retry logic for calling {{JobMasterGateway#updateTaskExecutionState(...)}} in TaskExecutor, unless the PRC succeeds or fails with expected errors (like ExecutionGraphException)? Or maybe just retry for final states update RPC and {{failTask(...)}} on none final state update RPC in cases that the RPC fails.
This guarantees that task states in JM and TM can be finally synced. It is more lightweight than adding task state payloads in TM-JM heartbeat, and nothing needs to be changed in JM. If the retry keeps failing for network issues, the heartbeat would also fail I think.


> Add task status reconciliation between TM and JM
> ------------------------------------------------
>
>                 Key: FLINK-17075
>                 URL: https://issues.apache.org/jira/browse/FLINK-17075
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.10.0, 1.11.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> In order to harden the TM and JM communication I suggest to let the {{TaskExecutor}} send the task statuses back to the {{JobMaster}} as part of the heartbeat payload (similar to FLINK-11059). This would allow to reconcile the states of both components in case that a status update message was lost as described by a user on the ML.
> https://lists.apache.org/thread.html/ra9ed70866381f0ef0f4779633346722ccab3dc0d6dbacce04080b74e%40%3Cuser.flink.apache.org%3E



--
This message was sent by Atlassian Jira
(v8.3.4#803005)