Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2019/02/06 09:51:00 UTC
[jira] [Updated] (FLINK-11537) ExecutionGraph does not reach terminal state when JobMaster lost leadership
[ https://issues.apache.org/jira/browse/FLINK-11537?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Till Rohrmann updated FLINK-11537:
----------------------------------
Description:
The {{ExecutionGraph}} sometimes does not reach a terminal state when the {{JobMaster}} loses leadership. The reason is that we use the fenced main thread executor to execute {{ExecutionGraph}} changes, and we don't wait for the {{ExecutionGraph}} to reach the terminal state before we set the fencing token to {{null}}.
One possible solution would be to wait for the {{ExecutionGraph}} to reach the terminal state before clearing the fencing token. The downside, however, is that the {{JobMaster}} remains reachable until the {{ExecutionGraph}} has been properly terminated. Alternatively, we could use the unfenced main thread executor to send out the cancel calls.
A Travis run where the problem occurred is here: https://travis-ci.org/tillrohrmann/flink/jobs/489119926
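
The race described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Flink code: {{FencingSketch}}, {{MainThread}}, {{executeFenced}}, {{executeUnfenced}} and {{loseLeadership}} are hypothetical names invented for the illustration, and the model assumes that a fenced task is silently dropped once its token no longer matches the current one.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.UUID;

/** Toy model (NOT Flink code) of a fenced vs. unfenced main thread executor. */
public class FencingSketch {

    /** Simplified job states; CANCELED is the only terminal state modelled here. */
    enum JobStatus { RUNNING, CANCELLING, CANCELED }

    /** Hypothetical single-threaded mailbox that drops fenced tasks with a stale token. */
    static final class MainThread {
        private final Queue<Runnable> mailbox = new ArrayDeque<>();
        private UUID fencingToken = UUID.randomUUID();

        UUID currentToken() { return fencingToken; }

        /** Models losing leadership: the fencing token is set to null. */
        void clearToken() { fencingToken = null; }

        /** The task only runs if the token it was submitted with is still current. */
        void executeFenced(UUID expectedToken, Runnable task) {
            mailbox.add(() -> {
                if (expectedToken != null && expectedToken.equals(fencingToken)) {
                    task.run();
                }
                // Otherwise the task is dropped (assumption of this model).
            });
        }

        /** The task always runs, regardless of the current fencing token. */
        void executeUnfenced(Runnable task) { mailbox.add(task); }

        /** Runs all queued tasks in submission order. */
        void drain() {
            Runnable next;
            while ((next = mailbox.poll()) != null) {
                next.run();
            }
        }
    }

    /** Returns the job status after leadership was lost while a cancel was in flight. */
    static JobStatus loseLeadership(boolean useUnfencedExecutor) {
        MainThread mainThread = new MainThread();
        JobStatus[] status = { JobStatus.RUNNING };
        UUID token = mainThread.currentToken();

        // Cancellation starts; the final transition happens in a later callback.
        status[0] = JobStatus.CANCELLING;
        Runnable onTasksCanceled = () -> status[0] = JobStatus.CANCELED;

        // The token is cleared before the callback has been processed.
        mainThread.clearToken();

        if (useUnfencedExecutor) {
            mainThread.executeUnfenced(onTasksCanceled);
        } else {
            mainThread.executeFenced(token, onTasksCanceled);
        }
        mainThread.drain();
        return status[0];
    }

    public static void main(String[] args) {
        System.out.println("fenced:   " + loseLeadership(false)); // stays CANCELLING
        System.out.println("unfenced: " + loseLeadership(true));  // reaches CANCELED
    }
}
```

In this model the fenced variant leaves the job in {{CANCELLING}} because the callback is dropped after the token was cleared, whereas the unfenced variant still reaches {{CANCELED}}. The first proposed solution would correspond to calling {{clearToken()}} only after the callback has run.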
was:
The {{ExecutionGraph}} sometimes does not reach a terminal state if the {{JobMaster}} lost the leadership. The reason is that we use the fenced main thread executor to execute {{ExecutionGraph}} changes and we don't wait for the {{ExecutionGraph}} to reach the terminal state before we set the fencing token {{null}}.
One possible solution would be to wait for the {{ExecutionGraph}} to reach the terminal state before clearing the fencing token. This has, however, the downside that the {{JobMaster}} is still reachable until the {{ExecutionGraph}} has been properly terminated. Alternatively, we could use the unfenced main thread executor to send the cancel calls out.
> ExecutionGraph does not reach terminal state when JobMaster lost leadership
> ---------------------------------------------------------------------------
>
> Key: FLINK-11537
> URL: https://issues.apache.org/jira/browse/FLINK-11537
> Project: Flink
> Issue Type: Bug
> Components: Distributed Coordination
> Affects Versions: 1.8.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Priority: Critical
> Fix For: 1.8.0
>
>
> The {{ExecutionGraph}} sometimes does not reach a terminal state when the {{JobMaster}} loses leadership. The reason is that we use the fenced main thread executor to execute {{ExecutionGraph}} changes, and we don't wait for the {{ExecutionGraph}} to reach the terminal state before we set the fencing token to {{null}}.
> One possible solution would be to wait for the {{ExecutionGraph}} to reach the terminal state before clearing the fencing token. The downside, however, is that the {{JobMaster}} remains reachable until the {{ExecutionGraph}} has been properly terminated. Alternatively, we could use the unfenced main thread executor to send out the cancel calls.
> A Travis run where the problem occurred is here: https://travis-ci.org/tillrohrmann/flink/jobs/489119926