Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2018/09/13 13:41:00 UTC
[jira] [Comment Edited] (FLINK-10184) HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel

    [ https://issues.apache.org/jira/browse/FLINK-10184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16613513#comment-16613513 ] 

Till Rohrmann edited comment on FLINK-10184 at 9/13/18 1:40 PM:
----------------------------------------------------------------

[~Jamalarm] yes, you would need to check out my branch and then build Flink yourself. The binaries are then located in flink-dist/target/flink-1.7-SNAPSHOT-bin/flink-1.7-SNAPSHOT. I don't think you need to rebuild your job, because the branch should not contain any API changes. I'm about to merge the PR, so there should be a snapshot build with the fix very soon (hopefully by tomorrow morning).

Thanks a lot for your help [~wcummings]!



> HA Failover broken due to JobGraphs not being removed from Zookeeper on cancel
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-10184
>                 URL: https://issues.apache.org/jira/browse/FLINK-10184
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.2, 1.6.0
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>             Fix For: 1.6.1, 1.7.0, 1.5.4
>
>
> We have encountered a blocking issue when upgrading our cluster to 1.5.2.
> It appears that, when jobs are cancelled manually (in our case with a savepoint), the JobGraphs are NOT removed from the ZooKeeper {{jobgraphs}} node.
> This means that, if you start a job, cancel it, restart it, cancel it, and so on, you will end up with many job graphs stored in ZooKeeper, but none of the corresponding blobs in the Flink HA directory.
> When an HA failover occurs, the newly elected leader retrieves all of those old JobGraph objects from ZooKeeper, then goes looking for the corresponding blobs in the HA directory. The blobs are not there, so the JobManager explodes and the process dies.
> At this point the cluster has to be fully stopped, the ZooKeeper jobgraphs cleared out by hand, and all the JobManagers restarted.
> I can see the following line in the JobManager logs:
> {quote}
> 2018-08-20 16:17:20,776 INFO  org.apache.flink.runtime.jobmanager.ZooKeeperSubmittedJobGraphStore  - Removed job graph 4e9a5a9d70ca99dbd394c35f8dfeda65 from ZooKeeper.
> {quote}
> But looking in ZooKeeper, the {{4e9a5a9d70ca99dbd394c35f8dfeda65}} job is still very much there.
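
The failure mode described above can be modelled in a few lines. This is only a rough sketch, not Flink's actual recovery code: it assumes the job IDs under the {{jobgraphs}} node and the job IDs that still have blobs in the HA directory have already been collected by hand (for example with the ZooKeeper CLI and a directory listing). The function names find_orphaned_job_graphs and simulate_recovery are hypothetical and exist only for illustration.

```python
from typing import Iterable, List


def find_orphaned_job_graphs(zk_job_ids: Iterable[str],
                             blob_job_ids: Iterable[str]) -> List[str]:
    """Return job IDs that still have a job graph entry in ZooKeeper
    but no corresponding blobs in the HA directory (hypothetical helper)."""
    blobs = set(blob_job_ids)
    return sorted(job_id for job_id in set(zk_job_ids) if job_id not in blobs)


def simulate_recovery(zk_job_ids: Iterable[str],
                      blob_job_ids: Iterable[str]) -> List[str]:
    """Simplified model of a newly elected leader recovering jobs.

    Assumption: recovery aborts on the first job graph whose blobs are
    missing, which mirrors the behaviour reported in this issue.
    """
    blobs = set(blob_job_ids)
    recovered = []
    for job_id in sorted(set(zk_job_ids)):
        if job_id not in blobs:
            raise FileNotFoundError(
                f"blobs for job {job_id} not found in HA directory")
        recovered.append(job_id)
    return recovered


if __name__ == "__main__":
    # Example IDs are made up, apart from the one quoted in the log line.
    in_zookeeper = ["4e9a5a9d70ca99dbd394c35f8dfeda65", "running-job"]
    in_ha_directory = ["running-job"]
    print(find_orphaned_job_graphs(in_zookeeper, in_ha_directory))
```

Any ID reported by find_orphaned_job_graphs is a candidate for the manual cleanup mentioned in the description; a cluster with no such IDs should get through simulate_recovery without raising.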


