Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2018/10/03 19:40:00 UTC

[jira] [Commented] (FLINK-10475) Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)

    [ https://issues.apache.org/jira/browse/FLINK-10475?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16637433#comment-16637433 ] 

Till Rohrmann commented on FLINK-10475:
---------------------------------------

Good to know [~Jamalarm]. We could also add it to the HA documentation. Do you wanna contribute a PR for that? Then I guess we can close this issue, right?

> Standalone HA - Leader election is not triggered on loss of leader (ZK 3.5.3-beta only)
> ---------------------------------------------------------------------------------------
>
>                 Key: FLINK-10475
>                 URL: https://issues.apache.org/jira/browse/FLINK-10475
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.6.1, 1.5.4
>            Reporter: Thomas Wozniakowski
>            Priority: Minor
>         Attachments: t1.log, t2.log, t3.log
>
>
> Hey Guys,
> Just testing the new bugfix release of 1.5.4 (edit: also happens with 1.6.1). Happy to see that the issue of jobgraphs hanging around forever has been resolved in standalone/zookeeper HA mode, but now I'm seeing a different issue.
> It looks like the HA failover is never triggered. I set up a 3/3/3 cluster of zookeeper/jobmanager/taskmanagers. Started my job, all fine with the new version. I then proceeded to kill the leading jobmanager to test the failover.
> The remaining jobmanagers never triggered a leader election, and simply got stuck.
> Please give me a shout if I can provide any more useful information.
> EDIT:
> Jobmanager logs attached below. You can see that I brought up a fresh cluster, one JM was elected leader (no taskmanagers or actual jobs in this case). I then let the cluster sit there for half an hour or so, before killing the leader. The log files are snapshotted maybe half an hour after that. You can see that a second election was never triggered.
> In case it's useful, our zookeeper quorum is running "3.5.3-beta". This setup previously worked with 1.4.3. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)