Posted to issues@flink.apache.org by "Xiangyu Zhu (JIRA)" <ji...@apache.org> on 2018/10/23 14:33:00 UTC

[jira] [Resolved] (FLINK-10133) finished job's jobgraph is never cleaned up in ZooKeeper for standalone clusters (HA mode with multiple masters)

     [ https://issues.apache.org/jira/browse/FLINK-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Xiangyu Zhu resolved FLINK-10133.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.1

> finished job's jobgraph is never cleaned up in ZooKeeper for standalone clusters (HA mode with multiple masters)
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-10133
>                 URL: https://issues.apache.org/jira/browse/FLINK-10133
>             Project: Flink
>          Issue Type: Bug
>          Components: JobManager
>    Affects Versions: 1.5.0, 1.5.2, 1.6.0
>            Reporter: Xiangyu Zhu
>            Priority: Major
>             Fix For: 1.6.1
>
>         Attachments: client.log, namenode.log, standalonesession.log, zookeeper.log
>
>
> Hi,
> We have 3 servers in our test environment, referred to as node1-3. The setup is as follows (a sketch of the corresponding configuration follows the list):
>  * Hadoop HDFS: node1 as namenode, node2 and node3 as datanodes
>  * ZooKeeper: node1-3 as a quorum (we also tried node1 alone)
>  * Flink: node1 and node2 as masters, node2 and node3 as slaves
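> For reference, here is a minimal sketch of the HA-related configuration we are assuming; the host names, ports, and storage path below are placeholders rather than our exact values:
>
>     # conf/flink-conf.yaml
>     high-availability: zookeeper
>     high-availability.zookeeper.quorum: node1:2181,node2:2181,node3:2181
>     high-availability.storageDir: hdfs://node1:8020/flink/ha/
>     high-availability.zookeeper.path.root: /flink
>     high-availability.cluster-id: /default
>
>     # conf/masters
>     node1:8081
>     node2:8081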
> As I understand it, when a job finishes, the corresponding job's blob data is expected to be deleted from the HDFS path, and the node under ZooKeeper's path `/{zk path root}/{cluster-id}/jobgraphs/{job id}` should be deleted after that. However, we observe that whenever we submit a job and it finishes (via `bin/flink run WordCount.jar`), the blob data is gone whereas the job id node under ZooKeeper is still there, with a UUID-style lock node inside it. In ZooKeeper's debug log we observed something like "cannot be deleted because non empty". Because of this, as long as a job has finished and its jobgraph node persists, restarting the cluster or killing one job manager (to test HA mode) makes Flink try to recover the finished job; it cannot find the blob data under HDFS, and the whole cluster goes down.
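> To show what is left behind, here is a small sketch using the plain ZooKeeper Java client to list the leftover job nodes and their lock children; the connect string and the `/flink/default/jobgraphs` path are assumptions matching the placeholder config above, not necessarily our exact values:
>
>     import java.util.List;
>     import org.apache.zookeeper.ZooKeeper;
>
>     public class ListLeftoverJobGraphs {
>         public static void main(String[] args) throws Exception {
>             // Connect to the ZooKeeper quorum (placeholder connect string).
>             ZooKeeper zk = new ZooKeeper("node1:2181,node2:2181,node3:2181", 30000, event -> {});
>
>             // Path layout assumed from the config sketch above: {path.root}/{cluster-id}/jobgraphs
>             String jobGraphsPath = "/flink/default/jobgraphs";
>             for (String jobId : zk.getChildren(jobGraphsPath, false)) {
>                 // Each leftover job node still contains a UUID-style lock child,
>                 // which is presumably why the parent "cannot be deleted because non empty".
>                 List<String> locks = zk.getChildren(jobGraphsPath + "/" + jobId, false);
>                 System.out.println(jobId + " -> lock children: " + locks);
>             }
>             zk.close();
>         }
>     }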
> If we use only node1 as master and node2 and node3 as slaves, the jobgraph nodes are deleted successfully. When the jobgraphs path is clean, killing one job manager makes a standby JM become leader, so it is only this jobgraph cleanup issue that prevents HA from working.
> I'm not sure whether something is wrong with our configs, because this happens every time a job finishes (we have only tested with WordCount.jar, though). I'm aware of FLINK-10011 and FLINK-10029, but unlike FLINK-10011 this happens every time, rendering HA mode unusable for us.
> Any idea what might cause this?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)