Posted to user@flink.apache.org by Fritz Budiyanto <fb...@icloud.com> on 2020/05/20 01:21:59 UTC

Jobgraph not getting deleted from Zookeeper

Hi All,


I have seen this issue several times: JobGraphs are not cleaned up properly. As a result, when the Flink cluster is restarted, it attempts an HA restore from a checkpoint that no longer exists, and the restarted cluster eventually gives up and stays down.

The workaround is to clean up the JobGraph manually from Zookeeper. Is this a known issue?


2020-05-19 19:56:21,471 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Un-registering task and sending final execution state FINISHED to JobManager for task Source: kafkaConsumer[update_server] -> (DetectedUpdateMessageConverter -> Sink: update_server.detected_updates, DrivenCoordinatesMessageConverter -> Sink: update_server.driven_coordinates) 588902a8096f49845b09fa1f595d6065.
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTable - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.7976931348623157E308, heapMemoryInMB=2147483647, directMemoryInMB=2147483647, nativeMemoryInMB=2147483647, networkMemoryInMB=2147483647, managedMemoryInMB=642}, allocationId: 29f6a5f83c832486f2d7ebe5c779fa32, jobId: 86a028b3f7aada8ffe59859ca71d6385).
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job 86a028b3f7aada8ffe59859ca71d6385 from job leader monitoring.
2020-05-19 19:56:21,622 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService /leader/86a028b3f7aada8ffe59859ca71d6385/job_manager_lock.
2020-05-19 19:56:21,623 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job 86a028b3f7aada8ffe59859ca71d6385.
2020-05-19 19:56:21,624 INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to job 86a028b3f7aada8ffe59859ca71d6385 because it is not registered.


...

Zookeeper CLI:


ls /flink/cluster_update/jobgraphs
[86a028b3f7aada8ffe59859ca71d6385]
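
For reference, the manual cleanup could be scripted roughly as below. This is only a sketch under assumptions, not something Flink provides: the function names are made up for illustration, `zk` is assumed to be any client object exposing `get_children(path)` and `delete(path, recursive=True)` (the kazoo `KazooClient` has both), and the list of live job IDs has to come from somewhere else, such as the Flink UI. It defaults to a dry run, since deleting the JobGraph of a job that is still running would break its recovery.

```python
from typing import Iterable, List


def find_stale_jobgraphs(zk, jobgraphs_path: str, live_job_ids: Iterable[str]) -> List[str]:
    """Return the znode paths under jobgraphs_path whose job ID is not live."""
    live = set(live_job_ids)
    children = zk.get_children(jobgraphs_path)
    # Sort so the result is stable regardless of the order ZooKeeper returns.
    return [f"{jobgraphs_path}/{job_id}" for job_id in sorted(children) if job_id not in live]


def delete_stale_jobgraphs(zk, jobgraphs_path: str, live_job_ids: Iterable[str],
                           dry_run: bool = True) -> List[str]:
    """Delete leftover JobGraph znodes; with dry_run=True only report them."""
    stale = find_stale_jobgraphs(zk, jobgraphs_path, live_job_ids)
    if not dry_run:
        for path in stale:
            # Recursive, in case the JobGraph znode has child nodes.
            zk.delete(path, recursive=True)
    return stale
```

With the path from the listing above, `delete_stale_jobgraphs(zk, "/flink/cluster_update/jobgraphs", live_job_ids=[])` would report the leftover `86a028b3f7aada8ffe59859ca71d6385` node, and passing `dry_run=False` would remove it.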

Thanks,
Fritz