Posted to issues@flink.apache.org by "Jiayi Liao (Jira)" <ji...@apache.org> on 2020/10/23 07:31:00 UTC

[jira] [Commented] (FLINK-19778) Failed job reinitiated with wrong checkpoint after a ZK reconnection

    [ https://issues.apache.org/jira/browse/FLINK-19778?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17219520#comment-17219520 ] 

Jiayi Liao commented on FLINK-19778:
------------------------------------

In the JM log, the key question is why the ZooKeeper checkpoint store could not find any checkpoints in ZooKeeper.

 
{code:java}
2020-10-23 06:17:59,635 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Recovering checkpoints from ZooKeeper.
2020-10-23 06:17:59,706 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Found 0 checkpoints in ZooKeeper.
2020-10-23 06:17:59,706 INFO  org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore  - Trying to fetch 0 checkpoints from storage.
2020-10-23 06:17:59,706 INFO  org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Starting job e70c0d75910ce83ea50cc34ce62a241a from savepoint hdfs://cosmos/flink/user_10342/savepoints/5d6e4d1574aabf4fab5281f4/savepoint-bbbc5a-f381255749ac (allowing non restored state)
{code}
Did you find any clues in ZooKeeper's logs? BTW, can you query the ZK data on the command line?

The ZK path should be ${high-availability.zookeeper.path.root} + ${high-availability.cluster-id} + ${high-availability.zookeeper.path.checkpoints} + ${jobId}.
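In case it helps, here is a minimal sketch of listing the checkpoint znodes with Curator (the client Flink uses for ZK access). The connect string and the path below are only placeholders assembled from the default HA options and the job id in the log; please substitute your actual configuration.
{code:java}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

import java.util.List;

public class CheckpointZnodeInspector {

    public static void main(String[] args) throws Exception {
        // Placeholder ZK quorum; use the value of high-availability.zookeeper.quorum.
        String connectString = "zk-host:2181";

        // Placeholder path built as root + cluster-id + checkpoints path + jobId;
        // with default options and the job id from the JM log it could look like this.
        String checkpointPath = "/flink/default/checkpoints/e70c0d75910ce83ea50cc34ce62a241a";

        CuratorFramework client = CuratorFrameworkFactory.newClient(
                connectString, new ExponentialBackoffRetry(1000, 3));
        client.start();
        try {
            // Each child znode should correspond to one retained completed checkpoint.
            List<String> children = client.getChildren().forPath(checkpointPath);
            System.out.println("Found " + children.size() + " checkpoint node(s) under " + checkpointPath);
            for (String child : children) {
                System.out.println("  " + child);
            }
        } finally {
            client.close();
        }
    }
}
{code}
You can also inspect the same path interactively with ZooKeeper's zkCli.sh if that is easier than running code against the cluster.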

 

> Failed job reinitiated with wrong checkpoint after a ZK reconnection
> --------------------------------------------------------------------
>
>                 Key: FLINK-19778
>                 URL: https://issues.apache.org/jira/browse/FLINK-19778
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Paul Lin
>            Priority: Critical
>         Attachments: jm_log
>
>
> We have a Flink 1.11.0 job running on YARN that reached the FAILED state because its JobManager lost leadership during a ZK full GC. But after the ZK connection was recovered, the job was somehow reinitiated with no checkpoints found in ZK, so an earlier savepoint was used to restore the job, which unexpectedly rewound it.
>   
>  For details, please see the JobManager logs in the attachment.


