Posted to issues@flink.apache.org by "Paul Lin (Jira)" <ji...@apache.org> on 2020/10/23 06:40:00 UTC

[jira] [Updated] (FLINK-19778) Failed job reinitiated with wrong checkpoint after a ZK reconnection

     [ https://issues.apache.org/jira/browse/FLINK-19778?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Lin updated FLINK-19778:
-----------------------------
    Description: 
We have a Flink 1.11.0 job running on YARN that reached the FAILED state because its jobmanager lost leadership during a ZK full GC. After the ZK connection recovered, however, the job was somehow reinitiated with no checkpoints found in ZK, so it was restored from an earlier savepoint, which unexpectedly rewound the job.

For details, please see the jobmanager logs in the attachment.

  was:
We have a job of Flink 1.11.0 running on YARN that reached FAILED state due to its jobmanager lost leadership during a ZK full GC. But after the ZK connection was recovered, somehow the job was reinitiated again with no checkpoints found in ZK, and hence used an earlier savepoint to restore the job, which rewound the job unexpectedly.
 
For details please see the jobmanager logs in the attachment.


> Failed job reinitiated with wrong checkpoint after a ZK reconnection
> --------------------------------------------------------------------
>
>                 Key: FLINK-19778
>                 URL: https://issues.apache.org/jira/browse/FLINK-19778
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Paul Lin
>            Priority: Critical
>         Attachments: jm_log
>
>
> We have a Flink 1.11.0 job running on YARN that reached the FAILED state because its jobmanager lost leadership during a ZK full GC. After the ZK connection recovered, however, the job was somehow reinitiated with no checkpoints found in ZK, so it was restored from an earlier savepoint, which unexpectedly rewound the job.
>
> For details, please see the jobmanager logs in the attachment.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)