Posted to issues@flink.apache.org by "Yang Wang (Jira)" <ji...@apache.org> on 2022/03/31 02:37:00 UTC
[jira] [Comment Edited] (FLINK-26930) Rethink last-state upgrade implementation in flink-kubernetes-operator

    [ https://issues.apache.org/jira/browse/FLINK-26930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17515028#comment-17515028 ] 

Yang Wang edited comment on FLINK-26930 at 3/31/22, 2:36 AM:
-------------------------------------------------------------

I am hesitant to store the last checkpoint path in the HA store, which means not only the K8s ConfigMap but also ZooKeeper. Even though it is a minimal, backward-compatible change, it feels like a small temporary hack, since its only purpose is to expose checkpoint information to be picked up by external tools. Let's discuss this further here and create a new ticket if needed.

 

[~dmvk] is suggesting storing the retained checkpoints in the JobResultStore (JRS)[1], which might not work for the JobManager crash backoff scenario.

 

[1]. https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=195726435



> Rethink last-state upgrade implementation in flink-kubernetes-operator
> ----------------------------------------------------------------------
>
>                 Key: FLINK-26930
>                 URL: https://issues.apache.org/jira/browse/FLINK-26930
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Yang Wang
>            Priority: Major
>
> Following the discussion in FLINK-26916.
>  
> How does the last-state upgrade work now?
> First, the Flink cluster is deleted directly while the HA ConfigMap is retained. This leaves the job in a "SUSPENDED" state. Then flink-kubernetes-operator deploys a new Flink application with the same cluster-id so that it can recover from the latest checkpoint. Note that before the application starts, the JobGraph is deleted from the HA ConfigMap. This ensures that the newly changed job options take effect.
>  
> Some community developers are considering extending the JRS so that the stored job result contains the list of retained checkpoints. This of course implies that the cluster is shut down / the job is terminated properly (other cases should be used for fail-over scenarios only).
>  
> As soon as there is a straightforward way of accessing the last checkpoint, we should improve the current implementation.
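
For readers following the description above, the current last-state upgrade sequence can be sketched roughly as below. This is only an in-memory illustration, not the operator's actual code: HaConfigMap, FakeKubernetes and last_state_upgrade are hypothetical names, and the ConfigMap layout (one stored JobGraph plus an ordered list of checkpoint pointers) is an assumption made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HaConfigMap:
    """Hypothetical stand-in for the HA ConfigMap kept per cluster-id."""
    job_graph: Optional[str] = None
    checkpoints: List[str] = field(default_factory=list)  # oldest first


@dataclass
class FakeKubernetes:
    """Hypothetical in-memory cluster state; not a real Kubernetes client."""
    clusters: Dict[str, str] = field(default_factory=dict)  # cluster-id -> job spec
    ha_config_maps: Dict[str, HaConfigMap] = field(default_factory=dict)


def last_state_upgrade(k8s: FakeKubernetes, cluster_id: str, new_job_spec: str) -> Optional[str]:
    """Apply the three steps from the description; return the checkpoint restored from."""
    # Step 1: delete the Flink cluster directly, but retain the HA ConfigMap.
    k8s.clusters.pop(cluster_id, None)
    ha = k8s.ha_config_maps[cluster_id]

    # Step 2: delete the stored JobGraph so the changed job options take effect.
    ha.job_graph = None

    # Step 3: redeploy with the same cluster-id; the new application finds the
    # retained checkpoint pointers and recovers from the latest one.
    restore_from = ha.checkpoints[-1] if ha.checkpoints else None
    k8s.clusters[cluster_id] = new_job_spec
    ha.job_graph = new_job_spec
    return restore_from
```

In this sketch the latest checkpoint is only reachable through the retained HA data, which is the limitation the ticket wants to revisit once there is a straightforward way of accessing the last checkpoint.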



--
This message was sent by Atlassian Jira
(v8.20.1#820001)