Posted to issues@flink.apache.org by "Gyula Fora (Jira)" <ji...@apache.org> on 2022/04/03 19:00:00 UTC

[jira] [Commented] (FLINK-26140) Add basic handling mechanism to deal with job upgrade errors

    [ https://issues.apache.org/jira/browse/FLINK-26140?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17516578#comment-17516578 ] 

Gyula Fora commented on FLINK-26140:
------------------------------------

One straightforward way to implement this would be to add a new field to the status called *lastStableSpec*.

lastStableSpec would be somewhat similar to lastReconciledSpec, but whereas lastReconciledSpec is updated by the reconciler, lastStableSpec would be updated by the observer, based on some stability condition.

We could start with a simple checkpoint condition: a spec would be marked stable once the resulting job run has completed at least one successful checkpoint or savepoint. Later we can add user-configurable stability conditions, but this is a good start.
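
To make the idea concrete, here is a rough sketch of how the observer side might look. The class, field types and method names are hypothetical and heavily simplified (specs are plain strings here), so this is an illustration of the checkpoint condition under those assumptions, not the operator's actual code.

```java
// Hypothetical, simplified stand-in for the operator's status handling.
final class StabilitySketch {

    /** Minimal status holding the two spec snapshots discussed above. */
    static final class Status {
        String lastReconciledSpec; // written by the reconciler after a deployment
        String lastStableSpec;     // written by the observer once the job looks stable
    }

    /**
     * Observer step: promote the last reconciled spec to "stable" once the
     * running job has completed at least one checkpoint or savepoint.
     * Returns true only if lastStableSpec was changed by this call.
     */
    static boolean observe(Status status, long completedCheckpoints) {
        if (status.lastReconciledSpec == null || completedCheckpoints < 1) {
            return false; // nothing deployed yet, or stability condition not met
        }
        if (status.lastReconciledSpec.equals(status.lastStableSpec)) {
            return false; // already marked stable, nothing to update
        }
        status.lastStableSpec = status.lastReconciledSpec;
        return true;
    }
}
```

The stability check is kept in one place so that a user-configurable condition could later replace the hard-coded checkpoint count.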

Once we have the *lastStableSpec* field working, we could introduce the rollback strategy. If rollback is enabled, then on any deployment error (as opposed to a reconciliation error) the job would be rolled back to the *lastStableSpec*. To execute the rollback we can reuse the logic from the reconciler with some slight modifications.
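
As a rough sketch of the rollback decision only: the enum, method and parameter names below are made up for illustration, and specs are again plain strings. How deployment errors are actually detected and how the reconciler logic gets reused is left open.

```java
// Hypothetical sketch of the rollback decision; not the operator's actual API.
final class RollbackSketch {

    /** Error categories as distinguished in the comment above. */
    enum ErrorKind { NONE, RECONCILIATION, DEPLOYMENT }

    /**
     * Picks the spec that should be running. Only deployment errors trigger
     * a rollback, and only if rollback is enabled and a stable spec exists.
     */
    static String specToDeploy(
            boolean rollbackEnabled,
            ErrorKind error,
            String lastReconciledSpec,
            String lastStableSpec) {
        boolean shouldRollBack =
                rollbackEnabled
                        && error == ErrorKind.DEPLOYMENT
                        && lastStableSpec != null;
        return shouldRollBack ? lastStableSpec : lastReconciledSpec;
    }
}
```

With rollback disabled, or when no spec has been marked stable yet, this falls back to the last reconciled spec, i.e. the current behaviour.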

[~wangyang0918] [~aitozi] wdyt?

> Add basic handling mechanism to deal with job upgrade errors
> ------------------------------------------------------------
>
>                 Key: FLINK-26140
>                 URL: https://issues.apache.org/jira/browse/FLINK-26140
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Gyula Fora
>            Priority: Major
>             Fix For: kubernetes-operator-1.0.0
>
>
> There are various ways in which a stateful job upgrade can fail.
> For example:
> - Failure/timeout during savepoint
> - Incompatible state
> - Corrupted / not-found checkpoint
> - Error after restart
> We should allow some strategies for the user to declare how to handle the different error scenarios (such as roll back to earlier state) and what should be treated as a fatal error.