Posted to issues@flink.apache.org by "Thomas Weise (Jira)" <ji...@apache.org> on 2022/12/03 15:20:00 UTC

[jira] [Commented] (FLINK-29109) Checkpoint path conflict with stateless upgrade mode

    [ https://issues.apache.org/jira/browse/FLINK-29109?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17642849#comment-17642849 ] 

Thomas Weise commented on FLINK-29109:
--------------------------------------

[~gyfora] thanks for catching this. Because the jobId that Flink assigns is deterministic (it is derived from HighAvailabilityOptions.HA_CLUSTER_ID), we also need to apply the random jobId in stateless upgrade mode for Flink versions >= 1.16 to avoid checkpoint path collisions.

https://github.com/apache/flink/blob/e70fe68dea764606180ca3728184c00fc63ea0ff/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L227

> Checkpoint path conflict with stateless upgrade mode
> ----------------------------------------------------
>
>                 Key: FLINK-29109
>                 URL: https://issues.apache.org/jira/browse/FLINK-29109
>             Project: Flink
>          Issue Type: Bug
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.1.0
>            Reporter: Thomas Weise
>            Assignee: Thomas Weise
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: kubernetes-operator-1.2.0
>
>
> A stateful job with stateless upgrade mode (yes, there are such use cases) fails with a checkpoint path conflict due to the constant jobId and FLINK-19358 (applies to Flink < 1.16x). Because the checkpoint id resets on restart in stateless upgrade mode, the job writes to previously used locations and fails. The workaround is to rotate the jobId on every redeploy when the upgrade mode is stateless. While this can be done externally, it is best done in the operator itself: reconciliation determines when a restart is actually required, whereas rotating the jobId externally may trigger unnecessary restarts.
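
For illustration only, the workaround described above could be sketched roughly as follows. This is not the operator's actual code. The class JobIdRotation, the UpgradeMode enum and prepareDeployConfig are hypothetical names, the configuration is modeled as a plain map, and the key "$internal.pipeline.job-id" is an assumption that should be checked against the Flink version in use.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Sketch only: rotates the job id on redeploy when the upgrade mode is stateless. */
final class JobIdRotation {

    /** Hypothetical stand-in for the operator's upgrade mode setting. */
    enum UpgradeMode { STATELESS, LAST_STATE, SAVEPOINT }

    // Assumed config key for a fixed job id; verify against the Flink version in use.
    static final String FIXED_JOB_ID_KEY = "$internal.pipeline.job-id";

    /** Returns a random 32-character lowercase hex string, the textual shape of a job id. */
    static String randomJobIdHex() {
        return UUID.randomUUID().toString().replace("-", "");
    }

    /**
     * Returns a copy of the deploy config. For stateless upgrades the job id is
     * replaced with a fresh random one, so a restarted job does not write into
     * checkpoint locations used by the previous run. Other modes are left
     * unchanged so that existing state stays addressable under the same job id.
     */
    static Map<String, String> prepareDeployConfig(Map<String, String> conf, UpgradeMode mode) {
        Map<String, String> result = new HashMap<>(conf);
        if (mode == UpgradeMode.STATELESS) {
            result.put(FIXED_JOB_ID_KEY, randomJobIdHex());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("state.checkpoints.dir", "s3://bucket/checkpoints"); // example value

        // Two redeploys in stateless mode each get their own job id.
        Map<String, String> first = prepareDeployConfig(conf, UpgradeMode.STATELESS);
        Map<String, String> second = prepareDeployConfig(conf, UpgradeMode.STATELESS);
        System.out.println(first.get(FIXED_JOB_ID_KEY));
        System.out.println(second.get(FIXED_JOB_ID_KEY));
    }
}
```

In this sketch the rotation happens only at the point where a deploy config is prepared, which mirrors the argument for doing it inside the operator: the id changes only when a redeploy actually takes place.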


