Posted to issues@flink.apache.org by "Xin Hao (Jira)" <ji...@apache.org> on 2022/10/10 11:31:00 UTC

[jira] [Created] (FLINK-29566) Reschedule the cleanup logic if cancel job failed

Xin Hao created FLINK-29566:
-------------------------------

             Summary: Reschedule the cleanup logic if cancel job failed
                 Key: FLINK-29566
                 URL: https://issues.apache.org/jira/browse/FLINK-29566
             Project: Flink
          Issue Type: Improvement
          Components: Kubernetes Operator
            Reporter: Xin Hao


Currently, when a FlinkSessionJob object is deleted, the operator always removes the object, even if the Flink job was not canceled successfully.

 

This is semantically inconsistent: the FlinkSessionJob has been removed, but the Flink job is still running.

 

One scenario where this happens is a FlinkDeployment deployed in HA mode.

If we remove the FlinkSessionJob and change the FlinkDeployment at the same time, or if the TMs are restarting because of a problem such as an OOM, the cancellation of the Flink job fails because the TMs are not available.

 

We should reschedule the cleanup logic if the FlinkDeployment is still present.

We could also add a new ReconciliationState, DELETING, to indicate the FlinkSessionJob's status.
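
To make the proposal concrete, here is a minimal Java sketch of the decision it describes. It is not the operator's actual code: SessionJobCleanupSketch, Status, JobCanceller, CleanupResult, the DEPLOYED placeholder state and the retry delay are all hypothetical names invented for illustration. Only the DELETING state and the "reschedule if the FlinkDeployment is present" rule come from this issue.

```java
import java.time.Duration;

// Hypothetical sketch: none of these types belong to the Flink Kubernetes Operator API.
final class SessionJobCleanupSketch {

    // DELETING is the state proposed in this issue; DEPLOYED is only a placeholder.
    enum ReconciliationState { DEPLOYED, DELETING }

    // Minimal stand-in for the FlinkSessionJob status.
    static final class Status {
        ReconciliationState state = ReconciliationState.DEPLOYED;
    }

    // Stand-in for whatever call cancels the job on the session cluster.
    interface JobCanceller {
        void cancel() throws Exception;
    }

    // Outcome of one cleanup attempt.
    static final class CleanupResult {
        final boolean removeObject; // true: the FlinkSessionJob may be removed now
        final Duration retryAfter;  // non-null only when cleanup must be retried

        private CleanupResult(boolean removeObject, Duration retryAfter) {
            this.removeObject = removeObject;
            this.retryAfter = retryAfter;
        }
    }

    static CleanupResult cleanup(boolean deploymentPresent, JobCanceller canceller,
                                 Duration retryDelay, Status status) {
        if (!deploymentPresent) {
            // No FlinkDeployment left, so there is no cluster to cancel the job on.
            return new CleanupResult(true, null);
        }
        try {
            canceller.cancel();
            return new CleanupResult(true, null);
        } catch (Exception e) {
            // Cancel failed (for example, TMs not available): keep the object,
            // mark it as DELETING and ask to be called again after the delay.
            status.state = ReconciliationState.DELETING;
            return new CleanupResult(false, retryDelay);
        }
    }

    public static void main(String[] args) {
        Status status = new Status();
        CleanupResult result = cleanup(
                true,
                () -> { throw new IllegalStateException("TMs not available"); },
                Duration.ofSeconds(15),
                status);
        System.out.println(result.removeObject + " " + status.state + " " + result.retryAfter);
    }
}
```

In this sketch the object is removed only once the cancel call succeeds or the FlinkDeployment is gone. Until then the status shows DELETING, so users can see that the deletion is still in progress.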



--
This message was sent by Atlassian Jira
(v8.20.10#820010)