Posted to issues@flink.apache.org by "Gyula Fora (Jira)" <ji...@apache.org> on 2022/10/14 13:31:00 UTC

[jira] [Commented] (FLINK-29566) Reschedule the cleanup logic if cancel job failed

    [ https://issues.apache.org/jira/browse/FLINK-29566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17617730#comment-17617730 ] 

Gyula Fora commented on FLINK-29566:
------------------------------------

I think this improvement makes sense :) 

> Reschedule the cleanup logic if cancel job failed
> -------------------------------------------------
>
>                 Key: FLINK-29566
>                 URL: https://issues.apache.org/jira/browse/FLINK-29566
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>            Reporter: Xin Hao
>            Priority: Minor
>
> Currently, when a FlinkSessionJob object is deleted,
> the object is always removed, even if the Flink job was not canceled successfully.
>  
> This is *semantically inconsistent*: the FlinkSessionJob has been removed but the Flink job is still running.
>  
> One scenario is a FlinkDeployment deployed in HA mode.
> If we remove the FlinkSessionJob and change the FlinkDeployment at the same time,
> or if the TMs are restarting because of a bug such as an OOM,
> the cancellation of the Flink job will fail because the TMs are not available.
>  
> We should *reschedule* the cleanup logic if the FlinkDeployment is present.
> And we can add a new ReconciliationState DELETING to indicate the FlinkSessionJob's status.
>  
> The logic will be
> {code:java}
> if the FlinkDeployment is not present
>     delete the FlinkSessionJob object
> else
>     if the JM is not available
>         reschedule
>     else
>         if the job is canceled successfully
>             delete the FlinkSessionJob object
>         else
>             reschedule{code}
> When we cancel the Flink job, we need to verify that all jobs with the same name have been deleted, in case the job ID changed after a JM restart.
>  
>  
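
The decision logic in the quoted description can be sketched in Java as below. This is only an illustration under stated assumptions, not the operator's actual code: the class SessionJobCleanupSketch, the enum Action, and the method decide are hypothetical names, and the cancelJobs callback stands in for "cancel every job with the same name and report whether all of them are gone".

```java
import java.util.function.BooleanSupplier;

/** Hypothetical sketch; none of these names come from the Flink Kubernetes Operator. */
final class SessionJobCleanupSketch {

    /** Outcome of one cleanup attempt for a FlinkSessionJob. */
    enum Action { DELETE_RESOURCE, RESCHEDULE }

    /**
     * Mirrors the pseudocode from the issue description.
     * cancelJobs is only invoked when the deployment exists and the JM is reachable.
     */
    static Action decide(boolean deploymentPresent,
                         boolean jobManagerAvailable,
                         BooleanSupplier cancelJobs) {
        if (!deploymentPresent) {
            // Nothing left to cancel, so the resource can be removed.
            return Action.DELETE_RESOURCE;
        }
        if (!jobManagerAvailable) {
            // The cancel call cannot succeed right now; try again later.
            return Action.RESCHEDULE;
        }
        // Assumed contract: true only if all jobs with the same name are gone.
        return cancelJobs.getAsBoolean() ? Action.DELETE_RESOURCE : Action.RESCHEDULE;
    }

    public static void main(String[] args) {
        System.out.println(decide(false, false, () -> false));
        System.out.println(decide(true, false, () -> true));
        System.out.println(decide(true, true, () -> true));
        System.out.println(decide(true, true, () -> false));
    }
}
```

Passing the cancel step as a callback keeps the sketch from attempting a cancel when the JM is known to be unavailable, which matches the nesting of the pseudocode.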



--
This message was sent by Atlassian Jira
(v8.20.10#820010)