Posted to issues@flink.apache.org by "lihe ma (Jira)" <ji...@apache.org> on 2022/07/12 03:57:00 UTC
[jira] [Updated] (FLINK-28499) resource leak when job failed with unknown status In Application Mode

     [ https://issues.apache.org/jira/browse/FLINK-28499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lihe ma updated FLINK-28499:
----------------------------
    Description: 
I found a job that had restarted thousands of times, and the JobManager tried to create a new TaskManager pod on every restart. The JobManager restarted because the job was submitted with a duplicate job ID [1] (we preset the jobId rather than letting Flink generate it). Unfortunately, I did not save the logs.

Under normal circumstances this job requires one TaskManager pod, but thousands of pods ended up leaked. See the screenshot in the attachment.

In application mode, cluster resources are released when the job finishes with status SUCCEEDED, FAILED, or CANCELED [2][3]. When an exception occurs, the job may instead terminate with status UNKNOWN [4].

In this case, the job exited with status UNKNOWN and the TaskManager pods were not released. Is it reasonable not to release the TaskManagers when the job exits with an unknown status?
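
To make the suspected behavior concrete, here is a minimal sketch of the decision as I understand it. It is only a model under the assumptions stated in this report: the class and method names are hypothetical and are not Flink APIs, and the restart count of 1000 is an illustrative number, not a measured one.

```java
import java.util.EnumSet;
import java.util.Set;

// Illustrative model only; none of these names are Flink APIs.
class UnknownStatusLeakSketch {

    // Mirrors the four status names mentioned in this report.
    enum Status { SUCCEEDED, FAILED, CANCELED, UNKNOWN }

    // Assumption taken from this report: only these statuses trigger resource release.
    static final Set<Status> RELEASING =
            EnumSet.of(Status.SUCCEEDED, Status.FAILED, Status.CANCELED);

    static boolean releasesResources(Status status) {
        return RELEASING.contains(status);
    }

    // Pods left behind after `restarts` attempts that each request
    // `podsPerAttempt` pods and all exit with `exitStatus`.
    static long leakedPods(long restarts, long podsPerAttempt, Status exitStatus) {
        return releasesResources(exitStatus) ? 0L : restarts * podsPerAttempt;
    }

    public static void main(String[] args) {
        // Hypothetical scenario: 1000 restarts, one TaskManager pod per attempt.
        for (Status s : Status.values()) {
            System.out.println(s + " -> leaked pods: " + leakedPods(1000, 1, s));
        }
    }
}
```

If this model matches the real behavior, every restart that ends in UNKNOWN leaves its TaskManager pod behind, which would explain why the leak grows with the number of restarts.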

One line from the original logs:
2022-07-01 09:45:40,712 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process KubernetesApplicationClusterEntrypoint with exit code 1445.

 

[1] [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L452]

[2] [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L90-L91]

[3] [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L175]

[4] [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L39]

 
  was:
I found a job that had restarted thousands of times, and the JobManager tried to create a new TaskManager pod on every restart. The JobManager restarted because the job was submitted with a duplicate job ID [1] (we preset the jobId rather than letting Flink generate it). Unfortunately, I did not save the logs.

Under normal circumstances this job requires one TaskManager pod, but thousands of pods ended up leaked.
!image-2022-07-12-11-02-43-009.png|width=666,height=366!

In application mode, cluster resources are released when the job finishes with status SUCCEEDED, FAILED, or CANCELED [2][3]. When an exception occurs, the job may instead terminate with status UNKNOWN [4].

In this case, the job exited with status UNKNOWN and the TaskManager pods were not released. Is it reasonable not to release the TaskManagers when the job exits with an unknown status?

 

 

One line from the original logs:
2022-07-01 09:45:40,712 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process KubernetesApplicationClusterEntrypoint with exit code 1445.

 

[1] [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L452]

[2] [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L90-L91]


[3] [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L175]

[4] [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L39]

 

 

 

       Priority: Minor  (was: Major)

> resource leak when job failed with unknown status In Application Mode
> ---------------------------------------------------------------------
>
>                 Key: FLINK-28499
>                 URL: https://issues.apache.org/jira/browse/FLINK-28499
>             Project: Flink
>          Issue Type: Bug
>          Components: Deployment / Kubernetes
>    Affects Versions: 1.13.1
>            Reporter: lihe ma
>            Priority: Minor
>         Attachments: cluster-pod-error.png
>
>
> I found a job that had restarted thousands of times, and the JobManager tried to create a new TaskManager pod on every restart. The JobManager restarted because the job was submitted with a duplicate job ID [1] (we preset the jobId rather than letting Flink generate it). Unfortunately, I did not save the logs.
> Under normal circumstances this job requires one TaskManager pod, but thousands of pods ended up leaked. See the screenshot in the attachment.
>  
> In application mode, cluster resources are released when the job finishes with status SUCCEEDED, FAILED, or CANCELED [2][3]. When an exception occurs, the job may instead terminate with status UNKNOWN [4].
> In this case, the job exited with status UNKNOWN and the TaskManager pods were not released. Is it reasonable not to release the TaskManagers when the job exits with an unknown status?
>  
>  
> One line from the original logs:
> 2022-07-01 09:45:40,712 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process KubernetesApplicationClusterEntrypoint with exit code 1445.
>  
> [1] [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L452]
> [2] [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L90-L91]
> [3] [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L175]
> [4] [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L39]
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)