Posted to issues@flink.apache.org by "lihe ma (Jira)" <ji...@apache.org> on 2022/07/12 03:55:00 UTC
[jira] [Closed] (FLINK-28498) resource leak when job failed with unknown status In Application Mode
[ https://issues.apache.org/jira/browse/FLINK-28498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
lihe ma closed FLINK-28498.
---------------------------
Resolution: Duplicate
> resource leak when job failed with unknown status In Application Mode
> ---------------------------------------------------------------------
>
> Key: FLINK-28498
> URL: https://issues.apache.org/jira/browse/FLINK-28498
> Project: Flink
> Issue Type: Bug
> Components: Deployment / Kubernetes
> Affects Versions: 1.13.1
> Reporter: lihe ma
> Priority: Minor
> Attachments: cluster-pod-error.png
>
>
> I found a job that had restarted thousands of times, and the jobmanager tried to create a new taskmanager pod each time. The jobmanager restarted because the job was submitted with a duplicate job ID[1] (we preset the jobId rather than generating it), but unfortunately I did not save the logs.
> This job normally requires one taskmanager pod, but thousands of pods were eventually leaked.
> !image-2022-07-12-11-02-43-009.png|width=666,height=366!
> In application mode, cluster resources are released when the job finishes in the succeeded, failed, or canceled status[2][3]. When an exception occurs, the job may terminate in the unknown status[4].
> In this case, the job exited with the unknown status without releasing the taskmanager pods. Is it reasonable not to release the taskmanager pods when the job exits in the unknown status?
>
>
> One line from the original logs:
> 2022-07-01 09:45:40,712 [main] INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Terminating cluster entrypoint process KubernetesApplicationClusterEntrypoint with exit code 1445.
>
> [1] [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/dispatcher/Dispatcher.java#L452]
> [2] [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L90-L91]
> [3] [https://github.com/apache/flink/blob/master/flink-clients/src/main/java/org/apache/flink/client/deployment/application/ApplicationDispatcherBootstrap.java#L175]
> [4] [https://github.com/apache/flink/blob/master/flink-runtime/src/main/java/org/apache/flink/runtime/clusterframework/ApplicationStatus.java#L39]
>
>
>
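The release behavior described in the quoted report can be sketched roughly as follows. This is only an illustrative model of the reporter's description, not Flink source code: the class `ReleasePolicySketch`, the enum `Status`, and the method `releasesResources` are hypothetical names introduced here, and the real logic lives in the classes linked in [2], [3] and [4].

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model of the behavior described in the report; NOT Flink code.
class ReleasePolicySketch {

    // The four terminal statuses named in the report.
    enum Status { SUCCEEDED, FAILED, CANCELED, UNKNOWN }

    // Statuses after which, according to the report, cluster resources are released.
    static final Set<Status> RELEASING =
            EnumSet.of(Status.SUCCEEDED, Status.FAILED, Status.CANCELED);

    // Returns true if taskmanager pods would be cleaned up for the given status.
    static boolean releasesResources(Status status) {
        return RELEASING.contains(status);
    }

    public static void main(String[] args) {
        // Print the decision for every status; UNKNOWN is the case that leaks pods.
        for (Status s : Status.values()) {
            System.out.println(s + " -> release=" + releasesResources(s));
        }
    }
}
```

Under this model, UNKNOWN is the only status that skips cleanup, which matches the leak the reporter observed when the entrypoint terminated with exit code 1445.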