Posted to dev@flink.apache.org by "Xintong Song (Jira)" <ji...@apache.org> on 2021/09/26 08:09:00 UTC
[jira] [Created] (FLINK-24377) TM resource may not be properly released after heartbeat timeout
Xintong Song created FLINK-24377:
------------------------------------
Summary: TM resource may not be properly released after heartbeat timeout
Key: FLINK-24377
URL: https://issues.apache.org/jira/browse/FLINK-24377
Project: Flink
Issue Type: Bug
Components: Deployment / Kubernetes, Deployment / YARN, Runtime / Coordination
Affects Versions: 1.13.2, 1.14.0
Reporter: Xintong Song
Assignee: Xintong Song
Fix For: 1.14.0, 1.13.3, 1.15.0
In native K8s and Yarn deployment modes, the ResourceManager (RM) disconnects a TaskManager (TM) when its heartbeat times out. However, it does not actively release the pod / container of that TM. Releasing the pod / container relies on the TM terminating itself after it fails to re-register with the RM.
In some rare conditions, the TM process may fail to terminate and hang for a long time. In such cases, K8s / Yarn sees the process as still running and thus will not release the pod / container. Neither will Flink's resource manager. Consequently, the resource is leaked until the entire application is terminated.
To fix this, we should make {{ActiveResourceManager}} actively release the resource to K8s / Yarn after a TM heartbeat timeout.
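
For illustration only, the proposed behavior could be sketched roughly as follows. This is not Flink code: ClusterDriver, SketchResourceManager, RecordingDriver and all of their methods are hypothetical names invented for this sketch, and the real {{ActiveResourceManager}} is considerably more involved. The point is only that the heartbeat timeout handler both disconnects the TM and hands the pod / container back to the cluster framework, instead of waiting for the TM process to exit on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the K8s / Yarn client. Not a Flink interface.
interface ClusterDriver {
    void releaseResource(String resourceId);
}

// Hypothetical, heavily simplified resource manager. Not Flink's ActiveResourceManager.
class SketchResourceManager {
    private final ClusterDriver driver;
    // Maps a TM id to the pod / container id that hosts it.
    private final Map<String, String> registeredTaskManagers = new HashMap<>();

    SketchResourceManager(ClusterDriver driver) {
        this.driver = driver;
    }

    void registerTaskManager(String tmId, String resourceId) {
        registeredTaskManagers.put(tmId, resourceId);
    }

    boolean isRegistered(String tmId) {
        return registeredTaskManagers.containsKey(tmId);
    }

    // Called when the heartbeat of the given TM times out.
    void onHeartbeatTimeout(String tmId) {
        // Step 1: disconnect the TM (the behavior described as already existing).
        String resourceId = registeredTaskManagers.remove(tmId);
        // Step 2: actively release the pod / container (the proposed addition),
        // so that a hanging TM process cannot keep the resource alive.
        if (resourceId != null) {
            driver.releaseResource(resourceId);
        }
    }
}

// Test double that records which resources were released.
class RecordingDriver implements ClusterDriver {
    final List<String> released = new ArrayList<>();

    @Override
    public void releaseResource(String resourceId) {
        released.add(resourceId);
    }
}
```

In this sketch a second timeout for the same TM is a no-op, because the TM has already been removed from the registry; whether the actual fix needs such a guard is an assumption, not something stated in this issue.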
--
This message was sent by Atlassian Jira
(v8.3.4#803005)