Posted to issues@flink.apache.org by "Yang Wang (Jira)" <ji...@apache.org> on 2021/01/19 10:32:00 UTC

[jira] [Comment Edited] (FLINK-21008) ClusterEntrypoint#shutDownAsync may not be fully executed

    [ https://issues.apache.org/jira/browse/FLINK-21008?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267804#comment-17267804 ] 

Yang Wang edited comment on FLINK-21008 at 1/19/21, 10:31 AM:
--------------------------------------------------------------

You are right. Deregistering the application from K8s (i.e. deleting the JobManager deployment) makes the kubelet send a SIGTERM to the JobManager process. But Yarn has the same behavior. The reason why we do not run into this issue when deploying a Flink application on Yarn is that the SIGTERM is sent a little later: the Yarn ResourceManager tells the NodeManager to kill the JobManager (SIGTERM followed by a SIGKILL) via the heartbeat, whose interval is 3 seconds by default. On Kubernetes, however, the kubelet is informed via a watch, so there is no such delay.

If the cluster entrypoint took more than 3 seconds for the internal cleanup ({{stopClusterServices}} and {{cleanupDirectories}}), we would run into the same situation on a Yarn deployment, as the sketch below illustrates.
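
To make the race concrete, here is a minimal Java sketch (the class and method names are hypothetical, not Flink code) of a process that deregisters itself and then cleans up; if the SIGTERM triggered by the deregistration arrives before the slow cleanup finishes, the remaining steps never run:

{code:java}
public final class ShutdownRaceDemo {

    public static void main(String[] args) throws Exception {
        // On SIGTERM the JVM only runs shutdown hooks; whatever is still
        // pending on the main thread is simply abandoned.
        Runtime.getRuntime().addShutdownHook(
                new Thread(() -> System.out.println("SIGTERM received, JVM exiting")));

        deregisterApplication(); // step 1: triggers kubelet/NodeManager to kill us
        stopClusterServices();   // step 2: skipped if SIGTERM arrives first
        cleanupDirectories();    // step 3: likewise
    }

    private static void deregisterApplication() {
        // e.g. delete the JobManager deployment; on K8s the kubelet reacts
        // via a watch, so the SIGTERM follows almost immediately.
    }

    private static void stopClusterServices() throws InterruptedException {
        Thread.sleep(5_000); // simulate a cleanup slower than Yarn's 3s heartbeat
    }

    private static void cleanupDirectories() {
        System.out.println("cleanup finished"); // never printed if killed earlier
    }
}
{code}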


> ClusterEntrypoint#shutDownAsync may not be fully executed
> ---------------------------------------------------------
>
>                 Key: FLINK-21008
>                 URL: https://issues.apache.org/jira/browse/FLINK-21008
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.11.3, 1.12.1
>            Reporter: Yang Wang
>            Priority: Critical
>             Fix For: 1.13.0
>
>
> Recently, in our internal use case for the native K8s integration with K8s HA enabled, we found that the leader-related ConfigMaps could be left behind in some corner cases.
> After some investigation, I think this is possibly caused by an inappropriate shutdown process.
> In {{ClusterEntrypoint#shutDownAsync}}, we first call {{closeClusterComponent}}, which also deregisters the Flink application from the cluster management (e.g. Yarn, K8s). Then we call {{stopClusterServices}} and {{cleanupDirectories}}. If the cluster management performs the deregistration very quickly, the JobManager process receives SIGNAL 15 (SIGTERM) before or while {{stopClusterServices}} and {{cleanupDirectories}} are executing. The JVM process then exits directly, so the two methods may not be executed.
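
A simplified sketch of the shutdown ordering described above (illustrative only; the real {{ClusterEntrypoint}} method signatures and future chain differ):

{code:java}
import java.util.concurrent.CompletableFuture;

public final class EntrypointShutdownSketch {

    CompletableFuture<Void> shutDownAsync() {
        // Step 1: stop the cluster components, which includes deregistering
        // the application from the cluster management (Yarn, K8s). On K8s,
        // deleting the JobManager deployment makes the kubelet send SIGTERM
        // to this very process almost immediately.
        return closeClusterComponent()
                // Steps 2 and 3 race against the incoming SIGTERM and may
                // never execute, leaving e.g. the HA ConfigMaps behind.
                .thenRun(this::stopClusterServices)
                .thenRun(this::cleanupDirectories);
    }

    CompletableFuture<Void> closeClusterComponent() {
        return CompletableFuture.completedFuture(null);
    }

    void stopClusterServices() { /* stop RPC, metrics, HA services, ... */ }

    void cleanupDirectories() { /* remove temporary working directories */ }
}
{code}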


