Posted to issues@flink.apache.org by "Matthias Pohl (Jira)" <ji...@apache.org> on 2022/04/20 10:26:00 UTC

[jira] [Comment Edited] (FLINK-26772) Application Mode does not wait for job cleanup during shutdown

    [ https://issues.apache.org/jira/browse/FLINK-26772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524296#comment-17524296 ] 

Matthias Pohl edited comment on FLINK-26772 at 4/20/22 10:25 AM:
-----------------------------------------------------------------

I tried to reproduce it with the standalone cluster but failed: the {{ClusterEntrypoint}} process kept running until I resolved the cleanup issue, which is the expected behavior. The logs of the k8s run revealed a {{SIGTERM}}, which might be k8s-specific:
{code:java}
2022-03-21 09:34:41,129 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Closing the slot manager.
2022-03-21 09:34:41,129 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Suspending the slot manager.
2022-03-21 09:34:41,133 DEBUG org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         [] - The RpcEndpoint resourcemanager_0 terminated successfully.
2022-03-21 09:34:41,136 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-03-21 09:34:41,136 DEBUG org.apache.flink.runtime.rpc.akka.SupervisorActor            [] - AkkaRpcActor akka://flink/user/rpc/resourcemanager_0 has terminated.
2022-03-21 09:34:41,151 INFO  org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:6124 {code}
[~wangyang0918] do you have any guess as to which external process might have sent the SIGTERM while the Application Mode cluster was shutting down?


was (Author: mapohl):
I tried to reproduce it with the standalone cluster but failed: The {{ClusterEntrypoint}} process kept running until I resolved the cleanup issue which is the expected behavior. The logs of the k8s run revealed a {{SIGTERM}} which might be k8s-specific:
{code:java}
2022-03-21 09:34:41,129 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Closing the slot manager.
2022-03-21 09:34:41,129 INFO  org.apache.flink.runtime.resourcemanager.slotmanager.DeclarativeSlotManager [] - Suspending the slot manager.
2022-03-21 09:34:41,133 DEBUG org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor         [] - The RpcEndpoint resourcemanager_0 terminated successfully.
2022-03-21 09:34:41,136 INFO  org.apache.flink.runtime.entrypoint.ClusterEntrypoint        [] - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
2022-03-21 09:34:41,136 DEBUG org.apache.flink.runtime.rpc.akka.SupervisorActor            [] - AkkaRpcActor akka://flink/user/rpc/resourcemanager_0 has terminated.
2022-03-21 09:34:41,151 INFO  org.apache.flink.runtime.blob.BlobServer                     [] - Stopped BLOB server at 0.0.0.0:6124 {code}
[~yangwang166] do you have any guess what external process might have sent the SIGTERM while the Application Mode cluster is shutting down?

> Application Mode does not wait for job cleanup during shutdown
> --------------------------------------------------------------
>
>                 Key: FLINK-26772
>                 URL: https://issues.apache.org/jira/browse/FLINK-26772
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Mika Naylor
>            Assignee: Matthias Pohl
>            Priority: Critical
>              Labels: pull-request-available
>         Attachments: FLINK-26772.standalone-job.log, testcluster-599f4d476b-bghw5_log.txt
>
>
> We discovered that in Application Mode, once the application has completed, the cluster is shut down even if resource cleanup is still running in the background. For example, if HA cleanup fails, no further retries are attempted because the cluster is shut down before they can happen.
>  
> We should also add a flag for the shutdown that will prevent further jobs from being submitted.
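
The intended ordering described above (retry the cleanup until it succeeds, and only then shut the cluster down) can be sketched in plain Java. This is a minimal illustration under stated assumptions, not Flink's implementation: the class {{CleanupAwareShutdown}}, the {{Cleanup}} interface, and the methods {{cleanupWithRetry}} and {{simulate}} are hypothetical names, and the fixed retry delay is chosen only for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch; none of these names exist in Flink.
public class CleanupAwareShutdown {

    /** Stand-in for a per-job cleanup step (for example, removing HA data). */
    interface Cleanup {
        void run() throws Exception;
    }

    /** Returns a future that completes only once the cleanup has succeeded. */
    static CompletableFuture<Void> cleanupWithRetry(
            Cleanup cleanup, ScheduledExecutorService executor, long delayMillis) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        attempt(cleanup, executor, delayMillis, result);
        return result;
    }

    private static void attempt(
            Cleanup cleanup,
            ScheduledExecutorService executor,
            long delayMillis,
            CompletableFuture<Void> result) {
        executor.execute(() -> {
            try {
                cleanup.run();
                result.complete(null);
            } catch (Exception e) {
                // Cleanup failed: schedule another attempt after a fixed delay.
                executor.schedule(
                        () -> attempt(cleanup, executor, delayMillis, result),
                        delayMillis,
                        TimeUnit.MILLISECONDS);
            }
        });
    }

    /** Simulates a cleanup that fails the given number of times before succeeding. */
    static List<String> simulate(int failures) throws Exception {
        List<String> events = Collections.synchronizedList(new ArrayList<>());
        AtomicInteger remainingFailures = new AtomicInteger(failures);
        ScheduledExecutorService executor = Executors.newSingleThreadScheduledExecutor();
        try {
            CompletableFuture<Void> cleanupDone = cleanupWithRetry(() -> {
                if (remainingFailures.getAndDecrement() > 0) {
                    events.add("cleanup failed");
                    throw new IllegalStateException("simulated HA cleanup failure");
                }
                events.add("cleanup succeeded");
            }, executor, 10L);

            // The shutdown step is chained behind the cleanup future, so it
            // cannot run while retries are still pending.
            cleanupDone
                    .thenRun(() -> events.add("cluster shut down"))
                    .get(5, TimeUnit.SECONDS);
        } finally {
            executor.shutdownNow();
        }
        return events;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(simulate(2));
    }
}
```

With two simulated failures, the event list contains two failed attempts, one successful attempt, and the shutdown entry last. An external {{SIGTERM}}, as seen in the k8s logs, would bypass this ordering, because the process is terminated regardless of whether the cleanup future has completed.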


