Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2020/07/24 08:26:00 UTC

[jira] [Comment Edited] (FLINK-17470) Flink task executor process permanently hangs on `flink-daemon.sh stop`, deletes PID file

    [ https://issues.apache.org/jira/browse/FLINK-17470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17164255#comment-17164255 ] 

Till Rohrmann edited comment on FLINK-17470 at 7/24/20, 8:25 AM:
-----------------------------------------------------------------

Introducing this kind of safety net could be a good idea. Based on https://stackoverflow.com/a/687994/4815083 we could add something like this:

{code}
kill -s SIGTERM $$ && kill -0 $$ || exit 0

((t = delay))

while ((t > 0)); do
    sleep $interval
    kill -0 $$ || exit 0
    ((t -= interval))
done

kill -s SIGKILL $$
{code}
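
As a rough sketch of how the snippet above might be adapted to stop a specific process rather than the current shell (`$$`), the same logic can be wrapped in a function that takes the target PID as an argument. The function name `stop_with_timeout` and the default values for the grace period and polling interval are assumptions made for illustration; none of this is taken from `flink-daemon.sh`.

```shell
# Hypothetical helper (not part of flink-daemon.sh): ask a process to stop
# with SIGTERM and escalate to SIGKILL if it is still alive after a delay.
stop_with_timeout() {
    local pid="$1"            # PID of the process to stop, e.g. read from the PID file
    local delay="${2:-10}"    # assumed grace period in whole seconds
    local interval="${3:-1}"  # assumed polling interval in whole seconds
    local t="$delay"

    # Request a graceful shutdown; if the signal cannot be delivered
    # (for example, the process is already gone), there is nothing to do.
    kill -s TERM "$pid" 2>/dev/null || return 0

    while [ "$t" -gt 0 ]; do
        sleep "$interval"
        # kill -0 sends no signal; it only checks whether the process still exists.
        kill -0 "$pid" 2>/dev/null || return 0
        t=$((t - interval))
    done

    # Grace period expired: force termination. Ignore the error if the
    # process happened to exit between the last check and this call.
    kill -s KILL "$pid" 2>/dev/null || true
}
```

It could then be called as `stop_with_timeout "$pid" 30 1`. One caveat: if the target is a child of the calling shell, it may linger as a zombie until it is reaped, in which case `kill -0` keeps succeeding until the grace period ends.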


was (Author: till.rohrmann):
Introducing this kind of safety net could be a good idea. Based on https://stackoverflow.com/a/687994/4815083 we could add something like this:

{code}
kill -s SIGTERM $$ kill -0 $$ || exit 0

((t = delay))

while ((t > 0)); do
    sleep $interval
    kill -0 $$ || exit 0
    ((t -= interval))
done

kill -s SIGKILL $$
{code}

> Flink task executor process permanently hangs on `flink-daemon.sh stop`, deletes PID file
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-17470
>                 URL: https://issues.apache.org/jira/browse/FLINK-17470
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Task
>    Affects Versions: 1.10.0
>         Environment:  
> {code:java}
> $ uname -a
> Linux hostname.local 3.10.0-1062.9.1.el7.x86_64 #1 SMP Fri Dec 6 15:49:49 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
> $ lsb_release -a
> LSB Version:	:core-4.1-amd64:core-4.1-noarch
> Distributor ID:	CentOS
> Description:	CentOS Linux release 7.7.1908 (Core)
> Release:	7.7.1908
> Codename:	Core
> {code}
> Flink version 1.10
>  
>            Reporter: Hunter Herman
>            Priority: Major
>         Attachments: flink_jstack.log, flink_mixed_jstack.log
>
>
> Hi Flink team!
> We've attempted to upgrade our Flink 1.9 cluster to 1.10, but are experiencing reproducible instability on shutdown. Specifically, it appears that the `kill` issued in the `stop` case of flink-daemon.sh is causing the task executor process to hang permanently. The process seems to be hanging in `org.apache.flink.runtime.util.JvmShutdownSafeguard$DelayedTerminator.run`, inside a `Thread.sleep()` call, which strikes me as bizarre behavior. Also note that every thread in the process is BLOCKED on a `pthread_cond_wait` call. Is this an OS-level issue? Banging my head on a wall here. See attached stack traces for details.


