Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2016/05/23 17:50:13 UTC

[jira] [Commented] (YARN-4459) container-executor might kill process wrongly

    [ https://issues.apache.org/jira/browse/YARN-4459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15296729#comment-15296729 ] 

Jason Lowe commented on YARN-4459:
----------------------------------

Sorry to arrive at this late.  I agree that we should be killing the session and not the pid.  It's not a perfect solution, but it _drastically_ reduces the likelihood of the wrong process getting killed.  This could be improved by adding a just-before-kill check of some sort and/or by proactively cancelling the timer when we see the child process exit before the SIGKILL is sent.  However, rather than holding up this significant improvement while waiting for those things to be added, I propose we add this now and iterate on it further in a subsequent JIRA.

+1 for the patch.  Will commit this in a couple of days if there are no objections.
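
For illustration only, here is a rough Python sketch of the idea of signalling the session/process group and never the bare pid.  It is not the attached patch (container-executor is written in C), and the names spawn_in_new_session, looks_recycled and signal_container are hypothetical.  The session-leader test is just one assumed form the "just-before-kill check" could take.

```python
import os
import signal
import subprocess


def spawn_in_new_session(argv):
    """Start a child as the leader of its own session (pid == pgid == sid)."""
    return subprocess.Popen(argv, start_new_session=True)


def looks_recycled(pid: int) -> bool:
    """Hypothetical just-before-kill check.

    Returns True when `pid` exists but is not the leader of its own session,
    which suggests the number now belongs to some unrelated process.
    """
    try:
        return os.getsid(pid) != pid
    except ProcessLookupError:
        # The leader is gone; other members of the group may still exist,
        # so leave the decision to the group signal below.
        return False
    except PermissionError:
        # Some systems refuse getsid() across sessions; treat as unsafe.
        return True


def signal_container(pid: int, sig: int = signal.SIGKILL) -> bool:
    """Signal the whole group led by `pid`; never fall back to the bare pid.

    Returns True if the signal was sent, False if nothing was signalled.
    """
    if pid <= 1:
        raise ValueError("refusing to signal pid/pgid <= 1")
    if looks_recycled(pid):
        return False
    try:
        os.killpg(pid, sig)
    except ProcessLookupError:
        # No such group: the container has already finished, so do nothing.
        return False
    return True
```

With this shape, a pid that has been reused by a process inside someone else's session (for example a child of the NM) is left alone, because the group numbered `pid` either does not exist or is not led by a session leader.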


> container-executor might kill process wrongly
> ---------------------------------------------
>
>                 Key: YARN-4459
>                 URL: https://issues.apache.org/jira/browse/YARN-4459
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-4459.01.patch, YARN-4459.02.patch
>
>
> When 'signal_container_as_user' is called in container-executor, it first checks whether the process group exists; if it does not, it kills the process itself (if the process exists).  This is not reasonable: if the process group does not exist, the corresponding container has already finished, so killing the process itself just kills the wrong process.
> We have seen this happen in our cluster many times. We used the same account to start the NM and to submit the app, and container-executor sometimes killed the NM (the wrongly killed process might just be a newly started thread that was the NM's child process).
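
The decision logic described in the report can be contrasted with a group-only variant in a small, side-effect-free Python sketch.  The function and enum names are hypothetical and only model which target would be chosen; they are not taken from container-executor.

```python
from enum import Enum


class Target(Enum):
    GROUP = "group"  # signal the whole process group
    PID = "pid"      # signal the single process
    NONE = "none"    # signal nothing


def choose_target_with_fallback(group_alive: bool, pid_alive: bool) -> Target:
    """Behaviour as described in the report: fall back to the bare pid."""
    if group_alive:
        return Target.GROUP
    if pid_alive:
        # Hazard: the group is gone, so the container has finished and this
        # pid may now belong to an unrelated process.
        return Target.PID
    return Target.NONE


def choose_target_group_only(group_alive: bool, pid_alive: bool) -> Target:
    """Group-only variant: a missing group means there is nothing to kill."""
    return Target.GROUP if group_alive else Target.NONE
```

The two functions differ only in the case where the group is gone but some process with that pid exists, which is exactly the case the report says led to the NM being killed.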


