Posted to yarn-issues@hadoop.apache.org by "Jun Gong (JIRA)" <ji...@apache.org> on 2016/01/04 14:01:39 UTC

[jira] [Commented] (YARN-4536) DelayedProcessKiller may not work under heavy workload

    [ https://issues.apache.org/jira/browse/YARN-4536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15081122#comment-15081122 ] 

Jun Gong commented on YARN-4536:
--------------------------------

Hi [~gu chi], thanks for reporting the issue.

{quote}but the parent process, which was persisted as the container pid, no longer exists, so the kill command cannot reach the container process.{quote}
Although the parent process no longer exists, its process group still does. *SIGKILL* is delivered to the process group, so it should still reach the container's remaining processes. Could you explain the failure in more detail? Thanks.
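
To make the process-group argument concrete, here is a minimal Python sketch. It is not the NodeManager's code: the function name and the {{sh}}/{{sleep}} stand-ins for the container launcher and the container process are assumptions made for illustration, and it assumes a POSIX system. A session leader starts a background worker and exits immediately; the sketch then checks that the worker is still in the leader's process group and that signalling the group by id still succeeds.

```python
import os
import signal
import subprocess


def signal_group_after_leader_exit() -> tuple[bool, bool]:
    """Return (worker is in the leader's group, SIGKILL to the group succeeded)."""
    # start_new_session=True runs setsid() in the child, so the shell's pid
    # doubles as the process group id (the "container pid" in this sketch).
    # The shell backgrounds a long sleep, prints that worker's pid, and exits.
    with subprocess.Popen(
        ["sh", "-c", "sleep 60 & echo $!"],
        stdout=subprocess.PIPE,
        start_new_session=True,
        text=True,
    ) as leader:
        worker_pid = int(leader.stdout.readline())
    # Leaving the with-block waits for the shell, so the group leader is gone.
    pgid = leader.pid
    same_group = os.getpgid(worker_pid) == pgid
    try:
        # Equivalent to kill(-pgid, SIGKILL): targets every process in the group.
        os.killpg(pgid, signal.SIGKILL)
        delivered = True
    except ProcessLookupError:
        # Raised only if no process with this group id exists any more.
        delivered = False
    return same_group, delivered
```

If both values come back true, the signal reached the group even though the process whose pid was recorded had already exited. The sketch does not cover the case where a container process moves itself into a different process group or session, which would be one way for the signal to miss it.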

> DelayedProcessKiller may not work under heavy workload
> ------------------------------------------------------
>
>                 Key: YARN-4536
>                 URL: https://issues.apache.org/jira/browse/YARN-4536
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.1
>            Reporter: gu-chi
>
> I am now seeing orphaned container processes. Here is the scenario:
> Under heavy task load, CPU usage on the NM machine can reach almost 100%. When a container receives a kill event, it gets {{SIGTERM}}, and then the parent process exits, leaving the container process to the OS. The container process needs to handle some shutdown events or other logic, but it can hardly get any CPU time. We would expect to see a {{SIGKILL}}, since there is a {{DelayedProcessKiller}}, but the parent process, which was persisted as the container pid, no longer exists, so the kill command cannot reach the container process. This is how the orphaned container processes arise.
> The orphaned processes do exit eventually, but this can take a very long time and makes the state of the OS worse. In my observation, it can take several hours.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)