Posted to common-issues@hadoop.apache.org by "Anubhav Dhoot (JIRA)" <ji...@apache.org> on 2015/08/12 18:22:45 UTC

[jira] [Moved] (HADOOP-12317) Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST

     [ https://issues.apache.org/jira/browse/HADOOP-12317?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anubhav Dhoot moved YARN-4046 to HADOOP-12317:
----------------------------------------------

    Component/s:     (was: nodemanager)
            Key: HADOOP-12317  (was: YARN-4046)
        Project: Hadoop Common  (was: Hadoop YARN)

> Applications fail on NM restart on some linux distro because NM container recovery declares AM container as LOST
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-12317
>                 URL: https://issues.apache.org/jira/browse/HADOOP-12317
>             Project: Hadoop Common
>          Issue Type: Bug
>            Reporter: Anubhav Dhoot
>            Assignee: Anubhav Dhoot
>            Priority: Critical
>         Attachments: YARN-4046.002.patch, YARN-4046.002.patch, YARN-4096.001.patch
>
>
> On a Debian machine, we have seen NodeManager recovery of containers fail because the signal syntax for a process group may not work. Errors occur when checking whether the process is alive during container recovery, which causes the container to be declared LOST (exit code 154) on a NodeManager restart.
> The application then fails with the error below. The attempt is not retried.
> {noformat}
> Application application_1439244348718_0001 failed 1 times due to Attempt recovered after RM restartAM Container for appattempt_1439244348718_0001_000001 exited with exitCode: 154
> {noformat}
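
For readers unfamiliar with the process-group signal syntax mentioned in the description, here is a minimal Java sketch of how such a liveness probe could be built. It is an illustration only, not the code from the attached patches: the class name ProcessGroupProbe and both of its methods are hypothetical, and it assumes bash is available on the PATH. By POSIX convention, signal 0 performs only an existence and permission check, a negative PID addresses a whole process group, and "--" marks the end of options so that an argument such as -1234 is not read as an option or signal name.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; not part of Hadoop.
final class ProcessGroupProbe {

    // Builds the argv for a "signal 0" probe. Signal 0 delivers nothing;
    // it only reports whether the target exists and may be signalled.
    static List<String> livenessCommand(String pid, boolean processGroup) {
        // A negative PID targets the process group. The "--" ends option
        // parsing so the leading '-' is treated as part of the PID.
        String target = processGroup ? "-- -" + pid : pid;
        return Arrays.asList("bash", "-c", "kill -0 " + target);
    }

    // Runs the probe and returns true if kill exits with status 0.
    static boolean isAlive(String pid, boolean processGroup)
            throws IOException, InterruptedException {
        Process p = new ProcessBuilder(livenessCommand(pid, processGroup))
                .redirectErrorStream(true)
                .start();
        // Drain output so the child cannot block on a full pipe.
        try (InputStream in = p.getInputStream()) {
            while (in.read() != -1) {
                // discard
            }
        }
        return p.waitFor() == 0;
    }

    public static void main(String[] args) {
        // Print the two command forms side by side.
        System.out.println(livenessCommand("1234", false));
        System.out.println(livenessCommand("1234", true));
    }
}
```

If a probe of this kind returns a non-zero status only because the kill syntax was rejected, a recovering process could wrongly conclude that a live container is gone, which is consistent with the LOST result reported above.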



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)