Posted to mapreduce-issues@hadoop.apache.org by "Vinod Kumar Vavilapalli (Updated) (JIRA)" <ji...@apache.org> on 2011/10/20 14:15:11 UTC

[jira] [Updated] (MAPREDUCE-3228) MR AM hangs when one node goes bad

     [ https://issues.apache.org/jira/browse/MAPREDUCE-3228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Vinod Kumar Vavilapalli updated MAPREDUCE-3228:
-----------------------------------------------

    Attachment: MAPREDUCE-3228-20111020.txt

Adding timers for both {{startContainer()}} and {{stopContainer()}} so that the MR AM doesn't get stuck on faulty nodes.

Need to do some testing.
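
For illustration only, here is a minimal sketch of the general idea of bounding a blocking call with a timeout. It is not the attached patch: the class {{TimedContainerCall}}, the method {{callWithTimeout}} and the timeout values are hypothetical names chosen for this example, and it uses a {{Future}} with a bounded {{get()}} rather than whatever timer mechanism the patch uses. The blocking {{Callable}} stands in for a {{startContainer()}} or {{stopContainer()}} call to a node that never answers.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper, not part of Hadoop: bounds a blocking call with a timeout.
public class TimedContainerCall {

    /**
     * Runs the given blocking call on a worker thread and waits at most
     * timeoutMs for it. Returns true if the call finished in time, false
     * if it timed out (in which case the worker thread is interrupted).
     */
    public static boolean callWithTimeout(Callable<Void> call, long timeoutMs)
            throws Exception {
        // Daemon thread, so a call that ignores interruption cannot keep the JVM alive.
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "container-call");
            t.setDaemon(true);
            return t;
        });
        Future<Void> future = executor.submit(call);
        try {
            future.get(timeoutMs, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            // Give up on the stuck call and interrupt the worker thread.
            future.cancel(true);
            return false;
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        // Stand-in for a call to a healthy node: returns immediately.
        boolean healthy = callWithTimeout(() -> null, 1000);

        // Stand-in for a call to a faulty node: blocks far longer than the timeout.
        boolean faulty = callWithTimeout(() -> {
            Thread.sleep(60_000);
            return null;
        }, 200);

        System.out.println("healthy node call completed: " + healthy);
        System.out.println("faulty node call completed: " + faulty);
    }
}
```

With a bound like this, a caller can treat a timed-out call as a failure and move the task along instead of waiting on the node indefinitely. How the AM should actually react to such a timeout is outside the scope of this sketch.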
                
> MR AM hangs when one node goes bad
> ----------------------------------
>
>                 Key: MAPREDUCE-3228
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3228
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Vinod Kumar Vavilapalli
>            Priority: Blocker
>             Fix For: 0.23.0
>
>         Attachments: MAPREDUCE-3228-20111020.txt
>
>
> Found this on one of the gridmix runs, again. One of the nodes went bad while the job had three containers running on it. Eventually, the AM marked the tasks as timed out and initiated cleanup of the failed containers via {{stopContainer()}}. That call got stuck at the faulty node, so the tasks are stuck in the FAIL_CONTAINER_CLEANUP stage and the job sits there waiting forever.
> Thanks to [~Karams] for helping with this.
