Posted to mapreduce-issues@hadoop.apache.org by "Gera Shegalov (JIRA)" <ji...@apache.org> on 2014/02/24 21:03:23 UTC

[jira] [Updated] (MAPREDUCE-5044) Have AM trigger jstack on task attempts that timeout before killing them

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gera Shegalov updated MAPREDUCE-5044:
-------------------------------------

    Attachment: MAPREDUCE-5044.v04.patch

v04 applies on top of YARN-1515.v05. It now makes sure that a thread dump is also created in uber mode.

Added unit tests for a normal MR job and an uber MR job.

While working on this I realized that we actually need to discuss how mapreduce.task.timeout is treated in uber mode. Right now it is basically ignored: the AM does not kill itself, and LocalContainerLauncher processes CONTAINER_REMOTE_CLEANUP inline with the stuck task in SubtaskRunner. The liveness monitor for the AM in the RM does not catch the problem either, because RMCommunicator heartbeats in a separate allocator thread.

I am considering two options:
- move heartbeat() into SubtaskRunner in uber mode, so that the liveness monitor catches the stuck ubertask.
- do System.exit(errorcode) when TA_TIMEOUT occurs.
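
To illustrate why the first option would help, here is a rough sketch of the liveness problem using a logical clock instead of real threads. None of these names are Hadoop classes: Monitor, monitorCatchesStuckTask and all the timing values are made up for illustration, and the real RM liveness monitor and RMCommunicator are more involved than this.

```java
/** Illustrative only: these are hypothetical names, not Hadoop classes. */
final class UberLivenessSketch {

    /** Expires a monitored party once its last heartbeat is older than the limit. */
    static final class Monitor {
        private final long expiryMillis;
        private long lastHeartbeat;

        Monitor(long expiryMillis) {
            this.expiryMillis = expiryMillis;
        }

        void heartbeat(long now) {
            lastHeartbeat = now;
        }

        boolean isExpired(long now) {
            return now - lastHeartbeat > expiryMillis;
        }
    }

    /**
     * Steps a logical clock and reports whether the monitor ever notices
     * that the task has hung. All durations are assumed values.
     */
    static boolean monitorCatchesStuckTask(boolean heartbeatFromTaskLoop) {
        final long stepMillis = 1_000;
        final long stuckAtMillis = 5_000;  // the task stops making progress here
        final long endMillis = 60_000;
        Monitor monitor = new Monitor(10_000);

        for (long now = 0; now <= endMillis; now += stepMillis) {
            boolean taskAlive = now < stuckAtMillis;
            // A separate heartbeat thread keeps pinging whatever the task does;
            // a heartbeat issued from the task loop stops when the task hangs.
            if (!heartbeatFromTaskLoop || taskAlive) {
                monitor.heartbeat(now);
            }
            if (monitor.isExpired(now)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("separate heartbeat thread catches hang: "
                + monitorCatchesStuckTask(false));
        System.out.println("heartbeat in task loop catches hang: "
                + monitorCatchesStuckTask(true));
    }
}
```

With the heartbeat decoupled from the task, the monitor never sees a gap and the hang goes unnoticed; with the heartbeat issued from the task loop, the monitor expires the AM once the gap exceeds its limit. The second option avoids relying on the monitor at all, at the cost of the AM exiting on its own.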

 

> Have AM trigger jstack on task attempts that timeout before killing them
> ------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5044
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5044
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: mr-am
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>            Assignee: Gera Shegalov
>         Attachments: MAPREDUCE-5044.v01.patch, MAPREDUCE-5044.v02.patch, MAPREDUCE-5044.v03.patch, MAPREDUCE-5044.v04.patch, Screen Shot 2013-11-12 at 1.05.32 PM.png, Screen Shot 2013-11-12 at 1.06.04 PM.png
>
>
> When an AM expires a task attempt it would be nice if it triggered a jstack output via SIGQUIT before killing the task attempt.  This would be invaluable for helping users debug their hung tasks, especially if they do not have shell access to the nodes.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)