Posted to mapreduce-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/09/08 23:37:46 UTC
[jira] [Updated] (MAPREDUCE-5982) Task attempts that fail from the ASSIGNED state can disappear

     [ https://issues.apache.org/jira/browse/MAPREDUCE-5982?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated MAPREDUCE-5982:
----------------------------------
    Status: Open  (was: Patch Available)

Thanks for the patch, Chang!

Note that the point of this change is to let users locate any potential logs for task attempts that failed in the ASSIGNED state.  With a canned fake started event, there's no way to determine which nodemanager tried to run the container, and therefore we can't provide a good logs link.  We need to preserve as much information as we can about the task attempt, and that includes the host, http port, etc.

The good news is that we have most of this information from the container that was assigned to the task attempt.  See the code for LaunchedContainerTransition for details.  It would be nice to see some of the code in that transition factored out so it can be reused when we are creating the start event for an attempt that failed in the ASSIGNED state.  Also I would hesitate to call it a fake event.  It's still a task started event, but we are missing just a few key components like the shuffle port and the start time.  If we factor out the code from LaunchedContainerTransition then we can drop the "fake" part.
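
As a rough illustration of the refactoring suggested above, the sketch below builds the started event through one shared helper for both the normal launch path and the failed-in-ASSIGNED path.  Every class, field, and method name here is a hypothetical stand-in rather than the actual MR AM code, and the -1 sentinel for an unknown shuffle port is an assumption.

```java
// Hypothetical stand-ins for illustration only; these are NOT the real MR AM classes.
final class AssignedContainer {
    final String containerId;
    final String nodeHost;
    final int nodeHttpPort;

    AssignedContainer(String containerId, String nodeHost, int nodeHttpPort) {
        this.containerId = containerId;
        this.nodeHost = nodeHost;
        this.nodeHttpPort = nodeHttpPort;
    }
}

final class AttemptStartedEvent {
    final String attemptId;
    final String containerId;
    final String nodeHost;
    final int nodeHttpPort;
    final int shufflePort;
    final long startTime;

    AttemptStartedEvent(String attemptId, String containerId, String nodeHost,
                        int nodeHttpPort, int shufflePort, long startTime) {
        this.attemptId = attemptId;
        this.containerId = containerId;
        this.nodeHost = nodeHost;
        this.nodeHttpPort = nodeHttpPort;
        this.shufflePort = shufflePort;
        this.startTime = startTime;
    }
}

final class StartedEvents {
    // Assumed sentinel for "the container never reported a shuffle port".
    static final int UNKNOWN_SHUFFLE_PORT = -1;

    private StartedEvents() {}

    // Shared helper: everything known from the assigned container is copied
    // into the event, so a logs link can still point at the right node.
    static AttemptStartedEvent fromContainer(String attemptId, AssignedContainer c,
                                             int shufflePort, long startTime) {
        return new AttemptStartedEvent(attemptId, c.containerId, c.nodeHost,
                                       c.nodeHttpPort, shufflePort, startTime);
    }

    // Normal path: the container launched and reported its shuffle port.
    static AttemptStartedEvent forLaunched(String attemptId, AssignedContainer c,
                                           int shufflePort, long launchTime) {
        return fromContainer(attemptId, c, shufflePort, launchTime);
    }

    // Failure path: the attempt never left ASSIGNED, so the shuffle port is
    // unknown and the time of processing is used as the start time (assumption).
    static AttemptStartedEvent forFailedWhileAssigned(String attemptId,
                                                      AssignedContainer c, long now) {
        return fromContainer(attemptId, c, UNKNOWN_SHUFFLE_PORT, now);
    }
}
```

With a shared helper like this, the event produced for an ASSIGNED failure is an ordinary started event that happens to lack a shuffle port, so nothing about it needs to be labeled "fake".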

Is forceFinishTime really necessary?  We can go ahead and set the launch time as we are processing the task started event and then just call setFinishTime.

In general I think we should worry about making sure we generate a proper task start event and then let the normal task unsuccessful completion event code handle things after that.  For example, in DeallocateContainerTransition I think we should be generating the job counter update events for this scenario, but we don't since we go down a different task unsuccessful completion event handling path when launchTime is zero.  Seems like we should just generate the missing start event when launchTime is zero, then fall through to the normal unsuccessful completion event handling code in all cases after that.
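
The control flow described above can be sketched roughly as follows.  This is a simplified model, not the real DeallocateContainerTransition: the class, the method, and the event labels are all hypothetical, and the order of the counter update relative to the completion event is only illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified model of a task attempt; not the real MR AM class.
final class AttemptSketch {
    long launchTime = 0;   // zero means no started event has been generated yet
    long finishTime = 0;

    // Records the events that would be sent, using made-up labels.
    final List<String> emitted = new ArrayList<>();

    void onUnsuccessfulCompletion(long now) {
        if (launchTime == 0) {
            // The attempt failed before launch: generate the missing started
            // event and set the launch time while processing it.
            launchTime = now;
            emitted.add("ATTEMPT_STARTED");
        }
        // From here on there is a single path for all cases, so the job
        // counter update is never skipped.
        finishTime = now;
        emitted.add("JOB_COUNTER_UPDATE");
        emitted.add("ATTEMPT_UNSUCCESSFUL_COMPLETION");
    }
}
```

In this shape the launch time is set as part of handling the started event and the finish time is set with a plain assignment afterwards, which is why a separate forced-finish-time mechanism may not be needed.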

Nit: missing whitespace before new method in MRApp.


> Task attempts that fail from the ASSIGNED state can disappear
> -------------------------------------------------------------
>
>                 Key: MAPREDUCE-5982
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5982
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 2.7.1, 2.2.1, 0.23.10
>            Reporter: Jason Lowe
>            Assignee: Chang Li
>         Attachments: MAPREDUCE-5982.2.patch, MAPREDUCE-5982.3.patch, MAPREDUCE-5982.4.patch, MAPREDUCE-5982.patch
>
>
> If a task attempt fails in the ASSIGNED state, e.g., container launch fails, then it can disappear from the job history.  The task overview page will show subsequent attempts but the attempt in question is simply missing.  For example, attempt ID 1 appears but attempt ID 0 is missing.  Similarly in the job overview page the task attempt doesn't appear in any of the failed/killed/succeeded counts or pages.  It's as if the task attempt never existed, but the AM logs show otherwise.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)