Posted to mapreduce-issues@hadoop.apache.org by "Thomas Graves (JIRA)" <ji...@apache.org> on 2012/05/01 00:21:51 UTC

[jira] [Commented] (MAPREDUCE-4152) map task left hanging after AM dies trying to connect to RM

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4152?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13265449#comment-13265449 ] 

Thomas Graves commented on MAPREDUCE-4152:
------------------------------------------

{quote}
TA_CONTAINER_CLEANED is not handled in TaskAttempt when it is in the running state. This will cause more cascading errors, which we can avoid by making it a legal event in the running state too.
{quote}

Note that the app master isn't going to process this event when it's called from the stop() service, because the dispatcher is busy in the handleFinishEvent routine in the MRAppMaster. But we should handle it anyway, since it's using the async dispatcher and we could add more threads to it.
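
To illustrate why an unhandled event matters, here is a minimal, self-contained sketch of a table-driven state machine. It is not the actual TaskAttemptImpl code: the class name, the CLEANING and DONE states, and the target state of each transition are assumptions made for the example. Only TA_CONTAINER_CLEANED and the running state come from the discussion above.

```java
import java.util.EnumMap;
import java.util.Map;

// Simplified stand-in for a task attempt state machine (hypothetical, not Hadoop's classes).
class AttemptStateMachineSketch {
    enum State { RUNNING, CLEANING, DONE }          // hypothetical subset of states
    enum Event { TA_KILL, TA_CONTAINER_CLEANED }    // TA_KILL is an assumed event name

    private final Map<State, Map<Event, State>> transitions = new EnumMap<>(State.class);
    private State current = State.RUNNING;

    AttemptStateMachineSketch(boolean handleCleanedWhileRunning) {
        for (State s : State.values()) {
            transitions.put(s, new EnumMap<>(Event.class));
        }
        // The normal path: kill first, then the container cleanup completes.
        transitions.get(State.RUNNING).put(Event.TA_KILL, State.CLEANING);
        transitions.get(State.CLEANING).put(Event.TA_CONTAINER_CLEANED, State.DONE);
        // The suggested change: also accept the cleanup event while still running.
        if (handleCleanedWhileRunning) {
            transitions.get(State.RUNNING).put(Event.TA_CONTAINER_CLEANED, State.DONE);
        }
    }

    // Applies the event, or throws if no transition is registered for the current state.
    State handle(Event event) {
        State next = transitions.get(current).get(event);
        if (next == null) {
            throw new IllegalStateException("Invalid event " + event + " at " + current);
        }
        current = next;
        return current;
    }
}
```

Without the extra transition, delivering TA_CONTAINER_CLEANED to an attempt that is still running throws, which is the kind of cascading error the quoted comment wants to avoid.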

{quote}
kill() can and should be made reentrant in case the container was already killed. (There is a small race when a container can be killed twice. This happens only after the patch, as stop() is out-of-band)
{quote}

I don't see anything wrong with this as is; do you see something specific I'm missing?  Nothing bad happens when kill() is called on a container that is already killed or being killed.  We could add a check in the kill() routine to see if the container state is already ContainerState.DONE or FAILED and skip the extra call, if you think that is better.
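
A rough sketch of what that check could look like, assuming a simplified container wrapper. The class name, the stopCalls counter (standing in for the remote stop-container call), and the RUNNING state are assumptions for illustration; only kill(), ContainerState.DONE and FAILED come from the discussion. This is not the code in the attached patch.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical container wrapper used only to show a reentrant kill().
class ContainerSketch {
    enum ContainerState { RUNNING, DONE, FAILED }

    private ContainerState state = ContainerState.RUNNING;
    // Counts how often the (simulated) remote stop call is actually made.
    final AtomicInteger stopCalls = new AtomicInteger();

    // Safe to call repeatedly, including from an out-of-band stop() path.
    synchronized void kill() {
        if (state == ContainerState.DONE || state == ContainerState.FAILED) {
            return; // already finished, skip the extra call
        }
        stopCalls.incrementAndGet(); // stands in for the real stop-container request
        state = ContainerState.DONE;
    }

    synchronized ContainerState getState() {
        return state;
    }
}
```

Because kill() is synchronized and checks the state first, a second call from another thread returns without issuing a second stop request.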
                
> map task left hanging after AM dies trying to connect to RM
> -----------------------------------------------------------
>
>                 Key: MAPREDUCE-4152
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4152
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.2
>            Reporter: Thomas Graves
>            Assignee: Thomas Graves
>         Attachments: MAPREDUCE-4152.patch
>
>
> We had an instance where the RM went down for more than an hour.  The application master exited with "Could not contact RM after 360000 milliseconds".
> 2012-04-11 10:43:36,040 INFO [AsyncDispatcher event handler] org.apache.hadoop.mapreduce.v2.app.job.impl.JobImpl: job_1333003059741_15999Job Transitioned from RUNNING to ERROR

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira