Posted to yarn-issues@hadoop.apache.org by "Daryn Sharp (JIRA)" <ji...@apache.org> on 2012/08/31 20:53:08 UTC

[jira] [Commented] (YARN-68) NodeManager will refuse to shutdown indefinitely due to container log aggregation

    [ https://issues.apache.org/jira/browse/YARN-68?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13446246#comment-13446246 ] 

Daryn Sharp commented on YARN-68:
---------------------------------

This also prevents the NM from internally restarting after the RM is bounced, or after the NM has been out of sync for too long.  The stop sets a boolean to signal the (nonexistent or stuck) aggregation thread to finish, and then waits for that same (nonexistent or stuck) thread to set another boolean saying it has finished.  This causes the NM to wait forever and be unresponsive to shutdowns or internal restarts.
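
For illustration, a minimal Java sketch of that two-boolean hand-off (hypothetical class and field names, not the actual AppLogAggregatorImpl members): if the aggregation thread is stuck, or was never started in the first place, stopAndWait() spins forever, which matches the "Waiting for aggregation to complete" messages quoted below.

public class AppLogAggregatorHandoffSketch {

  // stopAndWait() flips this to ask the aggregation loop to wrap up.
  private volatile boolean appFinished = false;
  // The aggregation thread flips this once the final upload is done.
  private volatile boolean aggregationFinished = false;

  // Runs on a dedicated per-application thread -- if that thread ever starts.
  public void runAggregation() {
    while (!appFinished) {
      // ... upload logs of newly finished containers ...
    }
    // ... upload logs of any remaining containers ...
    aggregationFinished = true; // never reached if the thread is stuck or never started
  }

  // Called from the NM shutdown / resync path.
  public void stopAndWait() throws InterruptedException {
    appFinished = true; // signal the (possibly nonexistent) aggregation thread
    while (!aggregationFinished) {
      // logs "Waiting for aggregation to complete for <appId>" on each pass
      Thread.sleep(1000); // unbounded wait: blocks shutdown and internal restart forever
    }
  }
}

Nothing in this pattern bounds the wait or checks that the thread actually exists, so every path that calls stop -- daemon shutdown, RM bounce, NM resync -- hangs with it.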
                
> NodeManager will refuse to shutdown indefinitely due to container log aggregation
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-68
>                 URL: https://issues.apache.org/jira/browse/YARN-68
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3
>         Environment: QE
>            Reporter: patrick white
>            Assignee: Daryn Sharp
>
> The nodemanager can get into a state where containermanager.logaggregation.AppLogAggregatorImpl will apparently wait
> indefinitely for log aggregation to complete for an application, even if that application has abnormally terminated and is no longer present.
> Observed behavior is that an attempt to stop the nodemanager daemon returns but has no effect; the NM log continually displays messages similar to this:
> [Thread-1]2012-08-21 17:44:07,581 INFO
> org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl:
> Waiting for aggregation to complete for application_1345221477405_2733
> The only recovery we found to work was to 'kill -9' the nm process.
> What exactly causes the NM to enter this state is unclear, but we see the behavior reliably when the NM has run a task which failed. For example, when debugging oozie distcp actions and a distcp map task fails, the NM that was running the container enters this state and a shutdown of that NM never completes; 'never' in this case meant waiting 2 hours before killing the nodemanager process.
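
As a rough illustration only (hypothetical names, not the patch attached to this JIRA), one way to avoid the hang is to wait on the aggregation thread itself with a timeout, rather than spinning on a completion flag that may never be set:

public final class BoundedStopSketch {

  // Join the aggregation thread for at most timeoutMillis, then give up so
  // NM shutdown / resync can proceed even when aggregation is wedged.
  public static void stopAggregator(Thread aggregatorThread, long timeoutMillis)
      throws InterruptedException {
    if (aggregatorThread == null || !aggregatorThread.isAlive()) {
      return; // nothing to wait for: thread never started or already finished
    }
    aggregatorThread.interrupt();        // ask it to wrap up
    aggregatorThread.join(timeoutMillis);
    if (aggregatorThread.isAlive()) {
      System.err.println("Aggregation did not finish within "
          + timeoutMillis + " ms; abandoning it so shutdown can continue");
    }
  }
}

A bound like this lets the daemon stop eventually proceed, at the cost of possibly abandoning some container logs.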

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira