Posted to yarn-issues@hadoop.apache.org by "Sandy Ryza (JIRA)" <ji...@apache.org> on 2013/02/22 22:28:13 UTC

[jira] [Commented] (YARN-24) Nodemanager fails to start if log aggregation enabled and namenode unavailable

    [ https://issues.apache.org/jira/browse/YARN-24?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13584711#comment-13584711 ] 

Sandy Ryza commented on YARN-24:
--------------------------------

I encountered this when trying to start an NM and a namenode at the same time.  The NM shut down because the namenode was in safe mode.  Having the NM die in this way introduces a dependency on the order in which services are started.

Log aggregation is checked each time an app is run on a node, and the app is immediately killed if a log folder cannot be used for it.  Thus, merely removing the NM's self-shutdown on startup doesn't introduce any correctness issues.  The worst that could happen is that time is wasted scheduling more containers on a node already known to have trouble connecting to the namenode.

Attached a patch that stops the NM from killing itself on startup.  At initApp time, if verifyAndCreateRemoteLogDir has not yet completed successfully, it is called again, and the app is failed if that call fails.  If initApp fails five consecutive times, the NM sets its status to unhealthy.
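
To make the behavior described above concrete, here is a minimal, self-contained sketch of the retry-at-initApp idea.  It is not the code in YARN-24.patch: the class name LazyRemoteLogDirCheck, the BooleanSupplier stand-in for verifyAndCreateRemoteLogDir, the isHealthy accessor, and the choice to clear the unhealthy flag after a later success are all assumptions made for illustration.

```java
import java.util.function.BooleanSupplier;

/**
 * Hypothetical sketch of deferring the remote log dir check to initApp.
 * All names here are illustrative; none are taken from the actual patch.
 */
class LazyRemoteLogDirCheck {
    // Threshold taken from the comment: five consecutive initApp failures.
    static final int MAX_CONSECUTIVE_FAILURES = 5;

    // Stand-in for verifyAndCreateRemoteLogDir: returns true on success.
    private final BooleanSupplier verifyAndCreateRemoteLogDir;
    private boolean remoteDirVerified = false;
    private int consecutiveFailures = 0;
    private boolean healthy = true;

    LazyRemoteLogDirCheck(BooleanSupplier verifyAndCreateRemoteLogDir) {
        this.verifyAndCreateRemoteLogDir = verifyAndCreateRemoteLogDir;
    }

    /** Returns true if the app may proceed, false if it must be failed. */
    synchronized boolean initApp(String appId) {
        // Retry the check only while it has never succeeded.
        if (!remoteDirVerified) {
            remoteDirVerified = verifyAndCreateRemoteLogDir.getAsBoolean();
        }
        if (remoteDirVerified) {
            consecutiveFailures = 0;
            // Assumption: a success clears the unhealthy flag.
            healthy = true;
            return true;
        }
        consecutiveFailures++;
        if (consecutiveFailures >= MAX_CONSECUTIVE_FAILURES) {
            healthy = false; // node reports itself unhealthy instead of exiting
        }
        return false;
    }

    synchronized boolean isHealthy() {
        return healthy;
    }

    public static void main(String[] args) {
        // Simulate a namenode that is unavailable for the first two attempts.
        int[] attempts = {0};
        LazyRemoteLogDirCheck nm = new LazyRemoteLogDirCheck(() -> ++attempts[0] > 2);
        for (int i = 0; i < 4; i++) {
            boolean ok = nm.initApp("app_" + i);
            System.out.println("app_" + i + " started=" + ok + " healthy=" + nm.isHealthy());
        }
    }
}
```

The point of the sketch is that startup no longer depends on the namenode: the check is paid for by the first app that needs it, and repeated failures degrade the node's health status rather than terminating the process.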

I agree that if an NM loses its ability to connect to the namenode after an app has started, it would be good for the NM to report that it wasn't able to write its logs, but I think that is a more difficult issue and does not need to be tied to this change.
                
> Nodemanager fails to start if log aggregation enabled and namenode unavailable
> ------------------------------------------------------------------------------
>
>                 Key: YARN-24
>                 URL: https://issues.apache.org/jira/browse/YARN-24
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 0.23.3, 2.0.0-alpha
>            Reporter: Jason Lowe
>         Attachments: YARN-24.patch
>
>
> If log aggregation is enabled and the namenode is currently unavailable, the nodemanager fails to start up.
