Posted to yarn-issues@hadoop.apache.org by "Wangda Tan (JIRA)" <ji...@apache.org> on 2014/04/25 23:25:18 UTC
[jira] [Commented] (YARN-1885) yarn logs command does not provide the application logs for some applications

    [ https://issues.apache.org/jira/browse/YARN-1885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13981630#comment-13981630 ] 

Wangda Tan commented on YARN-1885:
----------------------------------

*This happens when an application has completed in the RM but the NM never receives the application clean-up message after the RM restarts. This causes a series of problems, including but not limited to:*
* Log aggregation sometimes does not work.
* The application is shown as “RUNNING” on the NM’s web page although it has already terminated in the RM.

*The bug can be reproduced as follows (in a recovery-enabled cluster):*
1) Submit an application to the RM that contains a deliberate error causing the AM to fail.
2) Restart the RM before the application’s state transitions to FAILED.
3) After the RM restarts and the NM registers, the application state becomes FAILED in the RM, but the NM still shows it as running.

*Multiple places can cause this problem*
1) Race condition in ResourceTrackerService.registerNodeManager
The container status handling logic,
{code}
    if (!request.getContainerStatuses().isEmpty()) {
      LOG.info("received container statuses on node manager register :"
          + request.getContainerStatuses());
      for (ContainerStatus containerStatus : request.getContainerStatuses()) {
        handleContainerStatus(containerStatus);
      }
    }
{code}
is executed before the RMNodeImpl instance is created and registered,
{code}
    RMNode rmNode = new RMNodeImpl(nodeId, rmContext, host, cmPort, httpPort,
        resolve(host), ResourceOption.newInstance(capability, RMNode.OVER_COMMIT_TIMEOUT_MILLIS_DEFAULT),
        nodeManagerVersion);

    RMNode oldNode = this.rmContext.getRMNodes().putIfAbsent(nodeId, rmNode);
    if (oldNode == null) {
      this.rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeEvent(nodeId, RMNodeEventType.STARTED));
    } else {
      LOG.info("Reconnect from the node at: " + host);
      this.nmLivelinessMonitor.unregister(nodeId);
      this.rmContext.getDispatcher().getEventHandler().handle(
          new RMNodeReconnectEvent(nodeId, rmNode));
    }
{code}
So RMAppImpl.FinalTransition finishes the application but cannot notify the corresponding RMNode.
2) RMAppAttempt cannot recover the full ranNodes set after a restart (RMAppAttempt is set to the LAUNCHED state after a restart)
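
The ordering problem in cause 1 can be illustrated with a small stand-alone model. This is only a sketch: the class, the map, and the method names below are hypothetical stand-ins and are not YARN code. It assumes that a clean-up notification can only be delivered to a node that is already present in the RM's node map.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of the registration ordering; not real YARN classes.
class RegisterOrderSketch {
    // Stand-in for rmContext.getRMNodes(): nodeId -> apps the node must clean up.
    static final Map<String, List<String>> rmNodes = new HashMap<>();

    // Stand-in for the notification sent when an app reaches a final state.
    // Returns false when the node is not known yet, i.e. the notification is lost.
    static boolean notifyNodeOfFinishedApp(String nodeId, String appId) {
        List<String> pending = rmNodes.get(nodeId);
        if (pending == null) {
            return false;
        }
        pending.add(appId);
        return true;
    }

    // Order described in the comment: container statuses handled first, node added afterwards.
    static boolean registerStatusesFirst(String nodeId, String appId) {
        rmNodes.clear();
        boolean delivered = notifyNodeOfFinishedApp(nodeId, appId);
        rmNodes.putIfAbsent(nodeId, new ArrayList<>());
        return delivered;
    }

    // Alternative order: node added first, container statuses handled afterwards.
    static boolean registerNodeFirst(String nodeId, String appId) {
        rmNodes.clear();
        rmNodes.putIfAbsent(nodeId, new ArrayList<>());
        return notifyNodeOfFinishedApp(nodeId, appId);
    }

    public static void main(String[] args) {
        System.out.println("statuses first: delivered="
            + registerStatusesFirst("node_1", "app_1"));
        System.out.println("node first: delivered="
            + registerNodeFirst("node_1", "app_1"));
    }
}
```

In this model the first ordering loses the notification and the second delivers it, which matches the symptom of the NM still showing the application as running.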

*Proposal*
1) Have the NM send the full list of its running applications when it registers with the RM
2) ResourceTrackerService (RTS for short) will then,
* If the RMApp is not in a final state, add the RMNode to the RMAppAttempt’s ranNodes.
* If the RMApp is already in a final state, send an RMNodeCleanAppEvent to the RMNode.

3) Address the race condition described above
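
A rough sketch of how the three proposal items could fit together is shown below. It is a simplified model, not a patch: all types and names are hypothetical, the maps stand in for RMAppAttempt's ranNodes and for RMNodeCleanAppEvent delivery, and applications unknown to the RM are simply skipped because the proposal does not say how to treat them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the proposal; none of these types are real YARN classes.
class RegistrationProposalSketch {
    enum AppState { RUNNING, FINISHED, FAILED, KILLED }

    // appId -> application state as known to the RM after recovery
    final Map<String, AppState> apps = new HashMap<>();
    // appId -> nodes the app ran on (stands in for RMAppAttempt's ranNodes)
    final Map<String, Set<String>> ranNodes = new HashMap<>();
    // nodeId -> apps the node was told to clean up (stands in for RMNodeCleanAppEvent)
    final Map<String, List<String>> cleanAppEvents = new HashMap<>();

    static boolean isFinal(AppState state) {
        return state != AppState.RUNNING;
    }

    // Item 1: the NM passes its full list of running applications at registration.
    void registerNodeManager(String nodeId, List<String> runningAppsOnNode) {
        // Item 3: make the node known before handling anything it reported.
        cleanAppEvents.putIfAbsent(nodeId, new ArrayList<>());
        for (String appId : runningAppsOnNode) {
            AppState state = apps.get(appId);
            if (state == null) {
                continue; // unknown app: not covered by the proposal, skipped here
            }
            if (isFinal(state)) {
                // Item 2, second bullet: the app is already done, tell the node to clean up.
                cleanAppEvents.get(nodeId).add(appId);
            } else {
                // Item 2, first bullet: the app is still live, remember the node.
                ranNodes.computeIfAbsent(appId, k -> new HashSet<>()).add(nodeId);
            }
        }
    }

    public static void main(String[] args) {
        RegistrationProposalSketch rm = new RegistrationProposalSketch();
        rm.apps.put("app_1", AppState.FAILED);
        rm.apps.put("app_2", AppState.RUNNING);
        rm.registerNodeManager("node_1", Arrays.asList("app_1", "app_2"));
        System.out.println("clean up on node_1: " + rm.cleanAppEvents.get("node_1"));
        System.out.println("ranNodes of app_2: " + rm.ranNodes.get("app_2"));
    }
}
```

With one failed and one running application reported by the node, the failed one ends up in the node's clean-up list and the running one gets the node recorded in its ranNodes.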


> yarn logs command does not provide the application logs for some applications
> -----------------------------------------------------------------------------
>
>                 Key: YARN-1885
>                 URL: https://issues.apache.org/jira/browse/YARN-1885
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.4.0
>            Reporter: Arpit Gupta
>            Assignee: Wangda Tan
>
> During our HA testing we have seen cases where YARN application logs are not available through the CLI, but I can look at the AM logs through the UI. The RM was also being restarted in the background while the application was running.



--
This message was sent by Atlassian JIRA
(v6.2#6252)