Posted to dev@slider.apache.org by "Steve Loughran (JIRA)" <ji...@apache.org> on 2014/09/18 11:54:34 UTC
[jira] [Commented] (SLIDER-439) RM never fulfilled Slider AM's container request after NM died on a node where HRegionServer was running

    [ https://issues.apache.org/jira/browse/SLIDER-439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14138737#comment-14138737 ] 

Steve Loughran commented on SLIDER-439:
---------------------------------------

This looks like a YARN quirk. Do you want to file a JIRA there?

> RM never fulfilled Slider AM's container request after NM died on a node where HRegionServer was running
> --------------------------------------------------------------------------------------------------------
>
>                 Key: SLIDER-439
>                 URL: https://issues.apache.org/jira/browse/SLIDER-439
>             Project: Slider
>          Issue Type: Bug
>          Components: appmaster
>            Reporter: Gour Saha
>            Assignee: Steve Loughran
>
> Steps to reproduce:
> - Set up a 3-node cluster (in non-HA mode)
> - Run slider create for the HBase app-package (with only the HMaster and HRegionServer components, to keep things simple)
> - Assume that the HRegionServer came up on a node different from the one hosting the HMaster and the Slider AM (if not, running destroy and create a couple of times will get you to this setup)
> - Kill the NM in the node where HRegionServer is running
> - Wait for at least 10 minutes (do not restart NM on this node)
> - At this point the Slider AM has received the onNodesUpdated and onContainersCompleted events from the RM, unregistered the container, and requested a new one from the RM
> - This time the request for a new container was never fulfilled, even after waiting for several minutes
> Expected:
> - Given that nothing else was running on the cluster, the container request should have been fulfilled by the RM
> Interesting observation:
> - After waiting long enough, I restarted the NM on the node where it had been killed. Surprisingly, the new container request was fulfilled at that point, and the container with HRegionServer came up on that same node. It seemed as if the RM was waiting for the NM to come back up on this node (affinity?), although it had marked the node dead a long time before.
> Here is the Slider AM log snippet, starting from the time it receives the onNodesUpdated event:
> {noformat}
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Nodes updated
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: onContainersCompleted([1]
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Container Completion for containerID=container_1410935367006_0001_01_000002, state=COMPLETE, exitStatus=-100, diagnostics=Container released on a *lost* node
> 14/09/17 07:02:47 INFO state.AppState: Failed container in role[2] : HBASE_REGIONSERVER
> 14/09/17 07:02:47 INFO state.AppState: Current count of failed role[2] HBASE_REGIONSERVER =  1
> 14/09/17 07:02:47 INFO state.AppState: Removing node ID container_1410935367006_0001_01_000002
> 14/09/17 07:02:47 ERROR appmaster.SliderAppMaster: Role instance RoleInstance{role='HBASE_REGIONSERVER', id='container_1410935367006_0001_01_000002', container=ContainerID=container_1410935367006_0001_01_000002 nodeID=c6403.ambari.apache.org:45454 http=c6403.ambari.apache.org:8042 priority=2, createTime=1410936271481, startTime=1410936271543, released=false, roleId=2, host=c6403.ambari.apache.org, hostURL=http://c6403.ambari.apache.org:8042, state=5, exitCode=-100, command='python ./infra/agent/slider-agent/agent/main.py --label container_1410935367006_0001_01_000002___HBASE_REGIONSERVER --zk-quorum c6401.ambari.apache.org:2181,c6402.ambari.apache.org:2181,c6403.ambari.apache.org:2181 --zk-reg-path /registry/org-apache-slider/cl1 > <LOG_DIR>/agent.out 2>&1 ; ', diagnostics='Container released on a *lost* node', output=null, environment=[AGENT_WORK_ROOT="$PWD", HADOOP_USER_NAME="yarn", AGENT_LOG_ROOT="$LOG_DIRS", PYTHONPATH="./infra/agent/slider-agent/", SLIDER_PASSPHRASE="DEV"]} failed
> 14/09/17 07:02:47 INFO appmaster.SliderAppMaster: Unregistering component container_1410935367006_0001_01_000002
> 14/09/17 07:02:47 INFO state.AppState: RoleStatus{name='HBASE_REGIONSERVER', key=2, desired=1, actual=0, requested=0, releasing=0, failed=1, started=1, startFailed=0, completed=0, failureMessage='Failure container_1410935367006_0001_01_000002 on host c6403.ambari.apache.org: http://c6402.ambari.apache.org:19888/jobhistory/logs/c6403.ambari.apache.org:45454/container_1410935367006_0001_01_000002/ctx/yarn'}
> 14/09/17 07:02:47 INFO state.AppState: HBASE_REGIONSERVER: Asking for 1 more nodes(s) for a total of 1
> 14/09/17 07:02:47 INFO state.RoleHistory: There're 1 nodes to consider for HBASE_REGIONSERVER
> 14/09/17 07:02:47 INFO state.OutstandingRequest: Submitting request for container on c6403.ambari.apache.org
> 14/09/17 07:02:47 INFO state.AppState: Container ask is Capability[<memory:256, vCores:1>]Priority[2]
> 14/09/17 07:02:47 INFO state.AppState: RoleStatus{name='HBASE_MASTER', key=1, desired=1, actual=1, requested=0, releasing=0, failed=0, started=1, startFailed=0, completed=0, failureMessage=''}
> 14/09/17 07:02:47 INFO util.RackResolver: Resolved c6403.ambari.apache.org to /default-rack
> {noformat}
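
For anyone triaging similar logs, here is a minimal sketch of a check for the pattern visible in the snippet above: a container completes with "lost node" diagnostics, and the follow-up request is then submitted for that same host. The function name and regular expressions are hypothetical helpers written against the log lines quoted here; they are not part of Slider or YARN, and a match only flags the pattern, it does not establish the cause.

```python
import re

# Hypothetical patterns, written against the Slider AM log lines quoted above.
COMPLETION_RE = re.compile(
    r"Container Completion for containerID=(\S+?), .*diagnostics=(.*)$")
FAILURE_HOST_RE = re.compile(r"Failure (container_\w+) on host ([\w.\-]+):")
REQUEST_RE = re.compile(r"Submitting request for container on ([\w.\-]+)")


def requests_pinned_to_lost_hosts(log_text):
    """Return hosts that lost a container and were then asked for a new one."""
    lost_containers = set()   # container IDs completed with "lost" diagnostics
    container_host = {}       # container ID -> host, from failureMessage lines
    requested_hosts = set()   # hosts named in outstanding container requests

    for line in log_text.splitlines():
        m = COMPLETION_RE.search(line)
        if m and "lost" in m.group(2):
            lost_containers.add(m.group(1))
        m = FAILURE_HOST_RE.search(line)
        if m:
            container_host[m.group(1)] = m.group(2)
        m = REQUEST_RE.search(line)
        if m:
            requested_hosts.add(m.group(1))

    lost_hosts = {container_host[c] for c in lost_containers
                  if c in container_host}
    return sorted(lost_hosts & requested_hosts)
```

Run against the snippet in this issue, the check would report c6403.ambari.apache.org, the host named in both the failure message and the "Submitting request for container on" line.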



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)