Posted to yarn-issues@hadoop.apache.org by "Rohith (JIRA)" <ji...@apache.org> on 2015/02/19 05:12:12 UTC

[jira] [Commented] (YARN-3222) RMNodeImpl#ReconnectNodeTransition should send scheduler events in sequential order

    [ https://issues.apache.org/jira/browse/YARN-3222?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14326965#comment-14326965 ] 

Rohith commented on YARN-3222:
------------------------------

Attaching logs that give more information about the issue. In the log below, the RM shut down with an NPE while handling the NODE_RESOURCE_UPDATE event. Note the scheduler events dispatched from AsyncDispatcher in *org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.\**: the order is NODE_REMOVED --> NODE_RESOURCE_UPDATE --> NODE_ADDED --> NODE_LABELS_UPDATE.
{noformat}
2015-02-19 09:14:57,212 INFO  [main] util.RackResolver (RackResolver.java:coreResolve(109)) - Resolved 127.0.0.1 to /default-rack
2015-02-19 09:14:57,213 INFO  [main] resourcemanager.ResourceTrackerService (ResourceTrackerService.java:registerNodeManager(313)) - Reconnect from the node at: 127.0.0.1
2015-02-19 09:14:57,215 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeReconnectEvent.EventType: RECONNECTED
2015-02-19 09:14:57,215 INFO  [main] resourcemanager.ResourceTrackerService (ResourceTrackerService.java:registerNodeManager(343)) - NodeManager from node 127.0.0.1(cmPort: 1234 httpPort: 3) registered with capability: <memory:16384, vCores:16>, assigned nodeId 127.0.0.1:1234
2015-02-19 09:14:57,215 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(412)) - Processing 127.0.0.1:1234 of type RECONNECTED
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeRemovedSchedulerEvent.EventType: NODE_REMOVED
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeStartedEvent.EventType: STARTED
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(412)) - Processing 127.0.0.1:1234 of type STARTED
2015-02-19 09:14:57,266 INFO  [AsyncDispatcher event handler] rmnode.RMNodeImpl (RMNodeImpl.java:handle(424)) - 127.0.0.1:1234 Node Transitioned from NEW to RUNNING
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEvent.EventType: NODE_USABLE
2015-02-19 09:14:57,266 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeResourceUpdateSchedulerEvent.EventType: NODE_RESOURCE_UPDATE
2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeAddedSchedulerEvent.EventType: NODE_ADDED
2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.NodesListManagerEvent.EventType: NODE_USABLE
2015-02-19 09:14:57,267 DEBUG [AsyncDispatcher event handler] event.AsyncDispatcher (AsyncDispatcher.java:dispatch(166)) - Dispatching the event org.apache.hadoop.yarn.server.resourcemanager.scheduler.event.NodeLabelsUpdateSchedulerEvent.EventType: NODE_LABELS_UPDATE
2015-02-19 09:14:57,267 INFO  [ResourceManager Event Processor] capacity.CapacityScheduler (CapacityScheduler.java:removeNode(1267)) - Removed node 127.0.0.1:1234 clusterResource: <memory:0, vCores:0>
2015-02-19 09:14:57,267 FATAL [ResourceManager Event Processor] resourcemanager.ResourceManager (ResourceManager.java:run(688)) - Error in handling event type NODE_RESOURCE_UPDATE to the scheduler
java.lang.NullPointerException
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.updateNodeResource(AbstractYarnScheduler.java:548)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.updateNodeAndQueueResource(CapacityScheduler.java:992)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1119)
	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:120)
	at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:679)
	at java.lang.Thread.run(Thread.java:745)
2015-02-19 09:14:57,280 INFO  [ResourceManager Event Processor] resourcemanager.ResourceManager (ResourceManager.java:run(692)) - Exiting, bbye..
{noformat}
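
To make the ordering problem concrete, here is a minimal toy model of it. This is only a sketch: the class, the map-based bookkeeping, and the replay helper are hypothetical and are not the Hadoop classes. It assumes that a resource update for a node the scheduler no longer tracks fails with an NPE, which is what the stack trace above suggests.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Toy model only: every name here is hypothetical, not the real YARN code.
class SchedulerEventOrderSketch {

    enum EventType { NODE_ADDED, NODE_REMOVED, NODE_RESOURCE_UPDATE }

    // Stand-in for the scheduler's per-node bookkeeping (node id -> memory in MB).
    static final class ToyScheduler {
        private final Map<String, Integer> memoryByNode = new HashMap<>();

        void handle(EventType type, String nodeId, int memoryMb) {
            switch (type) {
                case NODE_ADDED:
                    memoryByNode.put(nodeId, memoryMb);
                    break;
                case NODE_REMOVED:
                    memoryByNode.remove(nodeId);
                    break;
                case NODE_RESOURCE_UPDATE:
                    // Assumption: updating an unknown node fails with an NPE.
                    Objects.requireNonNull(memoryByNode.get(nodeId),
                            "node not known to scheduler: " + nodeId);
                    memoryByNode.put(nodeId, memoryMb);
                    break;
            }
        }
    }

    // Replays the events against a scheduler that already tracks the node.
    // Returns true if every event was handled, false if one failed with an NPE.
    static boolean replay(List<EventType> order) {
        String nodeId = "127.0.0.1:1234";
        ToyScheduler scheduler = new ToyScheduler();
        scheduler.handle(EventType.NODE_ADDED, nodeId, 16384);
        try {
            for (EventType type : order) {
                scheduler.handle(type, nodeId, 16384);
            }
            return true;
        } catch (NullPointerException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        List<EventType> observed = Arrays.asList(
                EventType.NODE_REMOVED, EventType.NODE_RESOURCE_UPDATE, EventType.NODE_ADDED);
        List<EventType> sequential = Arrays.asList(
                EventType.NODE_REMOVED, EventType.NODE_ADDED, EventType.NODE_RESOURCE_UPDATE);
        System.out.println("observed order handled: " + replay(observed));
        System.out.println("sequential order handled: " + replay(sequential));
    }
}
```

In this model the order seen in the log fails at NODE_RESOURCE_UPDATE because the node was just removed, while the same events with NODE_ADDED delivered before NODE_RESOURCE_UPDATE are all handled.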

> RMNodeImpl#ReconnectNodeTransition should send scheduler events in sequential order
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-3222
>                 URL: https://issues.apache.org/jira/browse/YARN-3222
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Rohith
>            Assignee: Rohith
>            Priority: Critical
>
> When a node is reconnected, RMNodeImpl#ReconnectNodeTransition notifies the scheduler with the events node_added, node_removed, or node_resource_update. These events should be sent in sequential order, i.e. the node_added event first and then the node_resource_update event.
> But if the node is reconnected with a different HTTP port, the order of scheduler events is node_removed --> node_resource_update --> node_added, which causes the scheduler to not find the node, throw an NPE, and the RM to exit.
> The node_resource_update event should always be triggered via RMNodeEventType.RESOURCE_UPDATE.


