Posted to yarn-issues@hadoop.apache.org by "Prabhu Joseph (Jira)" <ji...@apache.org> on 2023/01/25 17:36:00 UTC
[jira] [Commented] (YARN-11417) RM Crashes when changing Node Label of a Node in Distributed Configuration

    [ https://issues.apache.org/jira/browse/YARN-11417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17680718#comment-17680718 ] 

Prabhu Joseph commented on YARN-11417:
--------------------------------------

When a NodeManager's node label is changed and the NodeManager is restarted, it resyncs with the ResourceManager using the new label. CapacityScheduler first receives the NODE_LABELS_UPDATE event, which removes the node from the nodesList of the old partition in the nodesPerLabel map (<partition, nodesList>) as part of ClusterNodeTracker#updateNodesPerPartition. CapacityScheduler then receives NODE_REMOVED, which removes the node from the ClusterNodeTracker and also tries to remove it from the nodesList of the new partition in nodesPerLabel. That fails with an NPE because the new partition is not yet present in the nodesPerLabel map; it is added only after the NODE_ADDED event.

When the new partition is absent from nodesPerLabel, ClusterNodeTracker#removeNode can skip removing the node from nodesPerLabel, since the node was already removed from it during NODE_LABELS_UPDATE.
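
The event ordering and the proposed guard can be illustrated with a small standalone model. This is only a sketch, not the Hadoop code: the class PartitionTrackerSketch, its methods, and the nodeToLabel map are hypothetical names made up for illustration, node ids are plain strings, and the empty string is assumed to stand for the Default Partition. Only the nodesPerLabel name and the three event names come from the description above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model of per-partition node tracking (not the Hadoop class). */
class PartitionTrackerSketch {
    // partition name -> ids of the nodes currently tracked in that partition
    private final Map<String, Set<String>> nodesPerLabel = new HashMap<>();
    // node id -> partition the node currently reports (assumption of this sketch)
    private final Map<String, String> nodeToLabel = new HashMap<>();

    /** Models NODE_ADDED: registers the node and creates the partition entry if missing. */
    void addNode(String nodeId, String label) {
        nodeToLabel.put(nodeId, label);
        nodesPerLabel.computeIfAbsent(label, k -> new HashSet<>()).add(nodeId);
    }

    /** Models NODE_LABELS_UPDATE: drops the node from its old partition only. */
    void updateLabel(String nodeId, String newLabel) {
        String oldLabel = nodeToLabel.put(nodeId, newLabel);
        Set<String> oldNodes = nodesPerLabel.get(oldLabel);
        if (oldNodes != null) {
            oldNodes.remove(nodeId);
        }
        // The new partition entry is deliberately NOT created here.
    }

    /** Models NODE_REMOVED with the guard: skip the per-partition removal if the partition is absent. */
    boolean removeNode(String nodeId) {
        String label = nodeToLabel.remove(nodeId);
        if (label == null) {
            return false; // node was never tracked
        }
        Set<String> nodes = nodesPerLabel.get(label);
        if (nodes != null) { // without this check, nodes.remove(...) would throw an NPE
            nodes.remove(nodeId);
        }
        return true;
    }

    boolean hasPartition(String label) {
        return nodesPerLabel.containsKey(label);
    }

    int nodeCount(String label) {
        Set<String> nodes = nodesPerLabel.get(label);
        return nodes == null ? 0 : nodes.size();
    }
}
```

Replaying the sequence from the repro against this model (two nodes in CORE, one relabelled to the Default Partition, then NODE_REMOVED followed by NODE_ADDED) completes without an exception, and the relabelled node ends up in the new partition once NODE_ADDED is processed.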


> RM Crashes when changing Node Label of a Node in Distributed Configuration
> --------------------------------------------------------------------------
>
>                 Key: YARN-11417
>                 URL: https://issues.apache.org/jira/browse/YARN-11417
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.3.3
>            Reporter: Prabhu Joseph
>            Assignee: Prabhu Joseph
>            Priority: Minor
>
> RM Crashes when changing Node Label of a Node in Distributed Configuration.
> {code:java}
> 2023-01-11 16:25:50,986 ERROR org.apache.hadoop.yarn.event.EventDispatcher (SchedulerEventDispatcher:Event Processor): Error in handling event type NODE_REMOVED to the Event Dispatcher
> java.lang.NullPointerException
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.ClusterNodeTracker.removeNode(ClusterNodeTracker.java:194)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.removeNode(CapacityScheduler.java:2145)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1833)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:171)
>         at org.apache.hadoop.yarn.event.EventDispatcher$EventProcessor.run(EventDispatcher.java:83)
>         at java.lang.Thread.run(Thread.java:750)
> {code}
> *Repro*
> 1. Two NodeManagers with CORE Node Label
> {code:java}
> yarn.nodemanager.node-labels.provider.configured-node-partition=CORE
> yarn.node-labels.enabled = true
> yarn.node-labels.configuration-type = distributed
> yarn.nodemanager.node-labels.provider = config
> {code}
> 2. Remove the Node Label from one of the nodes so that it falls into the Default Partition, then restart its NodeManager.
>  


