Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2016/01/05 23:04:40 UTC
[jira] [Commented] (YARN-4546) ResourceManager crash due to scheduling opportunity overflow

    [ https://issues.apache.org/jira/browse/YARN-4546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15083927#comment-15083927 ] 

Jason Lowe commented on YARN-4546:
----------------------------------

When the overflow occurs, the RM crashes with a stack trace like this:
{noformat}
2015-12-26 20:18:39,731 [ResourceManager Event Processor] FATAL resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
java.lang.IllegalArgumentException: count cannot be negative: -2147483648
        at com.google.common.base.Preconditions.checkArgument(Preconditions.java:115)
        at com.google.common.collect.Multisets.checkNonnegative(Multisets.java:943)
        at com.google.common.collect.AbstractMapBasedMultiset.setCount(AbstractMapBasedMultiset.java:277)
        at com.google.common.collect.HashMultiset.setCount(HashMultiset.java:34)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.addSchedulingOpportunity(SchedulerApplicationAttempt.java:485)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.assignContainers(LeafQueue.java:872)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:586)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:447)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1019)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1061)
        at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:115)
        at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:682)
        at java.lang.Thread.run(Thread.java:745)
2015-12-26 20:18:39,732 [ResourceManager Event Processor] INFO resourcemanager.ResourceManager: Exiting, bbye..
{noformat}

In this particular case, the resource request went unsatisfied for a long time because the application used node labels and had blacklisted every node with that label.  At that point no node in the cluster could satisfy the request: each node either lacked the label or was blacklisted.  So the resource request kept accumulating scheduling opportunities until the count overflowed, wrapping to the negative value shown in the trace above.
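
The negative count in the trace is consistent with ordinary Java int wraparound: adding 1 to Integer.MAX_VALUE yields Integer.MIN_VALUE (-2147483648), which Guava's setCount then rejects. The following is only a minimal sketch of that arithmetic and of one possible guard (clamping at the maximum). The class and method names are hypothetical and are not taken from the YARN code base, and this is not necessarily how the issue was or should be fixed.

```java
public class SchedulingOpportunityOverflow {

    // Hypothetical helper: a plain int increment, which silently wraps
    // from Integer.MAX_VALUE to Integer.MIN_VALUE.
    static int naiveIncrement(int count) {
        return count + 1;
    }

    // Hypothetical helper: one possible guard that stops counting at
    // Integer.MAX_VALUE instead of wrapping to a negative value.
    static int saturatingIncrement(int count) {
        return count == Integer.MAX_VALUE ? Integer.MAX_VALUE : count + 1;
    }

    public static void main(String[] args) {
        int count = Integer.MAX_VALUE;
        System.out.println(naiveIncrement(count));      // prints -2147483648
        System.out.println(saturatingIncrement(count)); // prints 2147483647
    }
}
```

With a clamp like this, a request that can never be satisfied would stop accumulating opportunities at the maximum rather than producing a negative count.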

> ResourceManager crash due to scheduling opportunity overflow
> ------------------------------------------------------------
>
>                 Key: YARN-4546
>                 URL: https://issues.apache.org/jira/browse/YARN-4546
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.1
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>
> If a resource request stays unsatisfied long enough, the scheduling opportunities count for the request can overflow and cause an RM crash.


