You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Zephyr Guo (JIRA)" <ji...@apache.org> on 2016/09/24 18:22:20 UTC

[jira] [Reopened] (YARN-4743) ResourceManager crash because TimSort

     [ https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zephyr Guo reopened YARN-4743:
------------------------------

Sorry for some reason did not deal with this issue for a long time. Now pick up it, we have found the bug.
Reopen the issue.

> ResourceManager crash because TimSort
> -------------------------------------
>
>                 Key: YARN-4743
>                 URL: https://issues.apache.org/jira/browse/YARN-4743
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.4
>            Reporter: Zephyr Guo
>            Assignee: Yufei Gu
>         Attachments: YARN-4743-cdh5.4.7.patch
>
>
> {code}
> 2016-02-26 14:08:50,821 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>          at java.util.TimSort.mergeHi(TimSort.java:868)
>          at java.util.TimSort.mergeAt(TimSort.java:485)
>          at java.util.TimSort.mergeCollapse(TimSort.java:410)
>          at java.util.TimSort.sort(TimSort.java:214)
>          at java.util.TimSort.sort(TimSort.java:173)
>          at java.util.Arrays.sort(Arrays.java:659)
>          at java.util.Collections.sort(Collections.java:217)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>          at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>          at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resouce}} while we are sorting {{runnableApps}}.
> {code:title=FSLeafQueue.java}
>     Comparator<Schedulable> comparator = policy.getComparator();
>     writeLock.lock();
>     try {
>       Collections.sort(runnableApps, comparator);
>     } finally {
>       writeLock.unlock();
>     }
>     readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
> ......
>           s1.getResourceUsage(), minShare1);
>       boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>           s2.getResourceUsage(), minShare2);
>       minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>           / Resources.max(RESOURCE_CALCULATOR, null, minShare1, ONE).getMemory();
>       minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>           / Resources.max(RESOURCE_CALCULATOR, null, minShare2, ONE).getMemory();
> ......
> {code}
> {{getResourceUsage}} will return current Resource. The current Resource is unstable. 
> {code:title=FSAppAttempt.java}
> @Override
>   public Resource getResourceUsage() {
>     // Here the getPreemptedResources() always return zero, except in
>     // a preemption round
>     return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
>  public Resource getCurrentConsumption() {
>     return currentConsumption;
>   }
> // This method may modify current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ......
>     Resources.addTo(currentConsumption, rmContainer.getContainer()
>       .getResource());
> ......
>   }
> {code}
> I suggest that use stable Resource in comparator.
> Is there something i think wrong?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org