Posted to yarn-issues@hadoop.apache.org by "Zephyr Guo (JIRA)" <ji...@apache.org> on 2016/03/01 04:05:18 UTC

[jira] [Commented] (YARN-4743) ResourceManager crash because TimSort

    [ https://issues.apache.org/jira/browse/YARN-4743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15173146#comment-15173146 ] 

Zephyr Guo commented on YARN-4743:
----------------------------------

{quote}
I think that DRF comparator is not transitive with my intuition.
{quote}
I think that's right, [~ozawa].

FairShareComparator uses {{getResourceUsage()}}, {{getDemand()}}, and {{getMinShare()}} to implement {{compare(Schedulable s1, Schedulable s2)}}. These three methods must return the same Resource values for the whole duration of a sort; otherwise transitivity can break.
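
To make the failure mode concrete, here is a minimal sketch of how a comparator that reads live state can contradict itself. The {{App}} class, its {{usedMemory}} field, and {{BY_LIVE_USAGE}} are hypothetical stand-ins, not YARN classes, and the concurrent update is simulated by a plain assignment between comparisons rather than a real second thread.

```java
import java.util.Comparator;

public class UnstableComparatorDemo {
    /** Hypothetical stand-in for a Schedulable whose usage can change at any time. */
    static final class App {
        final String name;
        volatile long usedMemory;

        App(String name, long usedMemory) {
            this.name = name;
            this.usedMemory = usedMemory;
        }
    }

    /** Reads the live value on every call, like a comparator built on getResourceUsage(). */
    static final Comparator<App> BY_LIVE_USAGE =
            Comparator.comparingLong(app -> app.usedMemory);

    /** Returns true when a < b and b < c were observed, but a < c does not hold afterwards. */
    static boolean observesTransitivityViolation() {
        App a = new App("a", 1);
        App b = new App("b", 2);
        App c = new App("c", 3);

        boolean aBeforeB = BY_LIVE_USAGE.compare(a, b) < 0;
        boolean bBeforeC = BY_LIVE_USAGE.compare(b, c) < 0;

        // Simulates another thread allocating resources to "a" in the middle of a sort.
        a.usedMemory = 10;

        boolean aBeforeC = BY_LIVE_USAGE.compare(a, c) < 0;
        return aBeforeB && bBeforeC && !aBeforeC;
    }

    public static void main(String[] args) {
        System.out.println(observesTransitivityViolation()); // prints true
    }
}
```

A sort that sees answers like these may detect the inconsistency and throw the {{IllegalArgumentException}} shown in the stack trace, although whether it does depends on the data and on timing.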

How about adding a snapshot feature to Schedulable? We would snapshot each Schedulable before sorting, then sort using the snapshotted Resource in the comparator. The sorted result will be very close to the real situation, because sorting is very fast.
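
A rough sketch of the snapshot idea, under the assumption that copying one value per element before the sort is cheap enough. {{App}}, {{usedMemory}}, and {{sortBySnapshot}} are hypothetical names for illustration only; a real patch would have to snapshot the Resource values that the scheduling policy's comparator actually reads.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Map;

public class SnapshotSortDemo {
    /** Hypothetical stand-in for a Schedulable with a mutable usage value. */
    static final class App {
        final String name;
        volatile long usedMemory;

        App(String name, long usedMemory) {
            this.name = name;
            this.usedMemory = usedMemory;
        }
    }

    /** Copies each app's usage once, then sorts using only the copied values. */
    static void sortBySnapshot(List<App> apps) {
        Map<App, Long> snapshot = new IdentityHashMap<>();
        for (App app : apps) {
            snapshot.put(app, app.usedMemory);
        }
        // The comparator never touches live state, so its answers stay consistent
        // even if usedMemory changes while the sort is running.
        Comparator<App> byFrozenUsage = Comparator.comparingLong(app -> snapshot.get(app));
        apps.sort(byFrozenUsage);
    }

    /** Sorts three sample apps and returns their names in sorted order. */
    static List<String> demo() {
        List<App> apps = new ArrayList<>();
        apps.add(new App("a", 30));
        apps.add(new App("b", 10));
        apps.add(new App("c", 20));
        sortBySnapshot(apps);

        List<String> names = new ArrayList<>();
        for (App app : apps) {
            names.add(app.name);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints [b, c, a]
    }
}
```

The trade-off is that the order reflects usage at snapshot time rather than at the moment each comparison runs, which should be acceptable if the sort is short.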

> ResourceManager crash because TimSort
> -------------------------------------
>
>                 Key: YARN-4743
>                 URL: https://issues.apache.org/jira/browse/YARN-4743
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.4
>            Reporter: Zephyr Guo
>
> {code}
> 2016-02-26 14:08:50,821 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
>          at java.util.TimSort.mergeHi(TimSort.java:868)
>          at java.util.TimSort.mergeAt(TimSort.java:485)
>          at java.util.TimSort.mergeCollapse(TimSort.java:410)
>          at java.util.TimSort.sort(TimSort.java:214)
>          at java.util.TimSort.sort(TimSort.java:173)
>          at java.util.Arrays.sort(Arrays.java:659)
>          at java.util.Collections.sort(Collections.java:217)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.assignContainer(FSLeafQueue.java:316)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:240)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.attemptScheduling(FairScheduler.java:1091)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.nodeUpdate(FairScheduler.java:989)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1185)
>          at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>          at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:684)
>          at java.lang.Thread.run(Thread.java:745)
> 2016-02-26 14:08:50,822 INFO org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Exiting, bbye..
> {code}
> Actually, this issue was found in 2.6.0-cdh5.4.7.
> I think the cause is that we modify {{Resource}} while we are sorting {{runnableApps}}.
> {code:title=FSLeafQueue.java}
>     Comparator<Schedulable> comparator = policy.getComparator();
>     writeLock.lock();
>     try {
>       Collections.sort(runnableApps, comparator);
>     } finally {
>       writeLock.unlock();
>     }
>     readLock.lock();
> {code}
> {code:title=FairShareComparator}
> public int compare(Schedulable s1, Schedulable s2) {
> ......
>           s1.getResourceUsage(), minShare1);
>       boolean s2Needy = Resources.lessThan(RESOURCE_CALCULATOR, null,
>           s2.getResourceUsage(), minShare2);
>       minShareRatio1 = (double) s1.getResourceUsage().getMemory()
>           / Resources.max(RESOURCE_CALCULATOR, null, minShare1, ONE).getMemory();
>       minShareRatio2 = (double) s2.getResourceUsage().getMemory()
>           / Resources.max(RESOURCE_CALCULATOR, null, minShare2, ONE).getMemory();
> ......
> {code}
> {{getResourceUsage}} returns the current Resource, which is not stable: it can change while the sort is running.
> {code:title=FSAppAttempt.java}
> @Override
>   public Resource getResourceUsage() {
>     // Here the getPreemptedResources() always return zero, except in
>     // a preemption round
>     return Resources.subtract(getCurrentConsumption(), getPreemptedResources());
>   }
> {code}
> {code:title=SchedulerApplicationAttempt}
> public Resource getCurrentConsumption() {
>   return currentConsumption;
> }
>
> // This method may modify the current Resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
> ......
>   Resources.addTo(currentConsumption, rmContainer.getContainer()
>       .getResource());
> ......
> }
> {code}
> I suggest using a stable Resource in the comparator.
> Is there anything wrong with my reasoning?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)