Posted to yarn-issues@hadoop.apache.org by "Peter Bacsko (Jira)" <ji...@apache.org> on 2021/03/08 18:55:00 UTC

[jira] [Comment Edited] (YARN-10178) Global Scheduler async thread crash caused by 'Comparison method violates its general contract'

    [ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17297642#comment-17297642 ] 

Peter Bacsko edited comment on YARN-10178 at 3/8/21, 6:54 PM:
--------------------------------------------------------------

[~zhuqi] this is a tricky patch, and I still have to understand what's going on. We might ask [~wangda] to look at it again, because I'm not that familiar with the code that has been modified.

Having said that, I have some recommendations:
1. {{private final static Random RANDOM = new Random(System.currentTimeMillis());}}
Is there a reason why this is static? {{RANDOM}} is only used in the test.
Another problem: if the test fails, we don't see the random seed that was used for initialization, so the failure is not reproducible.
I suggest rewriting the test like this:
{noformat}
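long seed = System.nanoTime();  // I think nanoTime is better
Random random = new Random(seed);  // create the generator from the recorded seed

try {
  // .. test code, using "random" ..
} catch (AssertionFailedError e) {
  LOG.error("Test failed, seed = {}", seed);
  LOG.error(e);
  throw e;
}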
{noformat}

That way we can at least check the logs for the seed. Alternatively, rethrow the exception with a modified message, or wrap it in a different exception whose message contains the seed. The point is that the seed should be visible.
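
As a minimal sketch of the "wrap it in a different exception" variant: the class and method names below are hypothetical and not taken from the patch, and the test body is only a stand-in. It catches {{java.lang.AssertionError}}, which is what JUnit 4 assertions throw, and rethrows with the seed in the message.

```java
import java.util.Random;
import java.util.function.Consumer;

// Hypothetical names throughout; this is not code from the patch.
class SeededTestSketch {

  // Stand-in for the randomized test body.
  static void runRandomizedChecks(Random random) {
    int value = random.nextInt(100);
    if (value < 0 || value >= 100) {
      throw new AssertionError("value out of range: " + value);
    }
  }

  // Runs the body with a Random built from the seed. On an assertion
  // failure, rethrows with the seed in the message and the original
  // error attached as the cause.
  static void runWithSeed(long seed, Consumer<Random> body) {
    try {
      body.accept(new Random(seed));
    } catch (AssertionError e) {
      throw new AssertionError(
          "Test failed, seed = " + seed + ": " + e.getMessage(), e);
    }
  }

  public static void main(String[] args) {
    long seed = System.nanoTime();
    runWithSeed(seed, SeededTestSketch::runRandomizedChecks);
  }
}
```

To reproduce a failure, the seed from the message can be passed to {{runWithSeed}} directly instead of {{System.nanoTime()}}.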

2. This sanity check only takes effect if the JVM is started with "-ea":
{noformat}
	    // sanity check
	    assert queueNames != null && priorities != null && utilizations != null
	        && queueNames.length > 0 && queueNames.length == priorities.length
	        && priorities.length == utilizations.length;
{noformat}
I think this should be converted to a normal JUnit assertion or just removed.
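
For illustration, a sketch of the same check written so that it runs regardless of "-ea". The parameter types ({{String[]}}, {{int[]}}, {{float[]}}) and the class name are assumptions, since the snippet above does not show them. In a JUnit test the same conditions could be expressed with {{Assert.assertNotNull}} and {{Assert.assertEquals}}; plain exceptions are used here only to keep the example self-contained.

```java
import java.util.Objects;

// Hypothetical helper; parameter types are assumed, not taken from the patch.
class SanityCheckSketch {

  // Performs the same checks as the assert statement, but always executes,
  // whether or not the JVM was started with -ea.
  static void checkInputs(String[] queueNames, int[] priorities,
      float[] utilizations) {
    Objects.requireNonNull(queueNames, "queueNames");
    Objects.requireNonNull(priorities, "priorities");
    Objects.requireNonNull(utilizations, "utilizations");
    if (queueNames.length == 0) {
      throw new IllegalArgumentException("queueNames must not be empty");
    }
    if (queueNames.length != priorities.length
        || priorities.length != utilizations.length) {
      throw new IllegalArgumentException("array lengths differ: "
          + queueNames.length + ", " + priorities.length + ", "
          + utilizations.length);
    }
  }

  public static void main(String[] args) {
    checkInputs(new String[] {"a", "b"}, new int[] {1, 2},
        new float[] {0.5f, 0.7f});
  }
}
```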


> Global Scheduler async thread crash caused by 'Comparison method violates its general contract'
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-10178
>                 URL: https://issues.apache.org/jira/browse/YARN-10178
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler
>    Affects Versions: 3.2.1
>            Reporter: tuyu
>            Assignee: Qi Zhu
>            Priority: Major
>         Attachments: YARN-10178.001.patch, YARN-10178.002.patch, YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Stack trace of the Global Scheduler async thread crash:
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract!
>         at java.util.TimSort.mergeHi(TimSort.java:899)
>         at java.util.TimSort.mergeAt(TimSort.java:516)
>         at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
>         at java.util.TimSort.sort(TimSort.java:254)
>         at java.util.Arrays.sort(Arrays.java:1512)
>         at java.util.ArrayList.sort(ArrayList.java:1462)
>         at java.util.Collections.sort(Collections.java:177)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> In Java 8, Arrays.sort uses TimSort by default for object arrays, and TimSort relies on the comparison satisfying a few requirements:
> {code:java}
> // 1. sgn(x.compareTo(y)) == -sgn(y.compareTo(x))
> // 2. x > y and y > z  -->  x > z
> // 3. x.compareTo(y) == 0  -->  sgn(x.compareTo(z)) == sgn(y.compareTo(z))
> {code}
> If the elements being sorted do not satisfy these requirements, TimSort can throw 'java.lang.IllegalArgumentException'.
> Looking at the PriorityUtilizationQueueOrderingPolicy.compare function, we can see that the Capacity Scheduler compares queues using these queue resource usage values:
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In the Capacity Scheduler, the Global Scheduler AsyncThread uses PriorityUtilizationQueueOrderingPolicy to choose the queue to assign a container to, constructs a CSAssignment, and uses the submitResourceCommitRequest function to add the CSAssignment to the backlogs.
> ResourceCommitterService then calls tryCommit on this CSAssignment. Looking at the tryCommit function, this is where the queue resource usage gets updated:
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
>     boolean updatePending) {
>   long commitStart = System.nanoTime();
>   ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode> request =
>       (ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode>) r;
>  
>   ...
>   boolean isSuccess = false;
>   if (attemptId != null) {
>     FiCaSchedulerApp app = getApplicationAttempt(attemptId);
>     // Required sanity check for attemptId - when async-scheduling enabled,
>     // proposal might be outdated if AM failover just finished
>     // and proposal queue was not be consumed in time
>     if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
>       if (app.accept(cluster, request, updatePending)
>           && app.apply(cluster, request, updatePending)) { // applies the resource change
>         ...
>       }
>     }
>   }
>   return isSuccess;
> }
> {code}
> {code:java}
> public boolean apply(Resource cluster, ResourceCommitRequest<FiCaSchedulerApp,
>     FiCaSchedulerNode> request, boolean updatePending) {
> ...
>     if (!reReservation) {
>         getCSLeafQueue().apply(cluster, request); 
>     }
> ...
> }
> {code}
> LeafQueue.apply invokes allocateResource:
> {code:java}
> void allocateResource(Resource clusterResource,
>     Resource resource, String nodePartition) {
>   try {
>     writeLock.lock(); // only the leaf queue's lock is taken
>     queueUsage.incUsed(nodePartition, resource);
>  
>     ++numContainers;
>  
>     CSQueueUtils.updateQueueStatistics(resourceCalculator, clusterResource,
>         this, labelManager, nodePartition); // queue statistics are updated here
>   } finally {
>     writeLock.unlock();
>   }
> }
> {code}
> We found that ResourceCommitterService only takes the leaf queue's lock when updating queue statistics, while the AsyncThread, in sortAndGetChildrenAllocationIterator, only takes the read lock of the parent (root) queue:
> {code:java}
> // ParentQueue.java
> private Iterator<CSQueue> sortAndGetChildrenAllocationIterator(
>       String partition) {
>     try {
>       readLock.lock();
>       return queueOrderingPolicy.getAssignmentIterator(partition);
>     } finally {
>       readLock.unlock();
>     }
>   }
> {code}
> So if multiple async threads compare queue usage statistics while ResourceCommitterService concurrently applies changes to the leaf queue statistics, the values seen by the comparator can change in the middle of a sort. This breaks TimSort's requirements and causes the thread to crash.
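
The race described above comes from a comparator reading values that another thread may update during the sort. One common way to avoid this is to copy the compared values into immutable snapshots first and sort the snapshots. The sketch below illustrates that idea only: all class and field names are hypothetical, a single {{long}} stands in for the real usage values, and it is not claimed to be what any of the attached patches does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical names throughout; a simplified model, not scheduler code.
class SnapshotSortSketch {

  // Stand-in for a queue whose usage may be updated by another thread.
  static final class LiveQueue {
    final String name;
    final AtomicLong used;

    LiveQueue(String name, long used) {
      this.name = name;
      this.used = new AtomicLong(used);
    }
  }

  // Immutable copy of the value the comparator reads.
  static final class QueueSnapshot {
    final LiveQueue queue;
    final long used;

    QueueSnapshot(LiveQueue queue) {
      this.queue = queue;
      this.used = queue.used.get(); // read once, before sorting starts
    }
  }

  // Sorts by the snapshotted value, so concurrent updates to LiveQueue.used
  // cannot change the outcome of comparisons during the sort.
  static List<LiveQueue> sortByUsage(List<LiveQueue> queues) {
    List<QueueSnapshot> snapshots = new ArrayList<>(queues.size());
    for (LiveQueue q : queues) {
      snapshots.add(new QueueSnapshot(q));
    }
    snapshots.sort(Comparator.comparingLong((QueueSnapshot s) -> s.used));
    List<LiveQueue> sorted = new ArrayList<>(snapshots.size());
    for (QueueSnapshot s : snapshots) {
      sorted.add(s.queue);
    }
    return sorted;
  }

  public static void main(String[] args) {
    List<LiveQueue> sorted = sortByUsage(Arrays.asList(
        new LiveQueue("a", 5), new LiveQueue("b", 1), new LiveQueue("c", 3)));
    for (LiveQueue q : sorted) {
      System.out.println(q.name);
    }
  }
}
```

The resulting order may be slightly stale relative to the live values, but the comparison stays consistent for the duration of the sort, which is what TimSort needs.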


