You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Qi Zhu (Jira)" <ji...@apache.org> on 2021/03/04 13:39:00 UTC
[jira] [Commented] (YARN-10178) Global Scheduler async thread crash
caused by 'Comparison method violates its general contract'
[ https://issues.apache.org/jira/browse/YARN-10178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295284#comment-17295284 ]
Qi Zhu commented on YARN-10178:
-------------------------------
[~pbacsko]
Could help review this issue when you are free , this is a problem waited for solution for a long time.:D
Thanks a lot.
> Global Scheduler async thread crash caused by 'Comparison method violates its general contract'
> -----------------------------------------------------------------------------------------------
>
> Key: YARN-10178
> URL: https://issues.apache.org/jira/browse/YARN-10178
> Project: Hadoop YARN
> Issue Type: Bug
> Components: capacity scheduler
> Affects Versions: 3.2.1
> Reporter: tuyu
> Assignee: Qi Zhu
> Priority: Major
> Attachments: YARN-10178.001.patch, YARN-10178.002.patch, YARN-10178.003.patch, YARN-10178.004.patch, YARN-10178.005.patch
>
>
> Global Scheduler Async Thread crash stack
> {code:java}
> ERROR org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Received RMFatalEvent of type CRITICAL_THREAD_CRASH, caused by a critical thread, Thread-6066574, that exited unexpectedly: java.lang.IllegalArgumentException: Comparison method violates its general contract! at java.util.TimSort.mergeHi(TimSort.java:899)
> at java.util.TimSort.mergeAt(TimSort.java:516)
> at java.util.TimSort.mergeForceCollapse(TimSort.java:457)
> at java.util.TimSort.sort(TimSort.java:254)
> at java.util.Arrays.sort(Arrays.java:1512)
> at java.util.ArrayList.sort(ArrayList.java:1462)
> at java.util.Collections.sort(Collections.java:177)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.policy.PriorityUtilizationQueueOrderingPolicy.getAssignmentIterator(PriorityUtilizationQueueOrderingPolicy.java:221)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.sortAndGetChildrenAllocationIterator(ParentQueue.java:777)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainersToChildQueues(ParentQueue.java:791)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue.assignContainers(ParentQueue.java:623)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateOrReserveNewContainers(CapacityScheduler.java:1635)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainerOnSingleNode(CapacityScheduler.java:1629)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1732)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.allocateContainersToNode(CapacityScheduler.java:1481)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.schedule(CapacityScheduler.java:569)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler$AsyncScheduleThread.run(CapacityScheduler.java:616)
> {code}
> JAVA 8 Arrays.sort default use timsort algo, and timsort has few require
> {code:java}
> 1.x.compareTo(y) != y.compareTo(x)
> 2.x>y,y>z --> x > z
> 3.x=y, x.compareTo(z) == y.compareTo(z)
> {code}
> if not Arrays paramters not satify this require,TimSort will throw 'java.lang.IllegalArgumentException'
> look at PriorityUtilizationQueueOrderingPolicy.compare function,we will know Capacity Scheduler use this these queue resource usage to compare
> {code:java}
> AbsoluteUsedCapacity
> UsedCapacity
> ConfiguredMinResource
> AbsoluteCapacity
> {code}
> In Capacity Scheduler Global Scheduler AsyncThread use PriorityUtilizationQueueOrderingPolicy function to choose queue to assign container,and construct a CSAssignment struct, and use submitResourceCommitRequest function add CSAssignment to backlogs
> ResourceCommitterService will tryCommit this CSAssignment,look tryCommit function,there will update queue resource usage
> {code:java}
> public boolean tryCommit(Resource cluster, ResourceCommitRequest r,
> boolean updatePending) {
> long commitStart = System.nanoTime();
> ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode> request =
> (ResourceCommitRequest<FiCaSchedulerApp, FiCaSchedulerNode>) r;
>
> ...
> boolean isSuccess = false;
> if (attemptId != null) {
> FiCaSchedulerApp app = getApplicationAttempt(attemptId);
> // Required sanity check for attemptId - when async-scheduling enabled,
> // proposal might be outdated if AM failover just finished
> // and proposal queue was not be consumed in time
> if (app != null && attemptId.equals(app.getApplicationAttemptId())) {
> if (app.accept(cluster, request, updatePending)
> && app.apply(cluster, request, updatePending)) { // apply this resource
> ...
> }
> }
> }
> return isSuccess;
> }
> }
> {code}
> {code:java}
> public boolean apply(Resource cluster, ResourceCommitRequest<FiCaSchedulerApp,
> FiCaSchedulerNode> request, boolean updatePending) {
> ...
> if (!reReservation) {
> getCSLeafQueue().apply(cluster, request);
> }
> ...
> }
> {code}
> LeafQueue.apply invok allocateResource
> {code:java}
> void allocateResource(Resource clusterResource,
> Resource resource, String nodePartition) {
> try {
> writeLock.lock(); // only lock leaf queue lock
> queueUsage.incUsed(nodePartition, resource);
>
> ++numContainers;
>
> CSQueueUtils.updateQueueStatistics(resourceCalculator, clusterResource,
> this, labelManager, nodePartition); // there will update queue statistics
> } finally {
> writeLock.unlock();
> }
> }
> {code}
> we found ResourceCommitterService will only lock leaf queue to update queue statistics, but AsyncThread use sortAndGetChildrenAllocationIterator only lock queue root queue lock
> {code:java}
> ParentQueue.java
> private Iterator<CSQueue> sortAndGetChildrenAllocationIterator(
> String partition) {
> try {
> readLock.lock();
> return queueOrderingPolicy.getAssignmentIterator(partition);
> } finally {
> readLock.unlock();
> }
> }
> {code}
> so if multi async thread compare queue usage statistics and ResourceCommitterService apply leaf queue change statistics concurrent, will break TimSort algo required, and cause thread crash
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org