Posted to yarn-issues@hadoop.apache.org by "Wilfred Spiegelenburg (JIRA)" <ji...@apache.org> on 2018/10/24 01:45:00 UTC

[jira] [Commented] (YARN-8436) FSParentQueue: Comparison method violates its general contract

    [ https://issues.apache.org/jira/browse/YARN-8436?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16661579#comment-16661579 ] 

Wilfred Spiegelenburg commented on YARN-8436:
---------------------------------------------

Sorry [~imstefanlee], I did not see your comment earlier.
You are correct that NODE_UPDATE is handled under a lock and that we cannot process two of them at the same time. However, an update of a different object could still change the child queue usage. That change is not directly related to the node update and is therefore processed under different locking. In that case, if the child queue was already placed in the sorted list and we are adding a new queue, it will look as if the sort mechanism is broken, which causes the exception.
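
To illustrate the point, here is a minimal sketch that is not taken from the FairScheduler code. The class MutableKeyDemo, its Queue stand-in and the BY_USAGE comparator are hypothetical names. It only shows that a comparator reading a mutable usage value can give contradictory answers for the same pair of objects when the value changes between two comparisons, which is the kind of inconsistency TimSort reports.

```java
import java.util.Comparator;
import java.util.concurrent.atomic.AtomicLong;

public class MutableKeyDemo {
    /** Hypothetical stand-in for a child queue; not the real FSQueue class. */
    static final class Queue {
        final String name;
        final AtomicLong usage;

        Queue(String name, long usage) {
            this.name = name;
            this.usage = new AtomicLong(usage);
        }
    }

    /** Orders queues by current usage; reads the live value on every call. */
    static final Comparator<Queue> BY_USAGE =
        Comparator.comparingLong(q -> q.usage.get());

    /** Returns the sign of compare(a, b) before and after a's usage changes. */
    public static int[] compareBeforeAndAfterUpdate() {
        Queue a = new Queue("a", 10);
        Queue b = new Queue("b", 20);
        int before = Integer.signum(BY_USAGE.compare(a, b));
        // Simulates an update that arrives through a path not covered by the sort's lock.
        a.usage.set(30);
        int after = Integer.signum(BY_USAGE.compare(a, b));
        return new int[] {before, after};
    }

    public static void main(String[] args) {
        int[] r = compareBeforeAndAfterUpdate();
        System.out.println("before=" + r[0] + " after=" + r[1]); // before=-1 after=1
    }
}
```

The comparator is consistent for any fixed set of values. The ordering only flips because the value it reads changed while the comparison results were still in use.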

> FSParentQueue: Comparison method violates its general contract
> --------------------------------------------------------------
>
>                 Key: YARN-8436
>                 URL: https://issues.apache.org/jira/browse/YARN-8436
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 3.1.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Minor
>             Fix For: 3.2.0
>
>         Attachments: YARN-8436.001.patch, YARN-8436.002.patch, YARN-8436.003.patch
>
>
> The ResourceManager can fail while sorting queues if an update comes in:
> {code:java}
> FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type NODE_UPDATE to the scheduler
> java.lang.IllegalArgumentException: Comparison method violates its general contract!
> 	at java.util.TimSort.mergeLo(TimSort.java:777)
> 	at java.util.TimSort.mergeAt(TimSort.java:514)
> ...
> 	at java.util.Collections.sort(Collections.java:175)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.assignContainer(FSParentQueue.java:223){code}
> The reason it breaks is a change in the sorted object itself. 
> This is why it fails:
>  * an update from a node comes in as a heartbeat.
>  * the update triggers a check to see if we can assign a container on the node.
>  * walk over the queue hierarchy to find a queue to assign a container to: top down.
>  * for each parent queue we sort the child queues in {{assignContainer}} to decide which queue to descend into.
>  * we lock the parent queue while sorting to prevent changes, but we do not lock the child queues that we are sorting.
> If a different node update changes a child queue during this sorting, we allow that. This means that the objects we are trying to sort might now be out of order. That causes the issue with the comparator. The comparator itself is not broken.
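
One general way to keep a sort stable against this kind of change is to sort on a snapshot of the key instead of the live value. The sketch below is an assumption for illustration only and is not a description of the attached patches. SnapshotSortDemo, Queue and Snapshot are hypothetical names and do not exist in Hadoop.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class SnapshotSortDemo {
    /** Hypothetical stand-in for a child queue; not the real FSQueue class. */
    static final class Queue {
        final String name;
        final AtomicLong usage;

        Queue(String name, long usage) {
            this.name = name;
            this.usage = new AtomicLong(usage);
        }
    }

    /** Immutable pairing of a queue with the usage value read once before sorting. */
    static final class Snapshot {
        final Queue queue;
        final long usage;

        Snapshot(Queue queue, long usage) {
            this.queue = queue;
            this.usage = usage;
        }
    }

    /** Sorts by the captured usage, so later updates cannot affect the comparator. */
    static List<String> sortedNames(List<Queue> children) {
        List<Snapshot> snapshots = new ArrayList<>(children.size());
        for (Queue q : children) {
            snapshots.add(new Snapshot(q, q.usage.get())); // read each key exactly once
        }
        snapshots.sort(Comparator.comparingLong((Snapshot s) -> s.usage));
        List<String> names = new ArrayList<>(snapshots.size());
        for (Snapshot s : snapshots) {
            names.add(s.queue.name);
        }
        return names;
    }

    /** Builds three sample queues and returns their names ordered by usage. */
    public static List<String> demo() {
        List<Queue> children = new ArrayList<>();
        children.add(new Queue("a", 30));
        children.add(new Queue("b", 10));
        children.add(new Queue("c", 20));
        return sortedNames(children);
    }

    public static void main(String[] args) {
        System.out.println(demo()); // [b, c, a]
    }
}
```

The trade-off is that the resulting order reflects usage at the time of the snapshot, which may already be slightly stale when a container is assigned.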



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org