Posted to yarn-issues@hadoop.apache.org by "Jeongin Ju (Jira)" <ji...@apache.org> on 2021/08/23 07:55:00 UTC

[jira] [Created] (YARN-10892) YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists

Jeongin Ju created YARN-10892:
---------------------------------

             Summary: YARN Preemption Monitor got java.util.ConcurrentModificationException when three or more partitions exists
                 Key: YARN-10892
                 URL: https://issues.apache.org/jira/browse/YARN-10892
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
    Affects Versions: 3.1.2
            Reporter: Jeongin Ju


On our cluster with a large number of NMs, the preemption monitor thread consistently hit java.util.ConcurrentModificationException when specific conditions were met.

The conditions we found are as follows (all 4 must be met):
 # There are at least two non-exclusive partitions besides the default partition (call them partitions X and Y).
 # app1, in a queue belonging to the default partition (call it the 'dev' queue), has borrowed resources from both the X and Y partitions.
 # app2 and app3, submitted to queues belonging to each of the X and Y partitions, are 'PENDING' because the resources are consumed by app1.
 # The preemption monitor can clear the borrowed resources from X or Y when a container of app1 is preempted.

The main problem is that FifoCandidatesSelector.selectCandidates tries to remove a HashMap key (the partition name) while iterating over that HashMap.

Logically this is correct, because we do not need to traverse the same partition again within this 'selectCandidates' call. However, HashMap does not allow keys to be removed from the map directly while it is being iterated.
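
To illustrate the failure mode in isolation, here is a minimal sketch. It is not the YARN code and not the attached patch: the class, the method names, and the map contents are hypothetical stand-ins for the per-partition bookkeeping. It shows a loop that removes keys from a HashMap while iterating its keySet, plus two common ways to avoid the exception (iterating over a snapshot of the keys, or removing through the iterator).

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class PartitionRemovalSketch {

    // Hypothetical data: partition name -> some pending amount.
    // "" stands for the default partition; X and Y are the two extra ones.
    static Map<String, Integer> newPendingByPartition() {
        Map<String, Integer> pending = new HashMap<>();
        pending.put("", 0);
        pending.put("X", 4);
        pending.put("Y", 8);
        return pending;
    }

    // Unsafe: removes the current key from the map inside a for-each over keySet().
    // With three entries, the next call to the iterator's next() detects the change.
    static boolean unsafeRemoveThrows() {
        Map<String, Integer> pending = newPendingByPartition();
        try {
            for (String partition : pending.keySet()) {
                pending.remove(partition);
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }

    // Safe option 1: iterate over a copy of the keys, mutate the original map.
    static Map<String, Integer> removeViaSnapshot() {
        Map<String, Integer> pending = newPendingByPartition();
        for (String partition : new ArrayList<>(pending.keySet())) {
            pending.remove(partition);
        }
        return pending;
    }

    // Safe option 2: remove through the iterator, which keeps its state consistent.
    static Map<String, Integer> removeViaIterator() {
        Map<String, Integer> pending = newPendingByPartition();
        Iterator<String> it = pending.keySet().iterator();
        while (it.hasNext()) {
            it.next();
            it.remove();
        }
        return pending;
    }

    public static void main(String[] args) {
        System.out.println("unsafe loop threw CME: " + unsafeRemoveThrows());
        System.out.println("entries left after snapshot loop: " + removeViaSnapshot().size());
        System.out.println("entries left after iterator loop: " + removeViaIterator().size());
    }
}
```

Which option fits selectCandidates depends on where the removal happens relative to the loop. If the removal is buried in a helper that only sees the map, the snapshot approach is usually the less invasive change.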

I wrote a test case to reproduce the error (testResourceTypesInterQueuePreemptionWithThreePartitions).

We found and patched this on our 3.1.2 cluster, but trunk still seems to have the same problem.

I have attached a patch based on trunk.

 

Thanks!

 
{quote}{{2020-09-07 12:20:37,105 ERROR monitor.SchedulingMonitor (SchedulingMonitor.java:run(116)) - Exception raised while executing preemption checker, skip this run..., exception=
java.util.ConcurrentModificationException
        at java.util.HashMap$HashIterator.nextNode(HashMap.java:1437)
        at java.util.HashMap$KeyIterator.next(HashMap.java:1461)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.FifoCandidatesSelector.selectCandidates(FifoCandidatesSelector.java:105)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.containerBasedPreemptOrKill(ProportionalCapacityPreemptionPolicy.java:489)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy.editSchedule(ProportionalCapacityPreemptionPolicy.java:320)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor.invokePolicy(SchedulingMonitor.java:99)
        at org.apache.hadoop.yarn.server.resourcemanager.monitor.SchedulingMonitor$PolicyInvoker.run(SchedulingMonitor.java:111)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)}}

{quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
