You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Peter Bacsko (Jira)" <ji...@apache.org> on 2021/03/05 08:57:00 UTC
[jira] [Comment Edited] (YARN-10672) All testcases in TestReservations are flaky

    [ https://issues.apache.org/jira/browse/YARN-10672?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295881#comment-17295881 ] 

Peter Bacsko edited comment on YARN-10672 at 3/5/21, 8:56 AM:
--------------------------------------------------------------

The solution is much straightforward than mine in YARN-10447. Actually we might consider applying this to TestLeafQueue with undoing my changes, because that's more complicated (I had no patience to go deeper with Mockito internal behavior, I just thought well, disable that thread and that's enough).


was (Author: pbacsko):
The solution is much straightforward than mine in YARN-10447. Actually we might consider applying this to TestLeafQueue as well, while undoing my changes, because that's more complicated (I had no patience to go deeper with Mockito internal behavior, I just thought well, disable that thread and that's enough).

> All testcases in TestReservations are flaky
> -------------------------------------------
>
>                 Key: YARN-10672
>                 URL: https://issues.apache.org/jira/browse/YARN-10672
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Assignee: Szilard Nemeth
>            Priority: Major
>         Attachments: Screenshot 2021-03-04 at 21.34.18.png, Screenshot 2021-03-04 at 22.06.20.png, Screenshot-mockitostubbing1-2021-03-04 at 22.34.01.png, Screenshot-mockitostubbing2-2021-03-04 at 22.34.12.png, YARN-10672-debuglogs.patch, YARN-10672.001.patch
>
>
> All testcases in TestReservations are flaky
> Running a particular test in TestReservations 100 times never passes all the time.
>  For example, let's run testReservationNoContinueLook 100 times. For me, it produced 39 failed and 61 passed results.
>  Sometimes just 1 out of 100 runs is failed.
>  Screenshot is attached.
> Stacktrace:
> {code:java}
> java.lang.AssertionError: 
> Expected :2048
> Actual   :0
> <Click to see difference>
> at org.junit.Assert.fail(Assert.java:89)
> at org.junit.Assert.failNotEquals(Assert.java:835)
> at org.junit.Assert.assertEquals(Assert.java:647)
> at org.junit.Assert.assertEquals(Assert.java:633)
> at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:642)
> {code}
> The test fails here:
> {code:java}
>  // Start testing...
> // Only AM
> TestUtils.applyResourceCommitRequest(clusterResource,
>     a.assignContainers(clusterResource, node_0,
>         new ResourceLimits(clusterResource),
>         SchedulingMode.RESPECT_PARTITION_EXCLUSIVITY), nodes, apps);
> assertEquals(2 * GB, a.getUsedResources().getMemorySize());
> {code}
> With some debugging (patch attached), I realized that sometimes there are no registered nodes so the AM can't be allocated and test will fail:
> {code:java}
> 2021-03-04 21:58:25,434 DEBUG [main] allocator.RegularContainerAllocator (RegularContainerAllocator.java:canAssign(312)) - ******Can't assign container, no nodes... rmContext: 2a8dd942, scheduler: 2322e56f
> {code}
> In these cases, this is also printed from org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler#getNumClusterNodes:
> {code:java}
> 2021-03-04 21:58:25,379 DEBUG [main] capacity.CapacityScheduler (CapacityScheduler.java:getNumClusterNodes(290)) - ***Called real getNumClusterNodes
> {code}
> h2. Let's break this down:
>  1. The mocking happens in org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations#setup(org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacitySchedulerConfiguration, boolean):
> {code:java}
> cs.setRMContext(spyRMContext);
> cs.init(csConf);
> cs.start();
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> Under no circumstances this could be allowed to return any other value than 3.
>  However, as mentioned above, sometimes the real method of 'getNumClusterNodes' is called on CapacityScheduler.
> 2. Sometimes, this gets printed to the console:
> {code:java}
> org.mockito.exceptions.misusing.WrongTypeOfReturnValue: 
> Integer cannot be returned by isMultiNodePlacementEnabled()
> isMultiNodePlacementEnabled() should return boolean
> ***
> If you're unsure why you're getting above error read on.
> Due to the nature of the syntax above problem might occur because:
> 1. This exception *might* occur in wrongly written multi-threaded tests.
>    Please refer to Mockito FAQ on limitations of concurrency testing.
> 2. A spy is stubbed using when(spy.foo()).then() syntax. It is safer to stub spies - 
>    - with doReturn|Throw() family of methods. More in javadocs for Mockito.spy() method.
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:166)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.setup(TestReservations.java:114)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestReservations.testReservationNoContinueLook(TestReservations.java:566)
> 	at sun.reflect.GeneratedMethodAccessor34.invoke(Unknown Source)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:325)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:78)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:57)
> 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:290)
> 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:71)
> 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:288)
> 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:58)
> 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:268)
> 	at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
> 	at org.junit.runner.JUnitCore.run(JUnitCore.java:137)
> 	at com.intellij.junit4.JUnit4IdeaTestRunner.startRunnerWithArgs(JUnit4IdeaTestRunner.java:69)
> 	at com.intellij.rt.junit.IdeaTestRunner$Repeater.startRunnerWithArgs(IdeaTestRunner.java:40)
> 	at com.intellij.rt.junit.JUnitStarter.prepareStreamsAndStart(JUnitStarter.java:220)
> 	at com.intellij.rt.junit.JUnitStarter.main(JUnitStarter.java:53)
> {code}
> TestReservations.java:166 is exactly our suspicious stubbing call:
> {code:java}
> when(cs.getNumClusterNodes()).thenReturn(3);
> {code}
> But what the heck is the relation between this and the isMultiNodePlacementEnabled method?
>  Let's dive into this more.
> 3. The only caller of isMultiNodePlacementEnabled is org.apache.hadoop.yarn.server.resourcemanager.scheduler.activities.ActivitiesManager#dynamicallyUpdateAppActivitiesMaxQueueLengthIfNeeded, that gets called from the anonymous Thread instance (assigned to field 'cleanupThread') in class [ActivitiesManager|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L301-L357]
>  The ActivitiesManager.serviceStart is invoked when the CapacityScheduler is started.
> *Theory*: When CS is started [here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/test/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/capacity/TestReservations.java#L156-L160], the ActivitiesManager starts running and eventually cs.isMultiNodePlacementEnabled() will be called by the thread [here|https://github.com/apache/hadoop/blob/6699198b54bf6360c164a6ce7552c8b91a318c59/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/activities/ActivitiesManager.java#L266].
>  As the main thread (JUnit) started to stub method 'getNumClusterNodes', there's an ongoing stubbing so when isMultiNodePlacementEnabled is called, the Answer will hold the value of 3, which is improper for a boolean method, so the stubbing will fail.
>  This is a race-condition situation, it is presented with 2 screenshots.
>  As the stubbing failed, subsequent calls to getNumClusterNodes will return 0 as the original implementation returns 0 for the tests. This is the root cause of the test failures.
> 4. Some references to this mockito "issue":
>  - [https://github.com/mockito/mockito/issues/1151]
>  - [https://github.com/mockito/mockito/issues/1151#issuecomment-336718714]
>  - [https://github.com/mockito/mockito/pull/1200#issuecomment-336717814]
>  The last link points to the documentation that describes the thread-safety of mockito: [https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe]
> *CONCLUSION & FIX*
>  As all tests in this class are affected by the setup method and this thread-safety issue, we need to fix it consistently.
>  Actually, the fix is quite simple: All the stub calls of the CS instance should happen before the CS.init / CS.start calls, that triggers the ActivitiesManager thread.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org