Posted to yarn-issues@hadoop.apache.org by "Weiwei Yang (JIRA)" <ji...@apache.org> on 2018/07/13 11:29:00 UTC

[jira] [Commented] (YARN-7748) TestContainerResizing.testIncreaseContainerUnreservedWhenApplicationCompleted failed

    [ https://issues.apache.org/jira/browse/YARN-7748?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542923#comment-16542923 ] 

Weiwei Yang commented on YARN-7748:
-----------------------------------

I spent some time investigating this issue, as it happens to us too. I can reliably reproduce it by adding a sleep right after the call that kills the application:
{code:java}
// Kill the application
cs.handle(new AppAttemptRemovedSchedulerEvent(am1.getApplicationAttemptId(),
        RMAppAttemptState.KILLED, false));
// Sleep a few seconds so that all events are
// handled before verifying the metrics
Thread.sleep(3000);{code}
As [~haibochen] mentioned, this happens because {{LeafQueue#finishApplicationAttempt}} is called *twice* for the *same* app attempt, so the metric is incorrectly decremented to *-1*:
{code:java}
default #user-pending-applications: -1
{code}
This is how it happens:
 # The UT kills the app attempt by triggering an {{AppAttemptRemovedSchedulerEvent}}, which causes the leaf queue to call {{removeApplicationAttempt}} immediately.
 # The scheduler then releases all containers, including the AM container.
 # When the AM container is killed, it triggers an {{RMAppAttemptContainerFinishedEvent}}, and {{LeafQueue#removeApplicationAttempt}} is called again.
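
The sequence above can be illustrated with a minimal sketch. It uses a toy queue, not the real {{LeafQueue}}; the names {{ToyQueueMetrics}}, {{DoubleRemovalDemo}} and the attempt ID are hypothetical and exist only for illustration. It shows how an unguarded removal that runs twice for the same attempt drives a counter below zero.

```java
// Toy stand-in for a leaf queue's pending-application counter.
// This is NOT the real LeafQueue; all names here are hypothetical.
class ToyQueueMetrics {
    int pendingApplications = 0;

    void submitAttempt() {
        pendingApplications++;
    }

    // Unguarded removal: every call decrements the counter,
    // even if the attempt has already been removed.
    void removeAttempt(String attemptId) {
        pendingApplications--;
    }
}

class DoubleRemovalDemo {
    static int run() {
        ToyQueueMetrics queue = new ToyQueueMetrics();
        queue.submitAttempt();             // one pending attempt

        // Step 1: the attempt-removed event removes the attempt.
        queue.removeAttempt("attempt_1");
        // Step 2: containers are released (not modelled here).
        // Step 3: the container-finished event removes the same attempt again.
        queue.removeAttempt("attempt_1");

        return queue.pendingApplications;
    }

    public static void main(String[] args) {
        System.out.println("pending applications: " + run());
    }
}
```

In this toy model the counter ends at -1, which matches the shape of the metric seen in the failing run.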

To fix this, I posted a patch that adds a check before {{LeafQueue#removeApplicationAttempt}} to make sure the app attempt still exists at the time it is removed. The patch also disables app-attempt restart in the UT to avoid another issue (this UT is only supposed to check resources for one app attempt). With these two changes together, the UT should pass reliably.
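
For readers who want the idea of the existence check in isolation, here is a rough sketch. It is not the patch itself: {{GuardedToyQueue}} and its methods are hypothetical, and the real change lives in the scheduler code. The sketch only assumes that the queue tracks which attempts are still live and skips the decrement when the attempt is already gone.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical toy queue; NOT the real LeafQueue or the actual patch.
class GuardedToyQueue {
    private final Set<String> liveAttempts = new HashSet<>();
    private int pendingApplications = 0;

    void submitAttempt(String attemptId) {
        // Count the attempt only once, even if submitted twice.
        if (liveAttempts.add(attemptId)) {
            pendingApplications++;
        }
    }

    // Returns true only if the attempt was still tracked and is removed now.
    boolean removeAttempt(String attemptId) {
        if (!liveAttempts.remove(attemptId)) {
            // Attempt already removed: do not touch the metrics again.
            return false;
        }
        pendingApplications--;
        return true;
    }

    int getPendingApplications() {
        return pendingApplications;
    }
}
```

With this guard, a second {{removeAttempt}} call for the same ID returns false and leaves the counter at 0 rather than -1, so the removal becomes idempotent.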

Please help review.

> TestContainerResizing.testIncreaseContainerUnreservedWhenApplicationCompleted failed
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-7748
>                 URL: https://issues.apache.org/jira/browse/YARN-7748
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.0.0
>            Reporter: Haibo Chen
>            Assignee: Szilard Nemeth
>            Priority: Major
>
> TestContainerResizing.testIncreaseContainerUnreservedWhenApplicationCompleted
> Failing for the past 1 build (Since Failed#19244 )
> Took 0.4 sec.
> *Error Message*
> expected null, but was:<or...@6193932a>
> *Stacktrace*
> {code}
> java.lang.AssertionError: expected null, but was:<or...@6193932a>
> 	at org.junit.Assert.fail(Assert.java:88)
> 	at org.junit.Assert.failNotNull(Assert.java:664)
> 	at org.junit.Assert.assertNull(Assert.java:646)
> 	at org.junit.Assert.assertNull(Assert.java:656)
> 	at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.TestContainerResizing.testIncreaseContainerUnreservedWhenApplicationCompleted(TestContainerResizing.java:826)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> 	at java.lang.reflect.Method.invoke(Method.java:498)
> 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
> 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:50)
> 	at org.junit.runners.ParentRunner$3.run(ParentRunner.java:238)
> 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:63)
> 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:236)
> 	at org.junit.runners.ParentRunner.access$000(ParentRunner.java:53)
> 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:229)
> 	at org.junit.runners.ParentRunner.run(ParentRunner.java:309)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:369)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:275)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:239)
> 	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:160)
> 	at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:373)
> 	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:334)
> 	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:119)
> 	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:407)
> {code}


