Posted to yarn-issues@hadoop.apache.org by "Peter Bacsko (JIRA)" <ji...@apache.org> on 2019/04/03 14:41:00 UTC

[jira] [Created] (YARN-9436) Flaky test testApplicationLifetimeMonitor

Peter Bacsko created YARN-9436:
----------------------------------

             Summary: Flaky test testApplicationLifetimeMonitor
                 Key: YARN-9436
                 URL: https://issues.apache.org/jira/browse/YARN-9436
             Project: Hadoop YARN
          Issue Type: Bug
          Components: scheduler, test
            Reporter: Peter Bacsko
            Assignee: Peter Bacsko


In our test environment, we occasionally encounter this failure:
{noformat}
2019-04-03 12:49:32 [INFO] Running org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
2019-04-03 12:53:08 [ERROR] Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 215.535 s <<< FAILURE! - in org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor
2019-04-03 12:53:08 [ERROR] testApplicationLifetimeMonitor[0](org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor)  Time elapsed: 34.244 s  <<< FAILURE!
2019-04-03 12:53:08 java.lang.AssertionError: Application killed before lifetime value
2019-04-03 12:53:08 	at org.apache.hadoop.yarn.server.resourcemanager.rmapp.TestApplicationLifetimeMonitor.testApplicationLifetimeMonitor(TestApplicationLifetimeMonitor.java:218)
2019-04-03 12:53:08 
{noformat}
The root cause is the condition here:
{noformat}
        Assert.assertTrue("Application killed before lifetime value",
            totalTimeRun > maxLifetime);
{noformat}
However, there are two problems with this condition:
 1. It is logically incorrect. Since the app should be killed after 30 seconds, one would expect to see {{totalTimeRun = maxLifetime}}. Because of some asynchronicity and rounding, {{totalTimeRun}} ends up being 31 most of the time, which is the only reason the assertion usually passes.
 2. Sometimes the application is killed fast enough that {{totalTimeRun}} is exactly 30. This is correct behavior, because in {{setUpCSQueue}} we set the queue's maximum lifetime to {{maxLifetime}}:
{noformat}
    csConf.setMaximumLifetimePerQueue(
        CapacitySchedulerConfiguration.ROOT + ".default", maxLifetime);
    csConf.setDefaultLifetimePerQueue(
        CapacitySchedulerConfiguration.ROOT + ".default", defaultLifetime);
{noformat}
A more appropriate condition is:
{noformat}
        Assert.assertTrue("Application killed before lifetime value",
            totalTimeRun >= maxLifetime);
{noformat}
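To make the boundary case concrete, here is a minimal standalone sketch that compares the strict and the inclusive comparison for run times around the 30-second lifetime. The class and method names ({{LifetimeLowerBound}}, {{strictCheck}}, {{inclusiveCheck}}) are hypothetical and exist only for illustration; they are not part of {{TestApplicationLifetimeMonitor}}.

```java
// Hypothetical standalone sketch; not code from TestApplicationLifetimeMonitor.
final class LifetimeLowerBound {

    // Mirrors the original assertion: rejects a kill that lands exactly on the lifetime.
    static boolean strictCheck(long totalTimeRun, long maxLifetime) {
        return totalTimeRun > maxLifetime;
    }

    // Mirrors the proposed assertion: accepts a kill at or after the lifetime.
    static boolean inclusiveCheck(long totalTimeRun, long maxLifetime) {
        return totalTimeRun >= maxLifetime;
    }

    public static void main(String[] args) {
        long maxLifetime = 30L; // queue lifetime in seconds, as described in this report
        for (long totalTimeRun : new long[] {29L, 30L, 31L}) {
            System.out.println("totalTimeRun=" + totalTimeRun
                + " strict=" + strictCheck(totalTimeRun, maxLifetime)
                + " inclusive=" + inclusiveCheck(totalTimeRun, maxLifetime));
        }
    }
}
```

The two checks only disagree at {{totalTimeRun}} equal to 30, which is exactly the case that makes the test flaky.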
The assertion message in the next line is also misleading:
{noformat}
        Assert.assertTrue(
            "Application killed before lifetime value " + totalTimeRun,
            totalTimeRun < maxLifetime + 10L);
{noformat}
If it is false, it means that the application was killed _after_ 40 seconds, which exceeds both the app's lifetime (40s) and that of the queue (30s). A more accurate message would be:
{noformat}
        Assert.assertTrue(
            "Application killed after queue/app lifetime value: " + totalTimeRun,
            totalTimeRun < maxLifetime + 10L);
{noformat}
We can even be stricter, since we expect a kill almost immediately after 30 seconds:
{noformat}
        Assert.assertTrue(
            "Application killed too late: " + totalTimeRun,
            totalTimeRun < maxLifetime + 2L);
{noformat}
where we allow a 2-second tolerance.
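
Taken together, the lower and upper bounds describe an acceptance window for the kill time. The following is a rough sketch of that window as a single helper, assuming a lifetime of 30 seconds and a tolerance of 2 seconds. The names {{LifetimeWindow}} and {{describeViolation}} are hypothetical and are not part of the existing test.

```java
// Hypothetical standalone sketch; not code from TestApplicationLifetimeMonitor.
final class LifetimeWindow {

    // Returns an empty string when the kill time lies in
    // [maxLifetime, maxLifetime + tolerance), otherwise a failure message.
    static String describeViolation(long totalTimeRun, long maxLifetime, long tolerance) {
        if (totalTimeRun < maxLifetime) {
            return "Application killed before lifetime value: " + totalTimeRun;
        }
        if (totalTimeRun >= maxLifetime + tolerance) {
            return "Application killed too late: " + totalTimeRun;
        }
        return "";
    }

    public static void main(String[] args) {
        long maxLifetime = 30L; // seconds, as described in this report
        long tolerance = 2L;    // seconds, the tolerance suggested above
        for (long totalTimeRun : new long[] {29L, 30L, 31L, 32L}) {
            String violation = describeViolation(totalTimeRun, maxLifetime, tolerance);
            System.out.println(totalTimeRun + " -> "
                + (violation.isEmpty() ? "ok" : violation));
        }
    }
}
```

With these values, run times of 30 and 31 seconds are accepted, while 29 and 32 are reported with a message that matches the direction of the failure.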


