Posted to issues@flink.apache.org by "Xintong Song (JIRA)" <ji...@apache.org> on 2019/07/22 03:46:00 UTC

[jira] [Comment Edited] (FLINK-13242) StandaloneResourceManagerTest fails on travis

    [ https://issues.apache.org/jira/browse/FLINK-13242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16889630#comment-16889630 ] 

Xintong Song edited comment on FLINK-13242 at 7/22/19 3:45 AM:
---------------------------------------------------------------

Hi [~azagrebin], I think I found the problem.

_StandaloneResourceManager#initialize()_ uses _getMainThreadExecutor()_ to execute _setFailUnfulfillableRequest()_. However, before _setFailUnfulfillableRequest()_ is executed, the resource manager's main thread executor may be replaced by a new one when the resource manager accepts the granted leadership, so _setFailUnfulfillableRequest()_ is never executed. This only happens when _StandaloneResourceManager#initialize()_ is invoked before _TestingLeaderElectionService#isLeader()_.
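To make the ordering issue concrete, here is a toy model of it in plain Java. It is not Flink code: _Endpoint_, _SessionExecutor_, _Envelope_ and the session counter are hypothetical stand-ins, and the assumption that work submitted through a replaced executor is silently dropped is only my reading of the behaviour described above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Toy model of the race. NOT Flink code; every name here is hypothetical. */
public class StaleExecutorSketch {

    /** A task tagged with the leader session of the executor that submitted it. */
    static final class Envelope {
        final int session;
        final Runnable task;

        Envelope(int session, Runnable task) {
            this.session = session;
            this.task = task;
        }
    }

    /** Stand-in for a main thread executor that is bound to one leader session. */
    static final class SessionExecutor {
        private final int session;
        private final Endpoint endpoint;

        SessionExecutor(int session, Endpoint endpoint) {
            this.session = session;
            this.endpoint = endpoint;
        }

        void execute(Runnable task) {
            endpoint.mailbox.add(new Envelope(session, task));
        }
    }

    /** Stand-in for the resource manager endpoint. */
    static final class Endpoint {
        final Queue<Envelope> mailbox = new ArrayDeque<>();
        int currentSession = 0;
        SessionExecutor executor = new SessionExecutor(0, this);
        boolean failUnfulfillableRequest = false;

        SessionExecutor getMainThreadExecutor() {
            return executor;
        }

        /** Accepting leadership replaces the executor with a new one. */
        void grantLeadership() {
            currentSession++;
            executor = new SessionExecutor(currentSession, this);
        }

        /** Runs queued tasks; tasks from a stale executor are dropped (assumption). */
        void drainMailbox() {
            Envelope envelope;
            while ((envelope = mailbox.poll()) != null) {
                if (envelope.session == currentSession) {
                    envelope.task.run();
                }
            }
        }
    }

    /** Returns whether the flag ended up set for the given ordering. */
    static boolean run(boolean initializeBeforeLeadership) {
        Endpoint rm = new Endpoint();
        if (initializeBeforeLeadership) {
            // initialize() fetches the executor first, then leadership is granted.
            SessionExecutor captured = rm.getMainThreadExecutor();
            rm.grantLeadership();
            captured.execute(() -> rm.failUnfulfillableRequest = true);
        } else {
            // Leadership is granted first, so initialize() sees the new executor.
            rm.grantLeadership();
            rm.getMainThreadExecutor().execute(() -> rm.failUnfulfillableRequest = true);
        }
        rm.drainMailbox();
        return rm.failUnfulfillableRequest;
    }

    public static void main(String[] args) {
        System.out.println("leadership first: " + run(false)); // true
        System.out.println("initialize first: " + run(true));  // false
    }
}
```

In this model the flag is only lost when the executor is fetched before leadership is granted, which matches the ordering described above.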

The problem can be reproduced and verified as follows:
 * Add logs in _StandaloneResourceManager#initialize()_, _TestingLeaderElectionService#isLeader()_ and _ResourceManager#setFailUnfulfillableRequest()_, and run the test. In most cases, you should see that _TestingLeaderElectionService#isLeader()_ is invoked before _StandaloneResourceManager#initialize()_, that _setFailUnfulfillableRequest()_ is invoked, and that the test passes.
 * Add a short sleep (100ms in my case) in _MockResourceManagerRuntimeServices#grantLeadership()_ before _rmLeaderElectionService.isLeader()_, and run the test again. Now you should see that _TestingLeaderElectionService#isLeader()_ is invoked after _StandaloneResourceManager#initialize()_, that _setFailUnfulfillableRequest()_ is never invoked, and that the test fails.
 * Add another short sleep (also 100ms in my case) in _StandaloneResourceManager#initialize()_, inside the _getRpcService().getScheduledExecutor().schedule()_ block, right before _getMainThreadExecutor()_. This should restore the original invocation order and fix the failure.
 * If you invoke _getMainThreadExecutor()_ twice in _StandaloneResourceManager#initialize()_, once before the sleep and once after it, and print the fetched main thread executors, you should find that they are two different objects.
 * If you now remove the sleep in _StandaloneResourceManager#initialize()_, you should see that the two printed main thread executors are the same object, and the test is broken again.

I'm thinking that maybe _setFailUnfulfillableRequest(true)_ does not need to be invoked on the RPC main thread. Instead of going through the main thread executor, I tried calling _setFailUnfulfillableRequest(true)_ directly in the _getRpcService().getScheduledExecutor().schedule()_ block in _StandaloneResourceManager#initialize()_, and it fixes the problem.
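A minimal sketch of what I mean, in plain Java rather than the actual Flink classes. A _java.util.concurrent.ScheduledExecutorService_ stands in for _getRpcService().getScheduledExecutor()_, and _DirectSetSketch_, the _startupPeriodMillis_ parameter and the returned future are hypothetical. The flag is _volatile_ here only because the sketch writes it from the scheduler thread; whether the real field needs that is a separate question.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch of setting the flag directly from the scheduled block. NOT Flink code. */
public class DirectSetSketch {

    // Written by the scheduler thread, read by other threads (assumption of this sketch).
    private volatile boolean failUnfulfillableRequest = false;

    void setFailUnfulfillableRequest(boolean value) {
        failUnfulfillableRequest = value;
    }

    boolean isFailingUnfulfillableRequest() {
        return failUnfulfillableRequest;
    }

    /**
     * Schedules the flag change after the startup period. The setter is called
     * directly in the scheduled block, so no main thread executor is captured
     * and a later executor replacement cannot lose the call.
     */
    ScheduledFuture<?> initialize(ScheduledExecutorService scheduler, long startupPeriodMillis) {
        return scheduler.schedule(
                () -> setFailUnfulfillableRequest(true),
                startupPeriodMillis,
                TimeUnit.MILLISECONDS);
    }

    /** Runs the sketch once and reports whether the flag was set. */
    static boolean demo() {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            DirectSetSketch rm = new DirectSetSketch();
            ScheduledFuture<?> pending = rm.initialize(scheduler, 50);
            pending.get(5, TimeUnit.SECONDS); // wait for the scheduled block to run
            return rm.isFailingUnfulfillableRequest();
        } catch (InterruptedException | ExecutionException | TimeoutException e) {
            throw new IllegalStateException(e);
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("flag set: " + demo());
    }
}
```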

I think that in production we do not care whether _setFailUnfulfillableRequest(true)_ happens on the main thread, as long as it eventually gets invoked. For this test case, there may be a slight inconsistency: after _setFailUnfulfillableRequest(true)_ is called, _isFailingUnfulfillableRequest()_ may not return the correct result immediately. I think that is acceptable, and the 10s timeout for _assertHappensUntil()_ should be long enough to eventually observe the invocation of _setFailUnfulfillableRequest(true)_. What do you think?
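To illustrate why a short delay in visibility should be harmless for the test, here is a sketch of a polling assertion with a deadline. It is not the actual _assertHappensUntil()_ from _StandaloneResourceManagerTest_; the signature, the 2ms polling interval and the demo methods are assumptions, and only the failure message is taken from the stack trace in the issue.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Sketch of a deadline-based polling assertion. NOT the actual Flink test helper. */
public class AssertHappensUntilSketch {

    /** Polls the condition until it holds or the timeout elapses. */
    static void assertHappensUntil(BooleanSupplier condition, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError("condition was not fulfilled before the deadline");
            }
            try {
                Thread.sleep(2); // hypothetical polling interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new AssertionError("interrupted while waiting for the condition");
            }
        }
    }

    /** The flag is set from another thread after 50ms; the assertion still passes. */
    static boolean demoEventuallyTrue() {
        AtomicBoolean flag = new AtomicBoolean(false);
        Thread setter = new Thread(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            flag.set(true);
        });
        setter.start();
        assertHappensUntil(flag::get, Duration.ofSeconds(10));
        return flag.get();
    }

    /** A condition that never holds fails once the deadline has passed. */
    static boolean demoTimesOut() {
        try {
            assertHappensUntil(() -> false, Duration.ofMillis(100));
            return false;
        } catch (AssertionError expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("eventually true: " + demoEventuallyTrue());
        System.out.println("times out: " + demoTimesOut());
    }
}
```

With polling like this, the assertion only fails if the flag is never set within the timeout, so it does not matter that the change becomes visible slightly after the setter is called.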


> StandaloneResourceManagerTest fails on travis
> ---------------------------------------------
>
>                 Key: FLINK-13242
>                 URL: https://issues.apache.org/jira/browse/FLINK-13242
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Chesnay Schepler
>            Assignee: Andrey Zagrebin
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.9.0, 1.10.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> https://travis-ci.org/apache/flink/jobs/557696989
> {code}
> 08:28:06.475 [ERROR] testStartupPeriod(org.apache.flink.runtime.resourcemanager.StandaloneResourceManagerTest)  Time elapsed: 10.276 s  <<< FAILURE!
> java.lang.AssertionError: condition was not fulfilled before the deadline
> 	at org.apache.flink.runtime.resourcemanager.StandaloneResourceManagerTest.assertHappensUntil(StandaloneResourceManagerTest.java:114)
> 	at org.apache.flink.runtime.resourcemanager.StandaloneResourceManagerTest.testStartupPeriod(StandaloneResourceManagerTest.java:60)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)