Posted to dev@flink.apache.org by "Liu (Jira)" <ji...@apache.org> on 2021/09/07 02:31:00 UTC

[jira] [Created] (FLINK-24174) MiniClusterTestEnvironment's triggerTaskManagerFailover may get stuck in CommonTestUtils.waitForJobStatus()

Liu created FLINK-24174:
---------------------------

             Summary: MiniClusterTestEnvironment's triggerTaskManagerFailover may get stuck in CommonTestUtils.waitForJobStatus()
                 Key: FLINK-24174
                 URL: https://issues.apache.org/jira/browse/FLINK-24174
             Project: Flink
          Issue Type: Improvement
          Components: Test Infrastructure
            Reporter: Liu


When writing TaskManager failover tests with the [unified testing framework for connectors|https://issues.apache.org/jira/browse/FLINK-19554], I found that triggerTaskManagerFailover may get stuck in CommonTestUtils.waitForJobStatus() as follows:
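 # triggerTaskManagerFailover() is called and synchronously invokes terminateTaskManager().
 # The job status switches from RUNNING to RESTARTING.
 # The job status switches from RESTARTING back to RUNNING.
 # terminateTaskManager() returns.
 # Only now is CommonTestUtils.waitForJobStatus() called. Since the job is already RUNNING again, it never observes FAILING, FAILED or RESTARTING and therefore never returns successfully.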
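The sequence above can be reproduced outside Flink with a small simulation. This is only a sketch under stated assumptions: `SequentialFailoverRace`, `terminateTaskManagerBlocking()` and `waitForStatus()` are hypothetical stand-ins, not Flink APIs. The blocking termination lets the job pass through RESTARTING and return to RUNNING before it returns, so a poller that starts only afterwards never sees RESTARTING and gives up at its deadline.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class SequentialFailoverRace {

    enum JobStatus { RUNNING, RESTARTING }

    // Sleeps without a checked exception; restores the interrupt flag if interrupted.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Hypothetical blocking termination: the job restarts and recovers
    // before this method returns.
    static void terminateTaskManagerBlocking(AtomicReference<JobStatus> status) {
        status.set(JobStatus.RESTARTING);
        sleepQuietly(100); // simulated restart duration
        status.set(JobStatus.RUNNING);
    }

    // Simplified, hypothetical stand-in for CommonTestUtils.waitForJobStatus():
    // polls the status until it matches or the timeout elapses.
    static boolean waitForStatus(
            AtomicReference<JobStatus> status, JobStatus expected, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() < deadline) {
            if (status.get() == expected) {
                return true;
            }
            sleepQuietly(5);
        }
        return false;
    }

    // Terminate first, then start waiting, as in the current implementation.
    static boolean sequentialRunObservesRestart() {
        AtomicReference<JobStatus> status = new AtomicReference<>(JobStatus.RUNNING);
        terminateTaskManagerBlocking(status);
        return waitForStatus(status, JobStatus.RESTARTING, 300);
    }

    public static void main(String[] args) {
        // Prints "observed RESTARTING: false": the wait only ends at its deadline.
        System.out.println("observed RESTARTING: " + sequentialRunObservesRestart());
    }
}
```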

One solution is to call terminateTaskManager() asynchronously and, while it is still running, call CommonTestUtils.waitForJobStatus(). The pseudo code could look as follows:
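{code:java}
public void triggerTaskManagerFailover(JobClient jobClient, Runnable afterFailAction)
        throws Exception {
    // Start terminating the TaskManager without blocking on it.
    CompletableFuture<Void> terminationFuture = terminateTaskManager();
    // Poll the job status while the termination is still in progress, rather than
    // after it has completed, when the job may already be RUNNING again.
    CommonTestUtils.waitForJobStatus(
            jobClient,
            Arrays.asList(JobStatus.FAILING, JobStatus.FAILED, JobStatus.RESTARTING),
            Deadline.fromNow(Duration.ofMinutes(5)));
    // Only then wait for the termination itself to finish.
    terminationFuture.get();
    afterFailAction.run();
    startTaskManager();
}
{code}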
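The effect of the asynchronous variant can be checked with the same kind of simulation, sketched here under stated assumptions: `AsyncFailoverSketch`, `terminateTaskManagerAsync()` and `waitForStatus()` are hypothetical stand-ins, not Flink APIs, and the 500 ms restart duration and 5 ms polling interval are arbitrary. Termination runs on a background thread while the caller polls, so the transient RESTARTING status can be observed as long as it lasts longer than the polling interval.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicReference;

public class AsyncFailoverSketch {

    enum JobStatus { RUNNING, RESTARTING }

    // Sleeps without a checked exception; restores the interrupt flag if interrupted.
    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Hypothetical asynchronous termination: on a background thread the job is
    // RESTARTING for a while and then RUNNING again.
    static CompletableFuture<Void> terminateTaskManagerAsync(AtomicReference<JobStatus> status) {
        return CompletableFuture.runAsync(() -> {
            status.set(JobStatus.RESTARTING);
            sleepQuietly(500); // simulated restart duration
            status.set(JobStatus.RUNNING);
        });
    }

    // Simplified, hypothetical stand-in for CommonTestUtils.waitForJobStatus().
    static boolean waitForStatus(
            AtomicReference<JobStatus> status, JobStatus expected, long timeoutMillis) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (System.nanoTime() < deadline) {
            if (status.get() == expected) {
                return true;
            }
            sleepQuietly(5);
        }
        return false;
    }

    // Mirrors the proposed ordering: start termination, poll while it runs,
    // then wait for the termination to finish.
    static boolean asyncRunObservesRestart() {
        AtomicReference<JobStatus> status = new AtomicReference<>(JobStatus.RUNNING);
        CompletableFuture<Void> terminationFuture = terminateTaskManagerAsync(status);
        boolean observed = waitForStatus(status, JobStatus.RESTARTING, 5_000);
        terminationFuture.join();
        return observed;
    }

    public static void main(String[] args) {
        System.out.println("observed RESTARTING: " + asyncRunObservesRestart());
    }
}
```

Because the wait is still based on polling, a status change shorter than the polling interval could in principle be missed; the asynchronous ordering narrows the window rather than removing polling from the picture.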



--
This message was sent by Atlassian Jira
(v8.3.4#803005)