Posted to issues@flink.apache.org by "Andrey Zagrebin (JIRA)" <ji...@apache.org> on 2019/05/14 14:42:00 UTC

[jira] [Commented] (FLINK-11631) TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable on Travis

    [ https://issues.apache.org/jira/browse/FLINK-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839480#comment-16839480 ] 

Andrey Zagrebin commented on FLINK-11631:
-----------------------------------------

[~kisimple] as you are working on a potential fix for this problem, would you be open to checking whether your [PR|https://github.com/apache/flink/pull/7757] fully fixes the problem described here?

I would suggest trying to reproduce this test failure in your Travis CI account by looping the unstable test, TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination, on the master branch. An example of how to do this can be found in this [commit|https://github.com/azagrebin/flink/commit/6961315f82a82b9810af1f1ba17d45f2bcf82669].

Then repeat the same test loop with the fix from your PR branch and check that the test is stable again. If you post both Travis CI results here (failing before the fix and passing after the fix), we can close this issue.
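
For illustration only, here is a minimal sketch of what looping a test amounts to. It is not taken from the linked commit, whose approach may differ. The class {{FlakyTestLooper}}, the interface {{TestBody}} and the simulated failure are hypothetical names made up for this example; in practice the body would invoke the real test method.

```java
import java.util.ArrayList;
import java.util.List;

public class FlakyTestLooper {

    /** Hypothetical stand-in for a test method that may throw. */
    @FunctionalInterface
    public interface TestBody {
        void run() throws Exception;
    }

    /**
     * Runs the given body the requested number of times and collects every
     * failure instead of stopping at the first one, so the failure rate
     * before and after a fix can be compared.
     */
    public static List<Throwable> loop(TestBody body, int iterations) {
        List<Throwable> failures = new ArrayList<>();
        for (int i = 0; i < iterations; i++) {
            try {
                body.run();
            } catch (Throwable t) {
                // Covers both exceptions and assertion errors.
                failures.add(t);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated flaky body (assumption): fails on every tenth invocation.
        List<Throwable> failures = loop(() -> {
            if (++calls[0] % 10 == 0) {
                throw new IllegalStateException("simulated intermittent failure");
            }
        }, 100);
        System.out.println(failures.size() + " of 100 runs failed");
    }
}
```

Running the same loop once on master and once on the PR branch gives two failure counts that can be compared directly.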

> TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable on Travis
> ------------------------------------------------------------------------------------
>
>                 Key: FLINK-11631
>                 URL: https://issues.apache.org/jira/browse/FLINK-11631
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination, Tests
>    Affects Versions: 1.8.0
>            Reporter: Till Rohrmann
>            Priority: Critical
>              Labels: test-stability
>
> The {{TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination}} is unstable on Travis. It fails with 
> {code}
> 16:12:04.644 [ERROR] testJobReExecutionAfterTaskExecutorTermination(org.apache.flink.runtime.taskexecutor.TaskExecutorITCase)  Time elapsed: 1.257 s  <<< ERROR!
> org.apache.flink.util.FlinkException: Could not close resource.
> 	at org.apache.flink.runtime.taskexecutor.TaskExecutorITCase.teardown(TaskExecutorITCase.java:83)
> Caused by: org.apache.flink.util.FlinkException: Error while shutting the TaskExecutor down.
> Caused by: org.apache.flink.util.FlinkException: Could not properly shut down the TaskManager services.
> Caused by: java.lang.IllegalStateException: NetworkBufferPool is not empty after destroying all LocalBufferPools
> {code} 
> https://api.travis-ci.org/v3/job/493221318/log.txt
> The problem seems to be caused by the {{TaskExecutor}} not properly waiting for the termination of all running {{Tasks}}. This leads to a race condition in which not all buffers are returned to the {{BufferPool}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)