Posted to issues@flink.apache.org by "Andrey Zagrebin (JIRA)" <ji...@apache.org> on 2019/05/14 14:42:00 UTC
[jira] [Commented] (FLINK-11631)
TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable
on Travis
[ https://issues.apache.org/jira/browse/FLINK-11631?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16839480#comment-16839480 ]
Andrey Zagrebin commented on FLINK-11631:
-----------------------------------------
[~kisimple] as you are working on a potential fix for this problem, would you be willing to check whether your [PR|https://github.com/apache/flink/pull/7757] fully fixes the problem described here?
I would suggest reproducing this test failure in your Travis CI account by looping the unstable test, TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination, on the master branch. An example of how to do this can be found in this [commit|https://github.com/azagrebin/flink/commit/6961315f82a82b9810af1f1ba17d45f2bcf82669].
Then repeat the same test loop with the fix from your PR branch and check that the test is stable again. If you post both Travis CI results here (failing before the fix and passing after it), we can close this issue.
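The idea of looping a test to expose an intermittent failure can be sketched in plain Java. This is only a minimal illustration and not necessarily how the linked commit does it; {{TestLooper}} and {{countFailures}} are hypothetical names, and the test body is passed in as a {{Callable}}.

```java
import java.util.concurrent.Callable;

/** Hypothetical helper (not part of Flink): repeats a test body and counts failing runs. */
final class TestLooper {

    private TestLooper() {}

    /**
     * Runs testBody the given number of times.
     * Returns how many runs ended with an exception or a failed assertion.
     */
    static int countFailures(Callable<Void> testBody, int iterations) {
        int failures = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                testBody.call();
            } catch (Exception | AssertionError e) {
                // A single failing run does not stop the loop; it is only counted.
                failures++;
            }
        }
        return failures;
    }
}
```

A non-zero count on master and a zero count on the PR branch, over the same number of iterations, would be the kind of before/after evidence requested above.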
> TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination unstable on Travis
> ------------------------------------------------------------------------------------
>
> Key: FLINK-11631
> URL: https://issues.apache.org/jira/browse/FLINK-11631
> Project: Flink
> Issue Type: Bug
> Components: Runtime / Coordination, Tests
> Affects Versions: 1.8.0
> Reporter: Till Rohrmann
> Priority: Critical
> Labels: test-stability
>
> The {{TaskExecutorITCase#testJobReExecutionAfterTaskExecutorTermination}} is unstable on Travis. It fails with
> {code}
> 16:12:04.644 [ERROR] testJobReExecutionAfterTaskExecutorTermination(org.apache.flink.runtime.taskexecutor.TaskExecutorITCase) Time elapsed: 1.257 s <<< ERROR!
> org.apache.flink.util.FlinkException: Could not close resource.
> at org.apache.flink.runtime.taskexecutor.TaskExecutorITCase.teardown(TaskExecutorITCase.java:83)
> Caused by: org.apache.flink.util.FlinkException: Error while shutting the TaskExecutor down.
> Caused by: org.apache.flink.util.FlinkException: Could not properly shut down the TaskManager services.
> Caused by: java.lang.IllegalStateException: NetworkBufferPool is not empty after destroying all LocalBufferPools
> {code}
> https://api.travis-ci.org/v3/job/493221318/log.txt
> The problem seems to be caused by the {{TaskExecutor}} not properly waiting for the termination of all running {{Tasks}}. This creates a race condition that can leave some buffers unreturned to the {{BufferPool}}.
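The race described in the issue can be illustrated with a toy model. This is a sketch under stated assumptions, not Flink's implementation and not the fix from the PR: {{ToyBufferPool}} and {{ToyExecutor}} are hypothetical stand-ins, and each task is assumed to hold exactly one buffer until it terminates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical stand-in for a buffer pool; this is not Flink's NetworkBufferPool. */
final class ToyBufferPool {
    private final int total;
    private final AtomicInteger available;

    ToyBufferPool(int total) {
        this.total = total;
        this.available = new AtomicInteger(total);
    }

    void request() { available.decrementAndGet(); }

    void recycle() { available.incrementAndGet(); }

    /** Fails if any buffer is still held, mirroring the kind of check in the stack trace. */
    void destroy() {
        int outstanding = total - available.get();
        if (outstanding != 0) {
            throw new IllegalStateException(outstanding + " buffer(s) not returned to the pool");
        }
    }
}

/** Hypothetical stand-in for an executor that owns tasks and a buffer pool. */
final class ToyExecutor {
    private final ToyBufferPool pool;
    private final List<CompletableFuture<Void>> terminations = new ArrayList<>();

    ToyExecutor(ToyBufferPool pool) { this.pool = pool; }

    /** Starts a task that holds one buffer until stopSignal completes. */
    void startTask(CompletableFuture<Void> stopSignal) {
        pool.request();
        terminations.add(stopSignal.thenRun(pool::recycle));
    }

    /** Racy variant: destroys the pool without waiting for running tasks. */
    void shutDownWithoutWaiting() { pool.destroy(); }

    /** Waits for every task's termination future before destroying the pool. */
    CompletableFuture<Void> shutDownAfterTasks() {
        return CompletableFuture
                .allOf(terminations.toArray(new CompletableFuture<?>[0]))
                .thenRun(pool::destroy);
    }
}
```

In this model, {{shutDownWithoutWaiting}} throws while a task is still running, whereas the future returned by {{shutDownAfterTasks}} completes only once every task has returned its buffer.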
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)