Posted to issues@flink.apache.org by "Ufuk Celebi (JIRA)" <ji...@apache.org> on 2015/06/02 23:56:49 UTC

[jira] [Commented] (FLINK-2134) Deadlock in SuccessAfterNetworkBuffersFailureITCase

    [ https://issues.apache.org/jira/browse/FLINK-2134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569835#comment-14569835 ] 

Ufuk Celebi commented on FLINK-2134:
------------------------------------

I'm debugging this. So far I've found that some backwards termination events (from the sync task to the heads) get lost. The sync task sends the events out to all tasks and then clears itself. I'm still trying to figure out where the events get lost.
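To illustrate why a lost termination event shows up as a hang rather than a failure, here is a minimal sketch of a monitor-based kickoff latch. It is not Flink's SuperstepKickoffLatch: the class name SimpleKickoffLatch, the Result enum, and the timeout parameter are assumptions made only for this illustration. A waiter leaves the loop only when a superstep or termination signal arrives, so if the signal never arrives, the only way out is the timeout.

```java
import java.util.concurrent.TimeUnit;

/** Hypothetical, simplified latch for illustration only; not Flink code. */
class SimpleKickoffLatch {

    /** Outcome of a bounded wait (assumed for this sketch). */
    enum Result { NEXT_SUPERSTEP, TERMINATED, TIMED_OUT }

    private final Object monitor = new Object();
    private int superstep = 1;
    private boolean terminated;

    /** Called when the "start next superstep" event arrives. */
    void triggerNextSuperstep() {
        synchronized (monitor) {
            superstep++;
            monitor.notifyAll();
        }
    }

    /** Called when the termination event arrives. */
    void signalTermination() {
        synchronized (monitor) {
            terminated = true;
            monitor.notifyAll();
        }
    }

    /**
     * Waits until the given superstep has started, termination was signalled,
     * or the timeout elapsed. If the event is lost, neither flag ever changes
     * and only the timeout ends the wait.
     */
    Result await(int expectedSuperstep, long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (monitor) {
            while (!terminated && superstep < expectedSuperstep) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return Result.TIMED_OUT;
                }
                // Round up so we never call wait(0), which would wait indefinitely.
                monitor.wait(TimeUnit.NANOSECONDS.toMillis(remainingNanos) + 1);
            }
            return terminated ? Result.TERMINATED : Result.NEXT_SUPERSTEP;
        }
    }
}
```

In this sketch, a waiter whose termination signal was dropped returns TIMED_OUT instead of blocking forever. A latch that simply re-enters the wait after each timeout would show the same TIMED_WAITING state seen in the thread dump of the issue description.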

> Deadlock in SuccessAfterNetworkBuffersFailureITCase
> ---------------------------------------------------
>
>                 Key: FLINK-2134
>                 URL: https://issues.apache.org/jira/browse/FLINK-2134
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: master
>            Reporter: Ufuk Celebi
>
> I ran into the issue in a Travis run for a PR: https://s3.amazonaws.com/archive.travis-ci.org/jobs/64994288/log.txt
> I can reproduce this locally by running SuccessAfterNetworkBuffersFailureITCase multiple times:
> {code}
> cluster = new ForkableFlinkMiniCluster(config, false);
> for (int i = 0; i < 100; i++) {
>    // run test programs CC, KMeans, CC
> }
> {code}
> The iteration tasks wait for superstep notifications like this:
> {code}
> "Join (Join at runConnectedComponents(SuccessAfterNetworkBuffersFailureITCase.java:128)) (8/6)" daemon prio=5 tid=0x00007f95f374f800 nid=0x138a7 in Object.wait() [0x0000000123f2a000]
>    java.lang.Thread.State: TIMED_WAITING (on object monitor)
> 	at java.lang.Object.wait(Native Method)
> 	- waiting on <0x00000007f89e3440> (a java.lang.Object)
> 	at org.apache.flink.runtime.iterative.concurrent.SuperstepKickoffLatch.awaitStartOfSuperstepOrTermination(SuperstepKickoffLatch.java:57)
> 	- locked <0x00000007f89e3440> (a java.lang.Object)
> 	at org.apache.flink.runtime.iterative.task.IterationTailPactTask.run(IterationTailPactTask.java:131)
> 	at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
> 	at org.apache.flink.runtime.taskmanager.Task.run(Task.java:559)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}
> I've asked [~rmetzger] to reproduce this and it deadlocks for him as well. The system needs to be under some load for this to occur after multiple runs.


