You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Ufuk Celebi (JIRA)" <ji...@apache.org> on 2015/03/02 18:28:06 UTC

[jira] [Created] (FLINK-1627) Netty channel connect deadlock

Ufuk Celebi created FLINK-1627:
----------------------------------

             Summary: Netty channel connect deadlock 
                 Key: FLINK-1627
                 URL: https://issues.apache.org/jira/browse/FLINK-1627
             Project: Flink
          Issue Type: Bug
            Reporter: Ufuk Celebi


[~StephanEwen] reports the following deadlock (https://travis-ci.org/StephanEwen/incubator-flink/jobs/52755844, logs: https://s3.amazonaws.com/flink.a.o.uce.east/travis-artifacts/StephanEwen/incubator-flink/477/477.2.tar.gz).

{code}
"CHAIN Partition -> Map (Map at testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (2/4)" daemon prio=10 tid=0x00007f5fdc008800 nid=0xe230 in Object.wait() [0x00007f5fca8f2000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000f2a13530> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:179)
	- locked <0x00000000f2a13530> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:125)
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:64)
	at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:53)
	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestIntermediateResultPartition(RemoteInputChannel.java:92)
	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:287)
	- locked <0x00000000f29dbcd8> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:306)
	at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75)
	at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
	at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
	at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91)
	at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
	at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
	at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205)
	at java.lang.Thread.run(Thread.java:745)
{code}

{code}
"CHAIN Partition -> Map (Map at testRestartMultipleTimes(SimpleRecoveryITCase.java:200)) (3/4)" daemon prio=10 tid=0x00007f5fdc005000 nid=0xe22f in Object.wait() [0x00007f5fca9f3000]
   java.lang.Thread.State: TIMED_WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	- waiting on <0x00000000f2a13530> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:179)
	- locked <0x00000000f2a13530> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:125)
	at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:79)
	at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:53)
	at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestIntermediateResultPartition(RemoteInputChannel.java:92)
	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:287)
	- locked <0x00000000f2896f88> (a java.lang.Object)
	at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:306)
	at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:75)
	at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:34)
	at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
	at org.apache.flink.runtime.operators.NoOpDriver.run(NoOpDriver.java:91)
	at org.apache.flink.runtime.operators.RegularPactTask.run(RegularPactTask.java:496)
	at org.apache.flink.runtime.operators.RegularPactTask.invoke(RegularPactTask.java:362)
	at org.apache.flink.runtime.execution.RuntimeEnvironment.run(RuntimeEnvironment.java:205)
	at java.lang.Thread.run(Thread.java:745)
{code}

Two tasks try to connect to a task manager during a data shuffle. One of the two tries to establish the connection and then both wait for the connect to return (waitForChannel).

The problem seems to be related to the channel listener never handing in the channel.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)