Posted to user@flink.apache.org by Charith Wickramarachchi <ch...@gmail.com> on 2017/07/25 21:06:46 UTC

Connecting to remote task manager failed

Hi All,

I'm getting an exception when running a Gelly task using the Pregel model.
It complains that the remote task manager might be lost, but the task
managers appear to be active according to the Flink dashboard. Other tasks
also run without any issue.


The following is a summary of the exception trace. I have attached the full
trace as well. It would be great if you could provide any pointers for
identifying the issue.


Flink version: flink-1.1.3
Java: 1.7

Caused by: java.io.IOException: Thread 'SortMerger Reading Thread' terminated due to an exception: Connecting the channel failed: Connecting to remote task manager + 'worker/127.0.1.1:44310' has failed. This might indicate that the remote task manager has been lost.
        at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:800)
Caused by: java.io.IOException: Connecting the channel failed: Connecting to remote task manager + 'worker/127.0.1.1:44310' has failed. This might indicate that the remote task manager has been lost.
        at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.waitForChannel(PartitionRequestClientFactory.java:196)
        at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.access$000(PartitionRequestClientFactory.java:131)
        at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory.createPartitionRequestClient(PartitionRequestClientFactory.java:83)
        at org.apache.flink.runtime.io.network.netty.NettyConnectionManager.createPartitionRequestClient(NettyConnectionManager.java:60)
        at org.apache.flink.runtime.io.network.partition.consumer.RemoteInputChannel.requestSubpartition(RemoteInputChannel.java:118)
        at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:394)
        at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.getNextBufferOrEvent(SingleInputGate.java:413)
        at org.apache.flink.runtime.io.network.api.reader.AbstractRecordReader.getNextRecord(AbstractRecordReader.java:87)
        at org.apache.flink.runtime.io.network.api.reader.MutableRecordReader.next(MutableRecordReader.java:42)
        at org.apache.flink.runtime.operators.util.ReaderIterator.next(ReaderIterator.java:59)
        at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ReadingThread.go(UnilateralSortMerger.java:973)
        at org.apache.flink.runtime.operators.sort.UnilateralSortMerger$ThreadBase.run(UnilateralSortMerger.java:796)
Caused by: org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connecting to remote task manager + 'worker/127.0.1.1:44310' has failed. This might indicate that the remote task manager has been lost.
        at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:219)
        at org.apache.flink.runtime.io.network.netty.PartitionRequestClientFactory$ConnectingChannel.operationComplete(PartitionRequestClientFactory.java:131)

Thanks,
Charith


-- 
Charith Dhanushka Wickramaarachchi

Tel  +1 213 447 4253
Blog  http://charith.wickramaarachchi.org/
<http://charithwiki.blogspot.com/>
Twitter  @charithwiki <https://twitter.com/charithwiki>

This communication may contain privileged or other confidential information
and is intended exclusively for the addressee/s. If you are not the
intended recipient/s, or believe that you may have
received this communication in error, please reply to the sender indicating
that fact and delete the copy you received and in addition, you should not
print, copy, retransmit, disseminate, or otherwise use the information
contained in this communication. Internet communications cannot be
guaranteed to be timely, secure, error or virus-free. The sender does not
accept liability for any errors or omissions

Re: Connecting to remote task manager failed

Posted by Charith Wickramarachchi <ch...@gmail.com>.
Hi Fabian,

I see the following exception in the worker logs (the trace is similar to
the one I attached above). I'm wondering if it's a network configuration
issue, because it refers to a 127.0.1.1 address.

Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Connecting the channel failed: Connecting to remote task manager + 'worker/127.0.1.1:43061' has failed. This might indicate that the remote task manager has been lost.
        at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:619)
        at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1098)
        at org.apache.flink.runtime.operators.BatchTask.initLocalStrategies(BatchTask.java:831)
        at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:327)
        ... 2 more
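
The 127.0.1.1 address in the trace above falls in the loopback range
(127.0.0.0/8), so a connection to it never leaves the connecting machine.
One possible explanation, not confirmed in this thread, is that the worker's
hostname resolves to a loopback entry in its hosts file (Debian and Ubuntu
commonly map the local hostname to 127.0.1.1 in /etc/hosts) and the task
manager advertises that address to its peers. The Python sketch below is a
hypothetical diagnostic, not part of Flink: the helper names remote_address,
is_loopback and local_hostname_resolution are made up for illustration. It
extracts the 'host/ip:port' fragment from an exception message and reports
whether the address is loopback.

```python
import ipaddress
import re
import socket

# Matches the 'host/ip:port' fragment as it appears in the exception message.
# \s* tolerates a line wrap between the slash and the IP address.
ADDRESS_PATTERN = re.compile(
    r"'(?P<host>[^/']*)/\s*(?P<ip>[0-9.]+):(?P<port>\d+)'"
)


def remote_address(message):
    """Return (host, ip, port) parsed from the message, or None if absent."""
    match = ADDRESS_PATTERN.search(message)
    if match is None:
        return None
    return match.group("host"), match.group("ip"), int(match.group("port"))


def is_loopback(ip):
    """True if the address is only reachable from the local machine."""
    return ipaddress.ip_address(ip).is_loopback


def local_hostname_resolution():
    """Return (hostname, ip) as this machine resolves its own hostname.

    May raise OSError if the hostname cannot be resolved at all.
    """
    hostname = socket.gethostname()
    return hostname, socket.gethostbyname(hostname)
```

Calling local_hostname_resolution() on each worker and passing the returned
IP to is_loopback() would show whether that machine resolves its own hostname
to a loopback address.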


Thanks,
Charith

On Wed, Jul 26, 2017 at 4:29 AM, Fabian Hueske <fh...@gmail.com> wrote:

> Hi,
>
> Do you have an exception for the CoGroup failure?
>
> Best, Fabian

Re: Connecting to remote task manager failed

Posted by Fabian Hueske <fh...@gmail.com>.
Hi,

Do you have an exception for the CoGroup failure?

Best, Fabian

2017-07-26 3:32 GMT+02:00 Charith Wickramarachchi <
charith.dhanushka@gmail.com>:

> I did some more digging. It seems the CoGroup operation failed in one of
> the workers. But I do not face this issue when running other tasks.
>
> Thanks,
> Charith

Re: Connecting to remote task manager failed

Posted by Charith Wickramarachchi <ch...@gmail.com>.
I did some more digging. It seems the CoGroup operation failed in one of
the workers. But I do not face this issue when running other tasks.

Thanks,
Charith
