Posted to user@flink.apache.org by "yinhua.dai" <yi...@outlook.com> on 2019/03/28 11:23:22 UTC

RemoteTransportException: Connection unexpectedly closed by remote task manager

Hi,

I wrote a single Flink job using Flink SQL, version 1.6.1.
It has one table source, which reads data from a database, and one table sink,
which writes the output as an Avro file.
The table source has a parallelism of 19, and the table sink has a parallelism
of 1.

But there is always a RemoteTransportException when the job is nearly done (all
data source instances have finished, and the data sink has been running for a
while).
The detailed error is below:

2019-03-28 07:53:49,086 ERROR org.apache.flink.runtime.operators.DataSinkTask               - Error in user code: Connection unexpectedly closed by remote task manager 'ip-10-97-34-40.tr-fr-nonprod.aws-int.thomsonreuters.com/10.97.34.40:46625'. This might indicate that the remote task manager was lost.:  DataSink (com.tr.apt.sqlengine.tables.s3.AvroFileTableSink$AvroOutputFormat@42d174ad) (1/1)
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-10-97-34-40.tr-fr-nonprod.aws-int.thomsonreuters.com/10.97.34.40:46625'. This might indicate that the remote task manager was lost.
        at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:143)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
        at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:377)
        at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
        at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1429)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
        at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:947)
        at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
        at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
        at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
        at java.lang.Thread.run(Thread.java:748)
2019-03-28 07:53:49,440 INFO  com.tr.apt.sqlengine.tables.s3.AbstractFileOutputFormat       - FileTableSink sinked all data to : file:///tmp/shareamount.avro
2019-03-28 07:53:49,441 INFO  org.apache.flink.runtime.taskmanager.Task                     - DataSink (com.tr.apt.sqlengine.tables.s3.AvroFileTableSink$AvroOutputFormat@42d174ad) (1/1) (31fd3e6fdbb1576e7288e202fff69b07) switched from RUNNING to FAILED.


Does the error mean that the data sink failed to read all of the data from some
data source instance before that source instance shut down?

When I check the log of the task manager (10.97.34.40:46625), it all looks fine: it
shows that it finished its work successfully, received signal 15 (SIGTERM), and
then terminated.
So how should I find the root cause of the error?




--
Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/

Re: RemoteTransportException: Connection unexpectedly closed by remote task manager

Posted by "yinhua.dai" <yi...@outlook.com>.
I have put the log of the task manager running the data sink at
https://gist.github.com/yinhua2018/7de42ff9c1738d5fdf9d99030db903e2




Re: RemoteTransportException: Connection unexpectedly closed by remote task manager

Posted by "yinhua.dai" <yi...@outlook.com>.
Hi Qi,

I checked, and the JVM heap usage of the sink TM is low.

I tried reading the Flink source code to find exactly where the error
happens.
I think the exception is thrown inside DataSinkTask.invoke():

    // work!
    while (!this.taskCanceled && ((record = input.next()) != null)) {
        numRecordsIn.inc();
        format.writeRecord(record);
    }

The RemoteTransportException should be thrown from "input.next()", when the
InputGate tries to read data from upstream.
Is this really a problem with the sink TM?
I'm a little confused.
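
To make that reading concrete, here is a minimal, self-contained sketch of a loop with the same shape as the one quoted above. It is only an illustration, not Flink code: RecordInput, droppingInput, cleanInput and drain are hypothetical names, and a plain IOException stands in for RemoteTransportException. It assumes the upstream connection drops after the last record but before a clean end of input is signalled.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class SinkLoopSketch {

    /** Hypothetical stand-in for the sink's input reader; not a Flink class. */
    public interface RecordInput {
        /** Returns the next record, or null on a clean end of input. */
        String next() throws IOException;
    }

    /** Upstream that signals a clean end of input after its records. */
    public static RecordInput cleanInput(List<String> records) {
        Iterator<String> it = records.iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    /** Upstream whose connection drops before the end of input is signalled. */
    public static RecordInput droppingInput(List<String> records) {
        Iterator<String> it = records.iterator();
        return () -> {
            if (it.hasNext()) {
                return it.next();
            }
            // Assumption: IOException stands in for RemoteTransportException.
            throw new IOException("Connection unexpectedly closed by remote task manager");
        };
    }

    /** Read-then-write loop; reports how the loop ended. */
    public static String drain(RecordInput input, List<String> written) {
        try {
            String record;
            while ((record = input.next()) != null) {
                written.add(record); // plays the role of format.writeRecord(record)
            }
            return "FINISHED";
        } catch (IOException e) {
            // The failure comes from the read side, not from the write call.
            return "FAILED: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("a", "b", "c");

        List<String> writtenClean = new ArrayList<>();
        System.out.println(drain(cleanInput(records), writtenClean) + ", written=" + writtenClean.size());

        List<String> writtenDropped = new ArrayList<>();
        System.out.println(drain(droppingInput(records), writtenDropped) + ", written=" + writtenDropped.size());
    }
}
```

In the second case every record that was read is still written, yet the loop ends in a failure, because the exception surfaces from the read call. Whether this is what actually happens in my job is exactly what I am unsure about.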




Re: RemoteTransportException: Connection unexpectedly closed by remote task manager

Posted by qi luo <lu...@gmail.com>.
Hi Yinhua,

This looks like the TM executing the sink is down, maybe due to OOM or some other error. You can check the JVM heap and GC log to see if there are any clues.
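
As a rough illustration of what "check the JVM heap and GC" can mean from inside a JVM, here is a small sketch that uses only the standard java.lang.management API. The class name HeapSnapshot and the method describe are made up for this example, and it is not something Flink provides; in practice the TaskManager's own GC log is probably the more useful place to look.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

public class HeapSnapshot {

    /** Builds a one-line summary of current heap usage and GC activity. */
    public static String describe() {
        final long mb = 1024L * 1024L;
        MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

        StringBuilder sb = new StringBuilder();
        sb.append("heap used=").append(heap.getUsed() / mb).append("MB");
        sb.append(" committed=").append(heap.getCommitted() / mb).append("MB");
        // getMax() returns -1 when the maximum is undefined.
        sb.append(" max=").append(heap.getMax() < 0 ? "undefined" : (heap.getMax() / mb) + "MB");

        // One entry per collector, for example a young and an old generation collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            sb.append(" | gc ").append(gc.getName())
              .append(" count=").append(gc.getCollectionCount())
              .append(" timeMs=").append(gc.getCollectionTime());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If used heap stays close to max and the collection counts and times climb quickly, memory pressure is a plausible suspect; if heap usage is low, the cause is more likely elsewhere.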

Regards,
Qi
