Posted to user@flink.apache.org by M Singh <ma...@yahoo.com> on 2020/01/29 07:36:45 UTC

Apache Flink Job fails repeatedly due to RemoteTransportException

Hi Folks:
We have a streaming Flink application (using v1.6.2) and it dies within 12 hours. We have configured the number of restarts, which is 10 at the moment.
Sometimes the job runs for some time, then goes through a number of restarts within a very short period and finally fails. In other instances, the restarts happen randomly, so I could not discern any pattern in them.
I can increase the restart count but would like to know if there is any advice on the root cause of this issue. I've seen some emails in the user groups but could not find any definitive solution or investigation steps.

Is there any advice on how to investigate it further or resolve it?
The exception we see in the job manager is:
2020-01-29 06:15:42,371 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job testJob (d65a52389f9ea30def1fe522bf3956c6) switched from state FAILING to FAILED.
org.apache.flink.runtime.io.network.netty.exception.RemoteTransportException: Connection unexpectedly closed by remote task manager 'ip-xx-xxx-xxx-xxx.ec2.internal/xx.xxx.xxx.xxx:39623'. This might indicate that the remote task manager was lost.
	at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.channelInactive(CreditBasedPartitionRequestClientHandler.java:136)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInputClosed(ByteToMessageDecoder.java:377)
	at org.apache.flink.shaded.netty4.io.netty.handler.codec.ByteToMessageDecoder.channelInactive(ByteToMessageDecoder.java:342)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireChannelInactive(AbstractChannelHandlerContext.java:224)
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline$HeadContext.channelInactive(DefaultChannelPipeline.java:1429)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:245)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeChannelInactive(AbstractChannelHandlerContext.java:231)
	at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireChannelInactive(DefaultChannelPipeline.java:947)
	at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannel$AbstractUnsafe$8.run(AbstractChannel.java:822)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:163)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:404)
	at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:463)
	at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:884)
	at java.lang.Thread.run(Thread.java:748)
2020-01-29 06:15:42,371 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Could not restart the job testJob (d65a52389f9ea30def1fe522bf3956c6) because the restart strategy prevented it.
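
One way to look for a pattern in the restarts is to pull the job state transitions out of the JobManager log and measure the time between them. The following is only a minimal sketch: it assumes the log layout of the excerpt above, and it assumes that each restart is logged as a transition into a state named RESTARTING (the excerpt itself only shows FAILING to FAILED). The function names are made up for illustration.

```python
import re
from datetime import datetime

# Matches lines shaped like the excerpt above, for example:
# 2020-01-29 06:15:42,371 INFO  <logger>  - Job testJob (<hex id>) switched from state FAILING to FAILED.
TRANSITION = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+\w+\s+\S+\s+-\s+"
    r"Job (?P<job>.+?) \((?P<id>[0-9a-f]+)\) "
    r"switched from state (?P<old>\w+) to (?P<new>\w+)\."
)


def job_transitions(lines):
    """Yield (timestamp, old_state, new_state) for each job state change."""
    for line in lines:
        m = TRANSITION.match(line)
        if m:
            ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f")
            yield ts, m.group("old"), m.group("new")


def restart_gaps(transitions):
    """Return the seconds between consecutive transitions into RESTARTING.

    Assumption: a restart shows up as a switch to the state RESTARTING.
    """
    restarts = [ts for ts, _, new in transitions if new == "RESTARTING"]
    return [(b - a).total_seconds() for a, b in zip(restarts, restarts[1:])]
```

Feeding the lines of the JobManager log to `job_transitions` and then to `restart_gaps` gives a list of intervals. Many short intervals close together would point to a burst of failures, while widely spread intervals would match the "random" case described above.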

Re: Apache Flink Job fails repeatedly due to RemoteTransportException

Posted by M Singh <ma...@yahoo.com>.

Thanks, I will try your recommendations. Apologies for the delayed response.
On Wednesday, January 29, 2020, 09:58:26 AM EST, Till Rohrmann <tr...@apache.org> wrote:

Hi M Singh,
Have you checked the TaskManager logs of ip-xx-xxx-xxx-xxx.ec2.internal/xx.xxx.xxx.xxx:39623 for any suspicious logging statements? This might help to uncover why another node thinks that this TaskManager is no longer reachable.
You could also check whether the same problem remains after upgrading to one of the latest Flink versions (1.9.1, for example).
Cheers,
Till

Re: Apache Flink Job fails repeatedly due to RemoteTransportException

Posted by Till Rohrmann <tr...@apache.org>.

Hi M Singh,

Have you checked the TaskManager logs
of ip-xx-xxx-xxx-xxx.ec2.internal/xx.xxx.xxx.xxx:39623 for any suspicious
logging statements? This might help to uncover why another node thinks that
this TaskManager is no longer reachable.
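
A minimal sketch of such a log check, assuming the TaskManager log uses the same line layout as the JobManager excerpt in this thread (timestamp, level, logger, " - ", message). Treating WARN and ERROR entries as "suspicious" is an assumption made for illustration, as are the function names.

```python
import re
from collections import Counter

# Assumed layout: "<date> <time>,<millis> <LEVEL>  <logger>  - <message>"
LOG_LINE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}\s+"
    r"(?P<level>[A-Z]+)\s+(?P<logger>\S+)\s+-\s(?P<msg>.*)$"
)


def suspicious_entries(lines, levels=("WARN", "ERROR")):
    """Return (level, logger, message) for each entry at one of the given levels."""
    found = []
    for line in lines:
        m = LOG_LINE.match(line)
        if m and m.group("level") in levels:
            found.append((m.group("level"), m.group("logger"), m.group("msg")))
    return found


def count_by_logger(entries):
    """Count suspicious entries per logger, most frequent first."""
    return Counter(logger for _, logger, _ in entries).most_common()
```

Running `suspicious_entries` over the lines of the affected TaskManager's log file and then `count_by_logger` over the result shows which components complained most before the connection was closed. Stack trace lines do not match the pattern and are skipped, so the matching entries still need to be read in context.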

You could also check whether the same problem remains after upgrading to
one of the latest Flink versions (1.9.1, for example).

Cheers,
Till
