Posted to issues@flink.apache.org by "Piotr Nowojski (Jira)" <ji...@apache.org> on 2019/12/13 17:42:00 UTC

[jira] [Comment Edited] (FLINK-15074) Connection timed out, Standalone cluster

    [ https://issues.apache.org/jira/browse/FLINK-15074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16995784#comment-16995784 ] 

Piotr Nowojski edited comment on FLINK-15074 at 12/13/19 5:41 PM:
------------------------------------------------------------------

Thanks [~gameking] for the bug report and sorry for the delay in responding!

I think it's unlikely that this is a bug in Flink itself. Also, this single taskmanager.log is unfortunately not very helpful, as the connection timeout is only a symptom of an issue somewhere else in the cluster. You would have to take a look at the task manager logs, stdout/stderr and system logs on the other machines to find the underlying issue. For some reason one of the other task managers either stopped responding at all, or stopped responding for some period of time. There are a couple of common issues that I would suggest ruling out first (more or less in that order):
 # check whether another machine has crashed/rebooted
 # check whether another Task Manager process has crashed. This includes exceptions that forced the Task Manager to shut down, JVM fatal errors (OOM), and potential segfaults (JVM bugs, issues in native libraries like RocksDB)
 # check whether another Task Manager was killed from the outside (system OOM killer, ...)
 # check for GC pauses (make sure you are using G1GC) - a stop-the-world GC pause can easily cause connection timeouts on other machines. Connect with a Java monitoring/profiling tool to check GC times, or print GC logs (see the config sketch after this list) - this is my personal bet, as increased parallelism could easily increase GC pressure
 # check whether the JVM is blocked for other reasons, like machine swapping or heavy disk IO (long blocking IO can block arbitrary Java threads, for example via logging). This might be difficult to diagnose, but you can start by logging something every X seconds and looking for suspicious gaps in the task manager logs (see the heartbeat sketch after this list)
 # make sure that your network is stable and not overloaded (run the {{ping}} command in parallel and dump the output to a separate file)
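For point 4, a minimal sketch of what enabling GC logging could look like in flink-conf.yaml on JDK 8 (the log path is just an example, adjust it to your setup):

{code}
# Use G1GC on the Flink JVMs and write GC activity to a log file (example path),
# so that long stop-the-world pauses become visible.
env.java.opts: "-XX:+UseG1GC -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/tmp/flink-gc.log"
{code}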
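For point 5, a minimal standalone sketch of such a periodic "heartbeat" logger (the class name and the 5 second period are just examples, this is not anything from Flink); unusually large gaps between consecutive heartbeats mean the whole JVM, or the machine, was stalled (GC, swapping, blocked IO):

{code:java}
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class HeartbeatLogger {
    public static void main(String[] args) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        long[] lastNanos = {System.nanoTime()};
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.nanoTime();
            long sincePreviousMs = TimeUnit.NANOSECONDS.toMillis(now - lastNanos[0]);
            lastNanos[0] = now;
            // With a 5 second period, values far above ~5000 ms mean this thread
            // was not scheduled on time (GC pause, swapping, blocked IO, ...).
            System.out.println(System.currentTimeMillis()
                    + " heartbeat, ms since previous: " + sincePreviousMs);
        }, 0, 5, TimeUnit.SECONDS);
    }
}
{code}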


> Connection timed out, Standalone cluster
> ----------------------------------------
>
>                 Key: FLINK-15074
>                 URL: https://issues.apache.org/jira/browse/FLINK-15074
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Network
>    Affects Versions: 1.9.1
>         Environment: flink version: 1.5.1, 1.9.1
> jdk version: 1.8.0_181
> Number of servers: 15
> Number of taskmanagers: 178
> Number of slots: 178
>            Reporter: gameking
>            Priority: Major
>         Attachments: flink-conf.yaml, jobmanager.log, taskmanager.log
>
>
> I am running a Flink streaming application on a standalone cluster.
> It works well when the job's parallelism is low, such as 96.
> But when I try to increase the job's parallelism to a high value, like 164 or more, the job fails within 10-15 minutes due to a connection timeout error.
> I have tried to solve this problem by increasing taskmanager configs such as 'taskmanager.network.netty.server.numThreads', 'taskmanager.network.netty.client.numThreads', 'taskmanager.network.request-backoff.max', 'akka.ask.timeout' and so on, but it doesn't work.
> I have also tried different Flink versions, such as 1.5.1 and 1.9.1, but that doesn't help either. 
> Does anyone know how to fix this problem? I have no idea now. It looks like a bug.
> I have uploaded my config and logs as attachments, and the error trace is below:
>  
> ------------------------------------------------------------------
> org.apache.flink.runtime.io.network.netty.exception.LocalTransportException: Connection timed out
>  at org.apache.flink.runtime.io.network.netty.CreditBasedPartitionRequestClientHandler.exceptionCaught(CreditBasedPartitionRequestClientHandler.java:172) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.ChannelInboundHandlerAdapter.exceptionCaught(ChannelInboundHandlerAdapter.java:131) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.ChannelHandlerAdapter.exceptionCaught(ChannelHandlerAdapter.java:79) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.invokeExceptionCaught(AbstractChannelHandlerContext.java:275) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.AbstractChannelHandlerContext.fireExceptionCaught(AbstractChannelHandlerContext.java:253) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.DefaultChannelPipeline.fireExceptionCaught(DefaultChannelPipeline.java:835) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.handleReadException(AbstractNioByteChannel.java:87) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:162) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at java.lang.Thread.run(Thread.java:748) [na:1.8.0_181]
> Caused by: java.io.IOException: Connection timed out
>  at sun.nio.ch.FileDispatcherImpl.read0(Native Method) ~[na:1.8.0_181]
>  at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39) ~[na:1.8.0_181]
>  at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223) ~[na:1.8.0_181]
>  at sun.nio.ch.IOUtil.read(IOUtil.java:192) ~[na:1.8.0_181]
>  at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:380) ~[na:1.8.0_181]
>  at org.apache.flink.shaded.netty4.io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:241) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  at org.apache.flink.shaded.netty4.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119) ~[flink-dist_2.11-1.5.1.jar:1.5.1]
>  ... 6 common frames omitted



--
This message was sent by Atlassian Jira
(v8.3.4#803005)