Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/05/23 02:12:05 UTC

[GitHub] [spark] sjrand edited a comment on issue #24645: [SPARK-27773][Shuffle] add metrics for number of exceptions caught in ExternalShuffleBlockHandler

URL: https://github.com/apache/spark/pull/24645#issuecomment-495040379
 
 
   On the client (executor) side we were seeing lots of timeouts, e.g.:
   
   ```
   ERROR [2019-05-16T18:34:57.782Z] org.apache.spark.storage.BlockManager: Failed to connect to external shuffle server, will retry 2 more times after waiting 5 seconds... 
   java.io.IOException: Failed to connect to <node_manager_hostname>/<ip>:7337
   	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:250)
   	at org.apache.spark.network.client.TransportClientFactory.createUnmanagedClient(TransportClientFactory.java:206)
   	at org.apache.spark.network.shuffle.ExternalShuffleClient.registerWithShuffleServer(ExternalShuffleClient.java:142)
   	at org.apache.spark.storage.BlockManager$$anonfun$registerWithExternalShuffleServer$1.apply$mcVI$sp(BlockManager.scala:300)
   	at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:160)
   	at org.apache.spark.storage.BlockManager.registerWithExternalShuffleServer(BlockManager.scala:297)
   	at org.apache.spark.storage.BlockManager.initialize(BlockManager.scala:271)
   	at org.apache.spark.executor.Executor.<init>(Executor.scala:121)
   	at org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$receive$1.applyOrElse(CoarseGrainedExecutorBackend.scala:92)
   	at org.apache.spark.rpc.netty.Inbox$$anonfun$process$1.apply$mcV$sp(Inbox.scala:117)
   	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:205)
   	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:101)
   	at org.apache.spark.rpc.netty.Dispatcher$MessageLoop.run(Dispatcher.scala:222)
   	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
   	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: <node_manager_hostname>/<ip>:7337
   	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
   	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
   	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
   	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
   	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
   	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
   	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
   	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
   	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
   	at java.lang.Thread.run(Thread.java:748)
   Caused by: java.net.ConnectException: Connection timed out
   	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
   	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
   	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:327)
   	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
   	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:632)
   	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:579)
   	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:496)
   	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
   	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:897)
   	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
   	at java.lang.Thread.run(Thread.java:748)
   ```
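
   For reference, the retry behavior in that log line ("will retry 2 more times after waiting 5 seconds") comes from the executor registering with the external shuffle service. Below is a minimal sketch of the related settings, assuming Spark 2.3+; the property names and values should be checked against the version actually deployed:

   ```scala
   import org.apache.spark.SparkConf

   // Sketch only: settings that govern how an executor registers with the
   // external shuffle service. Values here are illustrative, not recommendations.
   val conf = new SparkConf()
     .set("spark.shuffle.service.enabled", "true")
     // Timeout for each registration attempt against the shuffle service (ms).
     .set("spark.shuffle.registration.timeout", "5000")
     // Number of registration attempts before the executor gives up.
     .set("spark.shuffle.registration.maxAttempts", "3")
   ```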
   
   In the NodeManager logs we were seeing lots of `ClosedChannelException` errors from Netty, along with the occasional `java.io.IOException: Broken pipe` error. For example:
   
   ```
   2019-05-16 05:13:17,999 ERROR org.apache.spark.network.server.TransportRequestHandler: Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1647907385644, chunkIndex=22}, buffer=FileSegmentManagedBuffer{file=/scratch/hadoop/tmp/nm-local-dir/usercache/<user_name>/appcache/application_1557300039674_635976/blockmgr-0ec1d292-3e75-40bd-afd3-79314f427338/11/shuffle_5_3900_0.data, offset=12387017, length=1235}} to /<ip_addr>:35922; closing connection
   java.nio.channels.ClosedChannelException
   ```
   
   We confirmed that the `shuffle-server` threads were still alive in the NM and took thread dumps, but we weren't able to determine the root cause from them. In the end we restarted the NodeManagers, which fixed the problem.
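
   For context on what the proposed metric gives you on this path, here is a rough sketch of a Dropwizard-style counter that a shuffle handler could increment whenever it catches an exception while serving requests; the metric name and wiring are illustrative, not necessarily what this PR adds:

   ```scala
   import java.util.{HashMap => JHashMap, Map => JMap}
   import com.codahale.metrics.{Counter, Metric, MetricSet}

   // Illustrative only: a MetricSet exposing a single counter that the
   // shuffle service's error path could bump before logging/rethrowing.
   class ShuffleExceptionMetrics extends MetricSet {
     val caughtExceptions = new Counter()

     override def getMetrics: JMap[String, Metric] = {
       val all = new JHashMap[String, Metric]()
       all.put("caughtExceptions", caughtExceptions)
       all
     }
   }
   ```

   With something like that exported alongside the existing shuffle-service metrics, a spike in the counter during an incident like this one would at least confirm that the NodeManager-side handler is the component hitting errors.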
   
   I didn't file a JIRA for this because I don't think the information I have so far is enough to be actionable.
