Posted to issues@spark.apache.org by "Stephen Haberman (JIRA)" <ji...@apache.org> on 2015/01/06 04:39:34 UTC

[jira] [Closed] (SPARK-5085) netty shuffle service causing connection timeouts

     [ https://issues.apache.org/jira/browse/SPARK-5085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen Haberman closed SPARK-5085.
-----------------------------------
    Resolution: Invalid

> netty shuffle service causing connection timeouts
> -------------------------------------------------
>
>                 Key: SPARK-5085
>                 URL: https://issues.apache.org/jira/browse/SPARK-5085
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle
>    Affects Versions: 1.2.0
>         Environment: EMR, transient cluster of 10 m3.2xlarges, spark 1.2.0
> Here's our spark-defaults:
> {code}
> spark.master spark://$MASTER_IP:7077
> spark.eventLog.enabled true
> spark.eventLog.dir /mnt/spark/work/history
> spark.serializer org.apache.spark.serializer.KryoSerializer
> spark.executor.memory ${EXECUTOR_MEM}m
> spark.core.connection.ack.wait.timeout 600
> spark.storage.blockManagerSlaveTimeoutMs 60000
> spark.shuffle.consolidateFiles true
> spark.shuffle.service.enabled false
> spark.shuffle.blockTransferService nio # works with nio, fails with netty
> # Use snappy because LZF uses ~100-300k buffer per block
> spark.io.compression.codec org.apache.spark.io.SnappyCompressionCodec
> spark.shuffle.file.buffer.kb 10
> spark.executor.extraJavaOptions -XX:+PrintGCDetails -XX:+HeapDumpOnOutOfMemoryError -Xss2m -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRati...
> spark.akka.logLifecycleEvents true
> spark.akka.timeout 360
> spark.akka.askTimeout 120
> spark.akka.lookupTimeout 120
> spark.akka.frameSize 100
> spark.files.userClassPathFirst true
> spark.shuffle.memoryFraction 0.5
> spark.storage.memoryFraction 0.2
> {code}
>            Reporter: Stephen Haberman
>
> In Spark 1.2.0, the netty backend is causing the cluster running our report to lock up with connection timeouts about 75% of the way through the job.
> It happens with both the external shuffle service and the non-external/executor-hosted shuffle service, but if I switch spark.shuffle.blockTransferService from netty to nio, it immediately works.
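> For reference, here's a minimal sketch of flipping that same setting programmatically on SparkConf instead of in spark-defaults (the app name and master URL are placeholders, not our real values):
> {code}
> // Hypothetical sketch: force the nio block transfer service from code.
> // Same key/value as the spark-defaults line above; only the workaround
> // setting is real, the rest is placeholder boilerplate.
> import org.apache.spark.{SparkConf, SparkContext}
>
> val conf = new SparkConf()
>   .setAppName("report-job")                         // placeholder
>   .setMaster("spark://master-host:7077")            // placeholder
>   .set("spark.shuffle.blockTransferService", "nio") // nio works; netty times out for us
>
> val sc = new SparkContext(conf)
> {code}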
> Here's log output from one executor (I turned on trace output for the network package and ShuffleBlockFetcherIterator; all executors in the cluster show basically the same pattern of ~15 minutes of silence and then timeouts):
> {code}
> // lots of log output, doing fine...
> 15/01/03 05:33:39 TRACE [shuffle-server-0] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}}
> 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.169.175.179:57056 to fetch block StreamChunkId{streamId=1465867812750, chunkIndex=170}
> 15/01/03 05:33:39 TRACE [shuffle-server-0] server.OneForOneStreamManager (OneForOneStreamManager.java:getChunk(75)) - Removing stream id 1465867812750
> 15/01/03 05:33:39 TRACE [shuffle-server-0] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812750, chunkIndex=170}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/28/shuffle_4_1723_0.data, offset=4574685, length=20939}} to client /10.169.175.179:57056
> // note 15m of silence here...
> 15/01/03 05:48:13 WARN  [shuffle-server-7] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.33.166.218:42780
> java.io.IOException: Connection timed out
> 	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> 	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> 	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> 	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> 	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> 	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
> 	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
> 	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> 	at java.lang.Thread.run(Thread.java:745)
> 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812408, chunkIndex=52}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/2d/shuffle_4_520_0.data, offset=2214139, length=20607}} to /10.33.166.218:42780; closing connection
> java.nio.channels.ClosedChannelException
> 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812408, chunkIndex=53}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/10/shuffle_4_524_0.data, offset=2215548, length=23998}} to /10.33.166.218:42780; closing connection
> java.nio.channels.ClosedChannelException
> 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812408, chunkIndex=54}, buffer=FileSegmentManagedBuffer{file=/mnt/spark/local/spark-local-20150103040327-4f92/32/shuffle_4_532_0.data, offset=2248230, length=20580}} to /10.33.166.218:42780; closing connection
> java.nio.channels.ClosedChannelException
> // lots more of these...
> {code}
> Note how, up through 5:33, everything was fine; then, after ~15 minutes of silence, at 5:48 the shuffle-server connection times out and all of shuffle-server-7's requests fail.
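> (A gap like that is easy to find mechanically, if it helps anyone else eyeball their executor logs; a rough sketch below, where the log path and the 10-minute threshold are just illustrative:)
> {code}
> // Rough sketch: scan an executor log for long silences between lines.
> // Timestamps look like "15/01/03 05:33:39" at the start of each line.
> import java.text.SimpleDateFormat
> import scala.io.Source
>
> val fmt = new SimpleDateFormat("yy/MM/dd HH:mm:ss")
> val ts = """^\d{2}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}""".r
>
> val times = Source.fromFile("executor.log").getLines()   // path is a placeholder
>   .flatMap(line => ts.findFirstIn(line))
>   .map(s => fmt.parse(s).getTime)
>   .toSeq
>
> // Flag any silence longer than 10 minutes (arbitrary threshold).
> times.zip(times.drop(1)).foreach { case (a, b) =>
>   val gapMin = (b - a) / 60000.0
>   if (gapMin > 10)
>     println(f"$gapMin%.1f minute gap after ${fmt.format(new java.util.Date(a))}")
> }
> {code}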
> Here is shuffle-server-1 from the same stdout (with one last ClosedChannelException from shuffle-server-7):
> {code}
> 15/01/03 05:48:13 ERROR [shuffle-server-7] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result RpcResponse{requestId=6670015694248454083, response=[B@6653587b} to /10.33.166.218:42780; closing connection
> java.nio.channels.ClosedChannelException
> 15/01/03 05:48:45 WARN  [shuffle-server-1] server.TransportChannelHandler (TransportChannelHandler.java:exceptionCaught(66)) - Exception in connection from /10.13.165.77:41845
> java.io.IOException: Connection timed out
> 	at sun.nio.ch.FileDispatcherImpl.read0(Native Method)
> 	at sun.nio.ch.SocketDispatcher.read(SocketDispatcher.java:39)
> 	at sun.nio.ch.IOUtil.readIntoNativeBuffer(IOUtil.java:223)
> 	at sun.nio.ch.IOUtil.read(IOUtil.java:192)
> 	at sun.nio.ch.SocketChannelImpl.read(SocketChannelImpl.java:379)
> 	at io.netty.buffer.PooledUnsafeDirectByteBuf.setBytes(PooledUnsafeDirectByteBuf.java:311)
> 	at io.netty.buffer.AbstractByteBuf.writeBytes(AbstractByteBuf.java:881)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doReadBytes(NioSocketChannel.java:225)
> 	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:119)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:511)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
> 	at java.lang.Thread.run(Thread.java:745)
> 15/01/03 05:48:45 ERROR [shuffle-server-1] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812684, chunkIndex=50}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/20/shuffle_4_513_0.data, offset=4029264, length=19717}} to /10.13.165.77:41845; closing connection
> java.nio.channels.ClosedChannelException
> 15/01/03 05:48:45 ERROR [shuffle-server-1] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812684, chunkIndex=51}, buffer=FileSegmentManagedBuffer{file=/mnt/spark/local/spark-local-20150103040327-4f92/30/shuffle_4_514_0.data, offset=3976999, length=18459}} to /10.13.165.77:41845; closing connection
> java.nio.channels.ClosedChannelException
> 15/01/03 05:48:45 ERROR [shuffle-server-1] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(154)) - Error sending result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812684, chunkIndex=52}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/2d/shuffle_4_520_0.data, offset=4177037, length=18689}} to /10.13.165.77:41845; closing connection
> java.nio.channels.ClosedChannelException
> {code}
> The last line from shuffle-server-7 before it died was earlier, at 5:33:06:
> {code}
> 15/01/03 05:33:06 TRACE [shuffle-server-7] netty.NettyBlockRpcServer (Logging.scala:logTrace(67)) - Registered streamId 1465867812639 with 171 buffers
> {code}
> And here are roughly the last few lines from shuffle-server-1 before the timeout:
> {code}
> 15/01/03 05:33:25 TRACE [shuffle-server-1] netty.NettyBlockRpcServer (Logging.scala:logTrace(67)) - Registered streamId 1465867812719 with 230 buffers
> 15/01/03 05:33:25 TRACE [shuffle-server-3] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812717, chunkIndex=119}, buffer=FileSegmentManagedBuffer{file=/mnt1/spark/local/spark-local-20150103040327-c554/0e/shuffle_4_1192_0.data, offset=4788234, length=26721}} to client /10.181.127.41:60981
> 15/01/03 05:33:25 TRACE [shuffle-server-1] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message RpcRequest: RpcRequest{requestId=7443158706773552834, message=[B@73157c0d}
> 15/01/03 05:33:25 TRACE [shuffle-server-1] netty.NettyBlockRpcServer (Logging.scala:logTrace(67)) - Received request: OpenBlocks{appId=app-20150103040324-0000, execId=0, blockIds=[shuffle_4_16_222, shuffle_4_18_222, shuffle_4_39_222, shuffle_4_49_222, shuffle_4_50_222, shuffle_4_53_222, shuffle_4_64_222, shuffle_4_86_222, shuffle_4_87_222, shuffle_4_88_222, shuffle_4_99_222, shuffle_4_100_222, shuffle_4_126_222, shuffle_4_134_222, shuffle_4_153_222, shuffle_4_162_222, shuffle_4_163_222, shuffle_4_169_222, shuffle_4_171_222, shuffle_4_173_222, shuffle_4_177_222, shuffle_4_191_222, shuffle_4_200_222, shuffle_4_248_222, shuffle_4_266_222, shuffle_4_269_222, shuffle_4_280_222, shuffle_4_283_222, shuffle_4_286_222, shuffle_4_288_222, shuffle_4_289_222, shuffle_4_331_222, shuffle_4_341_222, shuffle_4_360_222, shuffle_4_362_222, shuffle_4_367_222, shuffle_4_369_222, shuffle_4_374_222, shuffle_4_376_222, shuffle_4_408_222, shuffle_4_418_222, shuffle_4_426_222, shuffle_4_428_222, shuffle_4_435_222, shuffle_4_436_222, shuffle_4_442_222, shuffle_4_443_222, shuffle_4_491_222, shuffle_4_497_222, shuffle_4_498_222, shuffle_4_513_222, shuffle_4_514_222, shuffle_4_520_222, shuffle_4_524_222, shuffle_4_532_222, shuffle_4_572_222, shuffle_4_573_222, shuffle_4_576_222, shuffle_4_594_222, shuffle_4_596_222, shuffle_4_604_222, shuffle_4_621_222, shuffle_4_627_222, shuffle_4_653_222, shuffle_4_654_222, shuffle_4_669_222, shuffle_4_678_222, shuffle_4_679_222, shuffle_4_685_222, shuffle_4_701_222, shuffle_4_708_222, shuffle_4_730_222, shuffle_4_734_222, shuffle_4_740_222, shuffle_4_763_222, shuffle_4_766_222, shuffle_4_769_222, shuffle_4_779_222, shuffle_4_784_222, shuffle_4_794_222, shuffle_4_816_222, shuffle_4_823_222, shuffle_4_847_222, shuffle_4_851_222, shuffle_4_856_222, shuffle_4_859_222, shuffle_4_860_222, shuffle_4_877_222, shuffle_4_904_222, shuffle_4_911_222, shuffle_4_917_222, shuffle_4_926_222, shuffle_4_929_222, shuffle_4_930_222, shuffle_4_933_222, shuffle_4_953_222, shuffle_4_956_222, shuffle_4_965_222, shuffle_4_969_222, shuffle_4_1000_222, shuffle_4_1001_222, shuffle_4_1003_222, shuffle_4_1005_222, shuffle_4_1029_222, shuffle_4_1030_222, shuffle_4_1032_222, shuffle_4_1070_222, shuffle_4_1083_222, shuffle_4_1091_222, shuffle_4_1093_222, shuffle_4_1094_222, shuffle_4_1110_222, shuffle_4_1120_222, shuffle_4_1121_222, shuffle_4_1141_222, shuffle_4_1157_222, shuffle_4_1170_222, shuffle_4_1172_222, shuffle_4_1178_222, shuffle_4_1192_222, shuffle_4_1201_222, shuffle_4_1205_222, shuffle_4_1229_222, shuffle_4_1235_222, shuffle_4_1249_222, shuffle_4_1250_222, shuffle_4_1256_222, shuffle_4_1286_222, shuffle_4_1297_222, shuffle_4_1299_222, shuffle_4_1313_222, shuffle_4_1318_222, shuffle_4_1328_222, shuffle_4_1331_222, shuffle_4_1334_222, shuffle_4_1371_222, shuffle_4_1381_222, shuffle_4_1387_222, shuffle_4_1397_222, shuffle_4_1400_222, shuffle_4_1408_222, shuffle_4_1409_222, shuffle_4_1413_222, shuffle_4_1456_222, shuffle_4_1458_222, shuffle_4_1470_222, shuffle_4_1478_222, shuffle_4_1480_222, shuffle_4_1486_222, shuffle_4_1490_222, shuffle_4_1493_222, shuffle_4_1544_222, shuffle_4_1545_222, shuffle_4_1552_222, shuffle_4_1560_222, shuffle_4_1565_222, shuffle_4_1569_222, shuffle_4_1570_222, shuffle_4_1577_222, shuffle_4_1620_222, shuffle_4_1623_222, shuffle_4_1633_222, shuffle_4_1640_222, shuffle_4_1651_222, shuffle_4_1652_222, shuffle_4_1656_222, shuffle_4_1665_222, shuffle_4_1710_222, shuffle_4_1717_222, shuffle_4_1718_222, shuffle_4_1723_222]}
> 15/01/03 05:33:25 TRACE [shuffle-server-3] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812717, chunkIndex=120}}
> 15/01/03 05:33:25 TRACE [shuffle-server-3] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.181.127.41:60981 to fetch block StreamChunkId{streamId=1465867812717, chunkIndex=120}
> 15/01/03 05:33:25 TRACE [shuffle-server-3] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812717, chunkIndex=120}, buffer=FileSegmentManagedBuffer{file=/mnt/spark/local/spark-local-20150103040327-4f92/16/shuffle_4_1201_0.data, offset=4406836, length=18644}} to client /10.181.127.41:60981
> 15/01/03 05:33:25 TRACE [shuffle-server-3] protocol.MessageDecoder (MessageDecoder.java:decode(42)) - Received message ChunkFetchRequest: ChunkFetchRequest{streamChunkId=StreamChunkId{streamId=1465867812717, chunkIndex=121}}
> 15/01/03 05:33:25 TRACE [shuffle-server-3] server.TransportRequestHandler (TransportRequestHandler.java:processFetchRequest(107)) - Received req from /10.181.127.41:60981 to fetch block StreamChunkId{streamId=1465867812717, chunkIndex=121}
> 15/01/03 05:33:25 TRACE [shuffle-server-3] server.TransportRequestHandler (TransportRequestHandler.java:operationComplete(152)) - Sent result ChunkFetchSuccess{streamChunkId=StreamChunkId{streamId=1465867812717, chunkIndex=121}, buffer=FileSegmentManagedBuffer{file=/mnt/spark/local/spark-local-20150103040327-4f92/28/shuffle_4_1205_0.data, offset=4285664, length=18251}} to client /10.181.127.41:60981
> 15/01/03 05:33:25 TRACE [shuffle-server-1] netty.NettyBlockRpcServer (Logging.scala:logTrace(67)) - Registered streamId 1465867812720 with 171 buffers
> {code}
> What's odd is that the entire cluster seems to hit these connection timeouts all at once. Running dstat during this period shows nothing happening: no CPU, disk, or network usage.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
