Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2019/10/08 05:44:21 UTC

[jira] [Resolved] (SPARK-23981) ShuffleBlockFetcherIterator - Spamming Logs

     [ https://issues.apache.org/jira/browse/SPARK-23981?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hyukjin Kwon resolved SPARK-23981.
----------------------------------
    Resolution: Incomplete

> ShuffleBlockFetcherIterator - Spamming Logs
> -------------------------------------------
>
>                 Key: SPARK-23981
>                 URL: https://issues.apache.org/jira/browse/SPARK-23981
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 2.3.0
>            Reporter: David Mollitor
>            Priority: Major
>              Labels: bulk-closed
>
> If a remote host shuffle service fails, Spark Executors produce a huge amount of logging.
> {code:java}
> 2018-04-10 20:24:44,834 INFO  [Block Fetch Retry-1] shuffle.RetryingBlockFetcher (RetryingBlockFetcher.java:initiateRetry(163)) - Retrying fetch (3/3) for 1753 outstanding blocks after 5000 ms
> 2018-04-10 20:24:49,865 ERROR [Block Fetch Retry-1] storage.ShuffleBlockFetcherIterator (Logging.scala:logError(95)) - Failed to get block(s) from myhost.local:7337
> java.io.IOException: Failed to connect to myhost.local/10.11.12.13:7337
> 	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:216)
> 	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:167)
> 	at org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
> 	at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
> 	at org.apache.spark.network.shuffle.RetryingBlockFetcher.access$200(RetryingBlockFetcher.java:43)
> 	at org.apache.spark.network.shuffle.RetryingBlockFetcher$1.run(RetryingBlockFetcher.java:170)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: java.net.ConnectException: Connection refused: myhost.local/12.13.14.15:7337
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
> 	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)
> 	... 1 more
> {code}
>  
> We can see from the code that if a block fetch fails, the listener is notified once for each block. The error messages above show that 1753 blocks were being fetched; since the remote host has become unavailable, they all fail, and every single block produces its own alert.
>  
> {code:java|title=RetryingBlockFetcher.java}
>       if (shouldRetry(e)) {
>         initiateRetry();
>       } else {
>         for (String bid : blockIdsToFetch) {
>           listener.onBlockFetchFailure(bid, e);
>         }
>       }
> {code}
> {code:java|title=ShuffleBlockFetcherIterator.scala}
>     override def onBlockFetchFailure(blockId: String, e: Throwable): Unit = {
>       logError(s"Failed to get block(s) from ${req.address.host}:${req.address.port}", e)
>       results.put(new FailureFetchResult(BlockId(blockId), address, e))
>     }
> {code}
> So what we get is 1753 ERROR stack traces in the log, all printing the same message:
> {quote}Failed to get block(s) from myhost.local:7337
>  ...
> {quote}
> Perhaps it would be better if the {{onBlockFetchFailure}} method signature were changed to accept an entire collection of block IDs instead of being invoked once per block.
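
A batched callback along the lines proposed above might look like the sketch below. The interface and class names here are illustrative only, not the actual Spark API; the point is that one callback per failed request replaces one callback per failed block, so a 1753-block failure logs one ERROR entry instead of 1753:

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the proposed change: the listener receives all
// failed block IDs in a single call. Names are illustrative, not Spark's API.
public class BatchedFetchFailureSketch {

    interface BlockFetchListener {
        // Proposed batched signature: one callback for the whole failed batch.
        void onBlockFetchFailure(Collection<String> blockIds, Throwable cause);
    }

    static class LoggingListener implements BlockFetchListener {
        final AtomicInteger errorLogLines = new AtomicInteger();

        @Override
        public void onBlockFetchFailure(Collection<String> blockIds, Throwable cause) {
            // One log line (and one stack trace) summarizing every failed block.
            errorLogLines.incrementAndGet();
            System.err.println("Failed to fetch " + blockIds.size()
                + " block(s): " + cause.getMessage());
        }
    }

    public static void main(String[] args) {
        List<String> failed =
            Arrays.asList("shuffle_0_1_2", "shuffle_0_1_3", "shuffle_0_1_4");
        LoggingListener listener = new LoggingListener();
        // With the batched signature, 1753 failed blocks would produce a
        // single callback instead of 1753 separate ERROR entries.
        listener.onBlockFetchFailure(failed,
            new java.io.IOException("Connection refused"));
        System.out.println(listener.errorLogLines.get()); // prints 1
    }
}
```

The per-block signature would remain available for callers that need per-block granularity; only the failure path in RetryingBlockFetcher's else-branch would switch to the batched form.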



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org