Posted to issues@spark.apache.org by "Spark101 (Jira)" <ji...@apache.org> on 2021/10/02 18:59:00 UTC

[jira] [Commented] (SPARK-36912) Get Result time for task is taking a very long time and times out

    [ https://issues.apache.org/jira/browse/SPARK-36912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17423593#comment-17423593 ] 

Spark101 commented on SPARK-36912:
----------------------------------

Attached: Spark UI screenshots, environment details, and thread dumps for the executors where the task is stuck on Get Result.

> Get Result time for task is taking a very long time and times out
> ------------------------------------------------------------------
>
>                 Key: SPARK-36912
>                 URL: https://issues.apache.org/jira/browse/SPARK-36912
>             Project: Spark
>          Issue Type: Question
>          Components: Block Manager
>    Affects Versions: 3.0.3
>            Reporter: Spark101
>            Priority: Major
>         Attachments: Stage-result.pdf, Storage-result.pdf, environment.pdf, executors.pdf, thread-dump-exec3.pdf, threadDump-exc2.pdf
>
>
> We use Spark on Kubernetes to run batch jobs that analyze flows and produce insights. The flows are read from a time-series database. We have 3 executor instances with 5g memory each, plus a driver (5g memory). We observe the following warning, followed by timeout errors after which the job fails. We have been stuck on this for some time and are really hoping to get some help from this forum:
> 2021-10-02T16:07:09.459ZGMT  WARN dispatcher-CoarseGrainedScheduler TaskSetManager - Stage 52 contains a task of very large size (2842 KiB). The maximum recommended task size is 1000 KiB.
> 2021-10-02T16:08:19.151ZGMT ERROR task-result-getter-0 RetryingBlockFetcher - Exception while beginning fetch of 1 outstanding blocks 
> java.io.IOException: Failed to connect to /192.168.7.99:34259
> 	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
> 	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
> 	at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
> 	at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
> 	at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121)
> 	at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:143)
> 	at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:103)
> 	at org.apache.spark.storage.BlockManager.fetchRemoteManagedBuffer(BlockManager.scala:1010)
> 	at org.apache.spark.storage.BlockManager.$anonfun$getRemoteBlock$8(BlockManager.scala:954)
> 	at scala.Option.orElse(Option.scala:447)
> 	at org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:954)
> 	at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:1092)
> 	at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:88)
> 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
> 	at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: /192.168.7.99:34259
> Caused by: java.net.ConnectException: Connection timed out
> 	at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> 	at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:716)
> 	at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:330)
> 	at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:334)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:702)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:650)
> 	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:576)
> 	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:493)
> 	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
> 	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> 	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> 	at java.lang.Thread.run(Thread.java:748)
> 2021-10-02T16:08:19.151ZGMT ERROR task-result-getter-2 RetryingBlockFetcher - Exception while beginning fetch of 1 outstanding blocks 
> java.io.IOException: Failed to connect to /192.168.6.167:42405
> 	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:253)
> 	at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:195)
> 	at org.apache.spark.network.netty.NettyBlockTransferService$$anon$2.createAndStart(NettyBlockTransferService.scala:122)
> 	at org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:141)
> 	at org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:121)
> 	at org.apache.spark.network.netty.NettyBlockTransferService.fetchBlocks(NettyBlockTransferService.scala:143)
> 	at org.apache.spark.network.BlockTransferService.fetchBlockSync(BlockTransferService.scala:103)
> 	at org.apache.spark.storage.BlockManager.fetchRemoteManagedBuffer(BlockManager.scala:1010)
> 	at org.apache.spark.storage.BlockManager.$anonfun$getRemoteBlock$8(BlockManager.scala:954)
> 	at scala.Option.orElse(Option.scala:447)
> 	at org.apache.spark.storage.BlockManager.getRemoteBlock(BlockManager.scala:954)
> 	at org.apache.spark.storage.BlockManager.getRemoteBytes(BlockManager.scala:1092)
> 	at org.apache.spark.scheduler.TaskResultGetter$$anon$3.$anonfun$run$1(TaskResultGetter.scala:88)
> 	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
> 	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1934)
> 	at org.apache.spark.scheduler.TaskResultGetter$$anon$3.run(TaskResultGetter.scala:63)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
> Caused by: io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection timed out: /192.168.6.167:42
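
For readers hitting the same symptoms: the "very large task size" warning together with Get Result fetch timeouts suggests that task results exceed spark.task.maxDirectResultSize (1 MiB by default), so executors serve them as remote blocks that the driver must fetch over the block-transfer port — and that fetch is what fails here with "Failed to connect" to the executor IPs. A minimal sketch of configuration knobs to experiment with follows; the values and the job name are illustrative, not prescriptive, and whether they help depends on why the driver cannot reach the executor ports on the Kubernetes network:

```shell
# Illustrative spark-submit flags (values are examples, not recommendations):
# - spark.task.maxDirectResultSize: results below this size are sent inline
#   with the task status update instead of being fetched from the executor's
#   block manager, avoiding the failing remote fetch for moderate results.
# - spark.driver.maxResultSize: cap on total result bytes collected to the driver.
# - spark.network.timeout: default for Spark's various network timeouts.
# - spark.blockManager.port: pin the block-manager port so it can be opened
#   explicitly in the executor pods' network policy.
spark-submit \
  --conf spark.task.maxDirectResultSize=8m \
  --conf spark.driver.maxResultSize=2g \
  --conf spark.network.timeout=300s \
  --conf spark.blockManager.port=7079 \
  my-job.py
```

If the connect failures persist regardless of result size, the more likely root cause is pod-to-pod reachability (network policy or service mesh blocking the executors' ephemeral ports), which is worth ruling out before tuning Spark itself.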



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org