Posted to issues@spark.apache.org by "geoff langenderfer (Jira)" <ji...@apache.org> on 2022/11/10 21:39:00 UTC

[jira] [Commented] (SPARK-36964) Reuse CachedDNSToSwitchMapping for yarn container requests

    [ https://issues.apache.org/jira/browse/SPARK-36964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17631934#comment-17631934 ] 

geoff langenderfer commented on SPARK-36964:
--------------------------------------------

Environment:

{code:bash}

Release label: emr-6.8.0
Hadoop distribution: Amazon 3.2.1
Applications: Spark 3.3.0

{code}

Here is another stack trace:

{code:bash}

22/11/10 19:05:20 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.0
22/11/10 19:15:12 ERROR AsyncEventQueue: Dropping event from queue executorManagement. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
22/11/10 19:15:20 ERROR AsyncEventQueue: Dropping event from queue eventLog. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
22/11/10 19:16:20 ERROR TransportChannelHandler: Connection to /10.0.0.107:47700 has been quiet for 120000 ms while there are outstanding requests. Assuming connection is dead; please adjust spark.rpc.io.connectionTimeout if this is wrong.
22/11/10 19:16:20 ERROR TransportResponseHandler: Still have 1 requests outstanding when connection from /10.0.0.107:47700 is closed
22/11/10 19:16:20 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(Map(Profile: id = 0, executor resources: cores -> name: cores, amount: 32, script: , vendor: ,memory -> name: memory, amount: 218880, script: , vendor: ,offHeap -> name: offHeap, amount: 0, script: , vendor: , task resources: cpus -> name: cpus, amount: 1.0 -> 6721),Map(0 -> 212218),Map(0 -> Map(* -> 212218)),Set()) to AM was unsuccessful
org.apache.spark.rpc.RpcTimeoutException: Cannot receive any reply from /10.0.0.107:47700 in 120 seconds. This timeout is controlled by spark.rpc.askTimeout
    at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38) ~[scala-library-2.12.15.jar:?]
    at scala.util.Failure.recover(Try.scala:234) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.Future.$anonfun$recover$1(Future.scala:395) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[scala-library-2.12.15.jar:?]
    at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.Promise.complete(Promise.scala:53) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.Promise.complete$(Promise.scala:52) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise$DefaultPromise.complete(Promise.scala:187) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.BatchingExecutor$Batch.processBatch$1(BatchingExecutor.scala:67) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.BatchingExecutor$Batch.$anonfun$run$1(BatchingExecutor.scala:82) ~[scala-library-2.12.15.jar:?]
    at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.BatchingExecutor$Batch.run(BatchingExecutor.scala:59) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.Future$InternalCallbackExecutor$.unbatchedExecute(Future.scala:875) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.BatchingExecutor.execute(BatchingExecutor.scala:110) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.BatchingExecutor.execute$(BatchingExecutor.scala:107) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.Promise.tryFailure(Promise.scala:112) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.Promise.tryFailure$(Promise.scala:112) ~[scala-library-2.12.15.jar:?]
    at scala.concurrent.impl.Promise$DefaultPromise.tryFailure(Promise.scala:187) ~[scala-library-2.12.15.jar:?]
    at org.apache.spark.rpc.netty.NettyRpcEnv.org$apache$spark$rpc$netty$NettyRpcEnv$$onFailure$1(NettyRpcEnv.scala:214) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:264) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515) ~[?:?]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.util.concurrent.TimeoutException: Cannot receive any reply from /10.0.0.107:47700 in 120 seconds
    at org.apache.spark.rpc.netty.NettyRpcEnv$$anon$1.run(NettyRpcEnv.scala:265) ~[spark-core_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    ... 6 more
22/11/10 19:16:20 ERROR TransportClient: Failed to send RPC RPC 6522570887499332644 to /10.0.0.107:47700: io.netty.channel.StacklessClosedChannelException
io.netty.channel.StacklessClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
22/11/10 19:16:20 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(Map(Profile: id = 0, executor resources: cores -> name: cores, amount: 32, script: , vendor: ,memory -> name: memory, amount: 218880, script: , vendor: ,offHeap -> name: offHeap, amount: 0, script: , vendor: , task resources: cpus -> name: cpus, amount: 1.0 -> 6634),Map(0 -> 212218),Map(0 -> Map(* -> 212218)),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC RPC 6522570887499332644 to /10.0.0.107:47700: io.netty.channel.StacklessClosedChannelException
    at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:392) ~[spark-network-common_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:369) ~[spark-network-common_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: io.netty.channel.StacklessClosedChannelException
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
22/11/10 19:16:20 ERROR TransportClient: Failed to send RPC RPC 4915998445035541588 to /10.0.0.107:47700: io.netty.channel.StacklessClosedChannelException
io.netty.channel.StacklessClosedChannelException: null
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(Object, ChannelPromise)(Unknown Source) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
22/11/10 19:16:20 ERROR YarnSchedulerBackend$YarnSchedulerEndpoint: Sending RequestExecutors(Map(Profile: id = 0, executor resources: cores -> name: cores, amount: 32, script: , vendor: ,memory -> name: memory, amount: 218880, script: , vendor: ,offHeap -> name: offHeap, amount: 0, script: , vendor: , task resources: cpus -> name: cpus, amount: 1.0 -> 6568),Map(0 -> 212218),Map(0 -> Map(* -> 212218)),Set()) to AM was unsuccessful
java.io.IOException: Failed to send RPC RPC 4915998445035541588 to /10.0.0.107:47700: io.netty.channel.StacklessClosedChannelException
    at org.apache.spark.network.client.TransportClient$RpcChannelListener.handleFailure(TransportClient.java:392) ~[spark-network-common_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    at org.apache.spark.network.client.TransportClient$StdChannelListener.operationComplete(TransportClient.java:369) ~[spark-network-common_2.12-3.3.0-amzn-0.jar:3.3.0-amzn-0]
    at io.netty.util.concurrent.DefaultPromise.notifyListener0(DefaultPromise.java:578) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListenersNow(DefaultPromise.java:552) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.DefaultPromise.notifyListeners(DefaultPromise.java:491) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.DefaultPromise.setValue0(DefaultPromise.java:616) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.DefaultPromise.setFailure0(DefaultPromise.java:609) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.DefaultPromise.tryFailure(DefaultPromise.java:117) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.AbstractChannel$AbstractUnsafe.safeSetFailure(AbstractChannel.java:999) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.AbstractChannel$AbstractUnsafe.write(AbstractChannel.java:860) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.DefaultChannelPipeline$HeadContext.write(DefaultChannelPipeline.java:1367) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.AbstractChannelHandlerContext$WriteTask.run(AbstractChannelHandlerContext.java:1071) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:469) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:503) ~[netty-transport-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:986) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) ~[netty-common-4.1.74.Final.jar:4.1.74.Final]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: io.netty.channel.StacklessClosedChannelException

{code}
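
For anyone hitting the same symptom: while the AM is stuck in the topology script, the two timeouts named in the trace can be raised as a stopgap. This only buys headroom and does not address the underlying blocking; the keys below are standard Spark settings, but the values are purely illustrative.

{code:scala}
import org.apache.spark.SparkConf

// Stopgap sketch only: give the blocked RequestExecutors RPC more time and a
// larger listener-bus queue so fewer executorManagement events are dropped.
// The values are illustrative, not tuned recommendations.
val conf = new SparkConf()
  .set("spark.rpc.askTimeout", "600s")                             // timeout hit in the trace above
  .set("spark.rpc.io.connectionTimeout", "600s")                   // suggested by TransportChannelHandler
  .set("spark.scheduler.listenerbus.eventqueue.capacity", "20000") // default is 10000
{code}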


> Reuse CachedDNSToSwitchMapping for yarn  container requests
> -----------------------------------------------------------
>
>                 Key: SPARK-36964
>                 URL: https://issues.apache.org/jira/browse/SPARK-36964
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, YARN
>    Affects Versions: 3.0.3, 3.1.2, 3.2.0, 3.3.0
>            Reporter: gaoyajun02
>            Priority: Major
>
> Similar to SPARK-13704, adding container requests with locality preferences in YarnAllocator can be expensive in some cases, because it may invoke the topology script for rack awareness.
> When a very large job is submitted on a very large YARN cluster, the topology script may take significant time to run. This blocks the handling of YarnSchedulerBackend's RequestExecutors RPC calls, which come from the spark-dynamic-executor-allocation thread; that in turn can block the ExecutorAllocationListener and lead to a backlog on the executorManagement queue (a minimal sketch of the cached-mapping idea appears after this quoted block).
>  
> Some logs:
> {code:java}
> 21/09/29 12:04:35 INFO spark-dynamic-executor-allocation ExecutorAllocationManager: Error reaching cluster manager.
> 21/09/29 12:04:35 INFO spark-dynamic-executor-allocation ExecutorAllocationManager: Error reaching cluster manager.
> org.apache.spark.rpc.RpcTimeoutException: Futures timed out after [120 seconds]. This timeout is controlled by spark.rpc.askTimeout
>     at org.apache.spark.rpc.RpcTimeout.org$apache$spark$rpc$RpcTimeout$$createRpcTimeoutException(RpcTimeout.scala:47)
>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:62)
>     at org.apache.spark.rpc.RpcTimeout$$anonfun$addMessageIfTimeout$1.applyOrElse(RpcTimeout.scala:58)
>     at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:76)
>     at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:839)
>     at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:411)
>     at org.apache.spark.ExecutorAllocationManager.updateAndSyncNumExecutorsTarget(ExecutorAllocationManager.scala:361)
>     at org.apache.spark.ExecutorAllocationManager.org$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:316)
>     at org.apache.spark.ExecutorAllocationManager$$anon$1.run(ExecutorAllocationManager.scala:227)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.util.concurrent.TimeoutException: Futures timed out after [120 seconds]
>     at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:259)
>     at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:263)
>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:294)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     ... 12 more
> 21/09/29 12:04:35 WARN spark-dynamic-executor-allocation ExecutorAllocationManager: Unable to reach the cluster manager to request 1922 total executors!
> 21/09/29 12:02:49 ERROR dag-scheduler-event-loop AsyncEventQueue: Dropping event from queue executorManagement. This likely means one of the listeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
> 21/09/29 12:02:49 WARN dag-scheduler-event-loop AsyncEventQueue: Dropped 1 events from executorManagement since the application started.
> 21/09/29 12:02:55 INFO spark-listener-group-eventLog AsyncEventQueue: Process of event SparkListenerExecutorAdded(1632888172920,543,org.apache.spark.scheduler.cluster.ExecutorData@8cfab8f5,None) by listener EventLoggingListener took 3.037686034s.
> 21/09/29 12:03:03 INFO spark-listener-group-eventLog AsyncEventQueue: Process of event SparkListenerBlockManagerAdded(1632888181779,BlockManagerId(1359, --, 57233, None),2704696934,Some(2704696934),Some(0)) by listener EventLoggingListener took 1.462598355s.
> 21/09/29 12:03:49 WARN dispatcher-BlockManagerMaster AsyncEventQueue: Dropped 74388 events from executorManagement since Wed Sep 29 12:02:49 CST 2021.
> 21/09/29 12:04:35 INFO spark-listener-group-executorManagement AsyncEventQueue: Process of event SparkListenerStageSubmitted(org.apache.spark.scheduler.StageInfo@52f810ad,{...}) by listener ExecutorAllocationListener took 116.526408932s.
> 21/09/29 12:04:49 WARN heartbeat-receiver-event-loop-thread AsyncEventQueue: Dropped 18892 events from executorManagement since Wed Sep 29 12:03:49 CST 2021.
> 21/09/29 12:05:49 WARN dispatcher-BlockManagerMaster AsyncEventQueue: Dropped 19397 events from executorManagement since Wed Sep 29 12:04:49 CST 2021.
> {code}
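
To make the quoted proposal concrete, here is a minimal sketch of reusing one cached, script-backed rack mapping instead of re-running the topology script for every container request. It relies on Hadoop's org.apache.hadoop.net.ScriptBasedMapping, which extends CachedDNSToSwitchMapping and therefore memoizes results; the CachedRackResolver object and its resolve method are hypothetical names for illustration, not the actual Spark patch.

{code:scala}
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.net.{DNSToSwitchMapping, ScriptBasedMapping}

// Hypothetical helper: build the script-backed mapping once and reuse it, so
// the script configured via net.topology.script.file.name is not re-executed
// for hosts whose racks were already resolved.
object CachedRackResolver {
  private val mapping: DNSToSwitchMapping = {
    val m = new ScriptBasedMapping() // extends CachedDNSToSwitchMapping, so results are cached
    m.setConf(new Configuration())
    m
  }

  /** Resolve racks for a batch of hosts; previously seen hosts are served from the cache. */
  def resolve(hosts: Seq[String]): Seq[String] =
    mapping.resolve(hosts.asJava).asScala.toList
}
{code}

Reusing a single instance like this across YarnAllocator's locality-aware container requests is the gist of the improvement requested here.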



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org