Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/06/25 14:32:52 UTC

[GitHub] [spark] pan3793 opened a new pull request, #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

pan3793 opened a new pull request, #36991:
URL: https://github.com/apache/spark/pull/36991

   
   ### What changes were proposed in this pull request?
   Treat container `AllocationFailure` as not "exitCausedByApp" when the driver is shutting down.
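
   As a rough sketch, the proposed condition in `YarnAllocator.scala` looks like the following (surrounding handler code is elided here; the actual diff appears in the review thread below):

   ```scala
   // Sketch of the proposed condition (context elided; see the review diff below).
   // A completed container's failure is not charged to the application when the exit
   // status is a known node/system fault, or when the driver's SparkContext is absent
   // or already stopped.
   if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status) ||
       SparkContext.getActive.forall(_.isStopped)) {
     // existing "not exitCausedByApp" handling
   } else {
     // existing "exitCausedByApp" handling
   }
   ```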
   
   ### Why are the changes needed?
   
   I observed some Spark applications failing during the shutdown phase with the reason "Max number of executor failures (16) reached". The timeline is:
   
   Driver - The job succeeds and Spark starts the shutdown process.
   ```
   2022-06-23 19:50:55 CST AbstractConnector INFO - Stopped Spark@74e9431b{HTTP/1.1, (http/1.1)}{0.0.0.0:0}
   2022-06-23 19:50:55 CST SparkUI INFO - Stopped Spark web UI at http://hadoop2627.jd.163.org:28446
   2022-06-23 19:50:55 CST YarnClusterSchedulerBackend INFO - Shutting down all executors
   ```
   
   Driver - A container is allocated successfully during the shutdown phase.
   ```
   2022-06-23 19:52:21 CST YarnAllocator INFO - Launching container container_e94_1649986670278_7743380_02_000025 on host hadoop4388.jd.163.org for executor with ID 24 for ResourceProfile Id 0
   ```
   
   Executor - The executor cannot connect to the driver endpoint because the driver has already stopped it.
   ```
   Exception in thread "main" java.lang.reflect.UndeclaredThrowableException
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1911)
     at org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:61)
     at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:393)
     at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend$.main(YarnCoarseGrainedExecutorBackend.scala:81)
     at org.apache.spark.executor.YarnCoarseGrainedExecutorBackend.main(YarnCoarseGrainedExecutorBackend.scala)
   Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult: 
     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:301)
     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
     at org.apache.spark.rpc.RpcEnv.setupEndpointRefByURI(RpcEnv.scala:101)
     at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$9(CoarseGrainedExecutorBackend.scala:413)
     at scala.runtime.java8.JFunction1$mcVI$sp.apply(JFunction1$mcVI$sp.java:23)
     at scala.collection.TraversableLike$WithFilter.$anonfun$foreach$1(TraversableLike.scala:877)
     at scala.collection.immutable.Range.foreach(Range.scala:158)
     at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:876)
     at org.apache.spark.executor.CoarseGrainedExecutorBackend$.$anonfun$run$7(CoarseGrainedExecutorBackend.scala:411)
     at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:62)
     at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:61)
     at java.security.AccessController.doPrivileged(Native Method)
     at javax.security.auth.Subject.doAs(Subject.java:422)
     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1893)
     ... 4 more
   Caused by: org.apache.spark.rpc.RpcEndpointNotFoundException: Cannot find endpoint: spark://CoarseGrainedScheduler@hadoop2627.jd.163.org:21956
     at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1(NettyRpcEnv.scala:148)
     at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$asyncSetupEndpointRefByURI$1$adapted(NettyRpcEnv.scala:144)
     at scala.concurrent.Future.$anonfun$flatMap$1(Future.scala:307)
     at scala.concurrent.impl.Promise.$anonfun$transformWith$1(Promise.scala:41)
     at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
     at org.apache.spark.util.ThreadUtils$$anon$1.execute(ThreadUtils.scala:99)
     at scala.concurrent.impl.ExecutionContextImpl$$anon$4.execute(ExecutionContextImpl.scala:138)
     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
     at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
     at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
   ```
   
   Driver - YarnAllocator receives the container launch error message and treats it as `exitCausedByApp`.
   ```
   2022-06-23 19:52:27 CST YarnAllocator INFO - Completed container container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.jd.163.org (state: COMPLETE, exit status: 1)
   2022-06-23 19:52:27 CST YarnAllocator WARN - Container from a bad node: container_e94_1649986670278_7743380_02_000025 on host: hadoop4388.jd.163.org. Exit status: 1. Diagnostics: [2022-06-23 19:52:24.932]Exception from container-launch.
   Container id: container_e94_1649986670278_7743380_02_000025
   Exit code: 1
   Shell output: main : command provided 1
   main : run as user is bdms_pm
   main : requested yarn user is bdms_pm
   Getting exit code file...
   Creating script paths...
   Writing pid file...
   Writing to tmp file /mnt/dfs/2/yarn/local/nmPrivate/application_1649986670278_7743380/container_e94_1649986670278_7743380_02_000025/container_e94_1649986670278_7743380_02_000025.pid.tmp
   Writing to cgroup task files...
   Creating local dirs...
   Launching container...
   Getting exit code file...
   Creating script paths...
   
   
   [2022-06-23 19:52:24.938]Container exited with a non-zero exit code 1. Error file: prelaunch.err.
   Last 4096 bytes of prelaunch.err :
   Last 4096 bytes of stderr :
   at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:873)
     at scala.concurrent.impl.CallbackRunnable.executeWithValue(Promise.scala:72)
     at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1(Promise.scala:288)
     at scala.concurrent.impl.Promise$DefaultPromise.$anonfun$tryComplete$1$adapted(Promise.scala:288)
     at scala.concurrent.impl.Promise$DefaultPromise.tryComplete(Promise.scala:288)
     at scala.concurrent.Promise.trySuccess(Promise.scala:94)
     at scala.concurrent.Promise.trySuccess$(Promise.scala:94)
     at scala.concurrent.impl.Promise$DefaultPromise.trySuccess(Promise.scala:187)
     at org.apache.spark.rpc.netty.NettyRpcEnv.onSuccess$1(NettyRpcEnv.scala:225)
     at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7(NettyRpcEnv.scala:246)
     at org.apache.spark.rpc.netty.NettyRpcEnv.$anonfun$askAbortable$7$adapted(NettyRpcEnv.scala:246)
     at org.apache.spark.rpc.netty.RpcOutboxMessage.onSuccess(Outbox.scala:90)
     at org.apache.spark.network.client.TransportResponseHandler.handle(TransportResponseHandler.java:195)
     at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:142)
     at org.apache.spark.network.server.TransportChannelHandler.channelRead0(TransportChannelHandler.java:53)
     at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:99)
     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
     at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
   ```
   
   Driver - Eventually the application fails because the number of "failed" executors reached the threshold.
   ```
   2022-06-23 19:52:30 CST ApplicationMaster INFO - Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (16) reached)
   ```
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Updated unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] pan3793 commented on a diff in pull request #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

Posted by GitBox <gi...@apache.org>.
pan3793 commented on code in PR #36991:
URL: https://github.com/apache/spark/pull/36991#discussion_r915991386


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala:
##########
@@ -859,7 +859,8 @@ private[yarn] class YarnAllocator(
             // .com/apache/hadoop/blob/228156cfd1b474988bc4fedfbf7edddc87db41e3/had
             // oop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/ap
             // ache/hadoop/yarn/util/Apps.java#L273 for details)
-            if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status)) {
+            if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status) ||
+                SparkContext.getActive.forall(_.isStopped)) {

Review Comment:
   Thanks for your suggestion; let me try.





[GitHub] [spark] github-actions[bot] commented on pull request #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on PR #36991:
URL: https://github.com/apache/spark/pull/36991#issuecomment-1279859094

   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] github-actions[bot] closed pull request #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down
URL: https://github.com/apache/spark/pull/36991




[GitHub] [spark] tgravescs commented on a diff in pull request #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

Posted by GitBox <gi...@apache.org>.
tgravescs commented on code in PR #36991:
URL: https://github.com/apache/spark/pull/36991#discussion_r915948343


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala:
##########
@@ -859,7 +859,8 @@ private[yarn] class YarnAllocator(
             // .com/apache/hadoop/blob/228156cfd1b474988bc4fedfbf7edddc87db41e3/had
             // oop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/ap
             // ache/hadoop/yarn/util/Apps.java#L273 for details)
-            if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status)) {
+            if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status) ||
+                SparkContext.getActive.forall(_.isStopped)) {

Review Comment:
   This isn't going to work in cluster mode on YARN, where the ApplicationMaster and YarnAllocator are not in the same process as the SparkContext. I assume in that case `getActive` returns `None`, so we would take this branch all the time when we really shouldn't.
   
   Can we tell the allocator that we are shutting down when the ApplicationMaster is told to shut down, and do a similar check to prevent this?
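
   A hypothetical sketch of that alternative (the member names are illustrative only, not existing Spark APIs; note that `Option.forall` returns true for `None`, which is why the `getActive` check would fire on a cluster-mode AM with no local SparkContext):

   ```scala
   // Hypothetical sketch only; names are illustrative, not existing Spark APIs.

   // In YarnAllocator: an explicit flag set by the ApplicationMaster during shutdown.
   @volatile private var shutdown = false
   private[yarn] def setShutdown(value: Boolean): Unit = shutdown = value

   // In ApplicationMaster, where the unregister/shutdown path begins:
   //   allocator.setShutdown(true)

   // The completed-container check would then consult the flag instead of SparkContext:
   //   if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status) || shutdown) {
   //     // not exitCausedByApp
   //   }
   ```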





[GitHub] [spark] pan3793 commented on pull request #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

Posted by GitBox <gi...@apache.org>.
pan3793 commented on PR #36991:
URL: https://github.com/apache/spark/pull/36991#issuecomment-1168112722

   @tgravescs would you please take a look? thx.




[GitHub] [spark] pan3793 commented on a diff in pull request #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

Posted by GitBox <gi...@apache.org>.
pan3793 commented on code in PR #36991:
URL: https://github.com/apache/spark/pull/36991#discussion_r914354354


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala:
##########
@@ -859,7 +859,8 @@ private[yarn] class YarnAllocator(
             // .com/apache/hadoop/blob/228156cfd1b474988bc4fedfbf7edddc87db41e3/had
             // oop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/ap
             // ache/hadoop/yarn/util/Apps.java#L273 for details)
-            if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status)) {
+            if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status) ||
+                SparkContext.getActive.forall(_.isStopped)) {

Review Comment:
   The SparkContext sets `stopped` to true at the very beginning of the stop procedure; do you have other suggestions for checking whether the SparkContext is stopped?
   
   https://github.com/apache/spark/blob/c1d1ec5f3bde0b46bc34a161935ebfc2c41ec153/core/src/main/scala/org/apache/spark/SparkContext.scala#L2056-L2065
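
   For reference, a paraphrased sketch of the pattern at those lines (an approximation, not a verbatim copy of Spark's code): the flag flips before any teardown work runs, so `isStopped` already reports true while executors are still being shut down.

   ```scala
   // Paraphrased sketch of the referenced SparkContext pattern (assumption, not verbatim).
   private val stopped = new java.util.concurrent.atomic.AtomicBoolean(false)

   def stop(): Unit = {
     if (!stopped.compareAndSet(false, true)) {
       return // stop() already called; nothing more to do
     }
     // ... the actual teardown (UI, schedulers, executors) runs only after the flag is set
   }

   def isStopped: Boolean = stopped.get()
   ```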





[GitHub] [spark] AmplabJenkins commented on pull request #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #36991:
URL: https://github.com/apache/spark/pull/36991#issuecomment-1166364025

   Can one of the admins verify this patch?




[GitHub] [spark] pan3793 commented on pull request #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

Posted by GitBox <gi...@apache.org>.
pan3793 commented on PR #36991:
URL: https://github.com/apache/spark/pull/36991#issuecomment-1166719255

   cc @srowen @yaooqinn 




[GitHub] [spark] tgravescs commented on a diff in pull request #36991: [SPARK-39601][YARN] AllocationFailure should not be treated as exitCausedByApp when driver is shutting down

Posted by GitBox <gi...@apache.org>.
tgravescs commented on code in PR #36991:
URL: https://github.com/apache/spark/pull/36991#discussion_r913871806


##########
resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala:
##########
@@ -859,7 +859,8 @@ private[yarn] class YarnAllocator(
             // .com/apache/hadoop/blob/228156cfd1b474988bc4fedfbf7edddc87db41e3/had
             // oop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/ap
             // ache/hadoop/yarn/util/Apps.java#L273 for details)
-            if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status)) {
+            if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status) ||
+                SparkContext.getActive.forall(_.isStopped)) {

Review Comment:
   I'd prefer to know that we are in shutdown rather than checking whether the SparkContext is active, but I need to refresh my memory on that shutdown sequence.


