Posted to issues@spark.apache.org by "Hyukjin Kwon (Jira)" <ji...@apache.org> on 2020/10/14 06:16:00 UTC

[jira] [Commented] (SPARK-33085) "Master removed our application" error leads to FAILED driver status instead of KILLED driver status

    [ https://issues.apache.org/jira/browse/SPARK-33085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213647#comment-17213647 ] 

Hyukjin Kwon commented on SPARK-33085:
--------------------------------------

Can you show the reproducible steps?

> "Master removed our application" error leads to FAILED driver status instead of KILLED driver status
> ----------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-33085
>                 URL: https://issues.apache.org/jira/browse/SPARK-33085
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.4.6
>            Reporter: t oo
>            Priority: Major
>
>  
> driver-20200930160855-0316 exited with status FAILED
>  
> I am using the Spark Standalone scheduler with spot EC2 workers. I confirmed that the myip.87 EC2 instance was terminated at 2020-09-30 16:16.
>  
> *I would expect the overall driver status to be KILLED, but instead it was FAILED.* My goal is to interpret a FAILED status as "do not rerun; a non-transient error occurred" and a KILLED/ERROR status as "yes, rerun; a transient error occurred". But it looks like the FAILED status is being set in the transient-error case below (a status-check sketch follows the log excerpt):
>   
> Below are the driver logs:
> {code:java}
> 2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
> 2020-09-30 16:12:41,183 [main] INFO  com.yotpo.metorikku.output.writers.file.FileOutputWriter - Writing file to s3a://redacted
> 2020-09-30 16:16:40,366 [dispatcher-event-loop-15] ERROR org.apache.spark.scheduler.TaskSchedulerImpl - Lost executor 0 on myip.87: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
> 2020-09-30 16:16:40,372 [dispatcher-event-loop-15] WARN  org.apache.spark.scheduler.TaskSetManager - Lost task 0.0 in stage 6.0 (TID 6, myip.87, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
> 2020-09-30 16:16:40,376 [dispatcher-event-loop-13] WARN  org.apache.spark.storage.BlockManagerMasterEndpoint - No more replicas available for rdd_3_0 !
> 2020-09-30 16:16:40,398 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/0 removed: Worker shutting down
> 2020-09-30 16:16:40,399 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/1 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,401 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/1 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,402 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/2 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,403 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/2 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,404 [dispatcher-event-loop-11] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/3 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,405 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/3 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,406 [dispatcher-event-loop-1] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/4 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,407 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/4 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,408 [dispatcher-event-loop-12] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/5 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,409 [dispatcher-event-loop-4] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/5 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,410 [dispatcher-event-loop-5] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/6 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,420 [dispatcher-event-loop-9] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/6 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,421 [dispatcher-event-loop-9] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/7 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,423 [dispatcher-event-loop-15] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/7 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,424 [dispatcher-event-loop-15] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/8 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,425 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/8 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,425 [dispatcher-event-loop-2] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Granted executor ID app-20200930160902-0895/9 on hostPort myip.87:11647 with 2 core(s), 5.0 GB RAM
> 2020-09-30 16:16:40,427 [dispatcher-event-loop-14] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Executor app-20200930160902-0895/9 removed: java.lang.IllegalStateException: Shutdown hooks cannot be modified during shutdown.
> 2020-09-30 16:16:40,429 [dispatcher-event-loop-5] ERROR org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Application has been killed. Reason: Master removed our application: FAILED
> 2020-09-30 16:16:40,438 [main] ERROR org.apache.spark.sql.execution.datasources.FileFormatWriter - Aborting job 564822f2-f2fd-42cd-8d57-b6d5dff145f6.
> org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1891)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1879)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1878)
>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1878)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:927)
>     at scala.Option.foreach(Option.scala:257)
>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:927)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2112)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2061)
>     at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2050)
>     at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
>     at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:738)
>     at org.apache.spark.SparkContext.runJob(SparkContext.scala:2061)
>     at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:167)
>     at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
>     at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
>     at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
>     at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
>     at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
>     at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>     at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
>     at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
>     at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:83)
>     at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:81)
>     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
>     at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:677)
>     at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:80)
>     at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:127)
>     at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:75)
>     at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:677)
>     at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:286)
>     at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:272)
>     at com.yotpo.metorikku.output.writers.file.FileOutputWriter.save(FileOutputWriter.scala:134)
>     at com.yotpo.metorikku.output.writers.file.FileOutputWriter.write(FileOutputWriter.scala:65)
>     at com.yotpo.metorikku.metric.Metric.com$yotpo$metorikku$metric$Metric$$writeBatch(Metric.scala:97)
>     at com.yotpo.metorikku.metric.Metric$$anonfun$write$1.apply(Metric.scala:136)
>     at com.yotpo.metorikku.metric.Metric$$anonfun$write$1.apply(Metric.scala:125)
>     at scala.collection.immutable.List.foreach(List.scala:392)
>     at com.yotpo.metorikku.metric.Metric.write(Metric.scala:125)
>     at com.yotpo.metorikku.metric.MetricSet$$anonfun$run$1.apply(MetricSet.scala:44)
>     at com.yotpo.metorikku.metric.MetricSet$$anonfun$run$1.apply(MetricSet.scala:39)
>     at scala.collection.immutable.List.foreach(List.scala:392)
>     at com.yotpo.metorikku.metric.MetricSet.run(MetricSet.scala:39)
>     at com.yotpo.metorikku.Metorikku$$anonfun$runMetrics$1.apply(Metorikku.scala:17)
>     at com.yotpo.metorikku.Metorikku$$anonfun$runMetrics$1.apply(Metorikku.scala:15)
>     at scala.collection.immutable.List.foreach(List.scala:392)
>     at com.yotpo.metorikku.Metorikku$.runMetrics(Metorikku.scala:15)
>     at com.yotpo.metorikku.Metorikku$.delayedEndpoint$com$yotpo$metorikku$Metorikku$1(Metorikku.scala:11)
>     at com.yotpo.metorikku.Metorikku$delayedInit$body.apply(Metorikku.scala:7)
>     at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
>     at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
>     at scala.App$$anonfun$main$1.apply(App.scala:76)
>     at scala.App$$anonfun$main$1.apply(App.scala:76)
>     at scala.collection.immutable.List.foreach(List.scala:392)
>     at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
>     at scala.App$class.main(App.scala:76)
>     at com.yotpo.metorikku.Metorikku$.main(Metorikku.scala:7)
>     at com.yotpo.metorikku.Metorikku.main(Metorikku.scala)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.spark.deploy.worker.DriverWrapper$.main(DriverWrapper.scala:65)
>     at org.apache.spark.deploy.worker.DriverWrapper.main(DriverWrapper.scala)
> 2020-09-30 16:16:40,457 [stop-spark-context] INFO  org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend - Shutting down all executors
> 2020-09-30 16:16:40,461 [stop-spark-context] ERROR org.apache.spark.util.Utils - Uncaught exception in thread stop-spark-context
> org.apache.spark.SparkException: Exception thrown in awaitResult:
>     at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:226)
>     at org.apache.spark.rpc.RpcTimeout.awaitResult(RpcTimeout.scala:75)
>     at org.apache.spark.deploy.client.StandaloneAppClient.stop(StandaloneAppClient.scala:283)
>     at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.org$apache$spark$scheduler$cluster$StandaloneSchedulerBackend$$stop(StandaloneSchedulerBackend.scala:227)
>     at org.apache.spark.scheduler.cluster.StandaloneSchedulerBackend.stop(StandaloneSchedulerBackend.scala:124)
>     at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:669)
>     at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:2044)
>     at org.apache.spark.SparkContext$$anonfun$stop$6.apply$mcV$sp(SparkContext.scala:1949)
>     at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1340)
>     at org.apache.spark.SparkContext.stop(SparkContext.scala:1948)
>     at org.apache.spark.SparkContext$$anon$3.run(SparkContext.scala:1903)
> Caused by: org.apache.spark.SparkException: Could not find AppClient.
>     at org.apache.spark.rpc.netty.Dispatcher.postMessage(Dispatcher.scala:160)
>     at org.apache.spark.rpc.netty.Dispatcher.postLocalMessage(Dispatcher.scala:135)
>     at org.apache.spark.rpc.netty.NettyRpcEnv.ask(NettyRpcEnv.scala:229)
>     at org.apache.spark.rpc.netty.NettyRpcEndpointRef.ask(NettyRpcEnv.scala:523)
>     at org.apache.spark.rpc.RpcEndpointRef.ask(RpcEndpointRef.scala:63)
>     ... 9 more
> {code}
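
The rerun policy described above (rerun on KILLED/ERROR, skip on FAILED) would typically be driven from the driver state reported by the Standalone Master. Below is a minimal, illustrative sketch that polls the Master's REST submission endpoint (port 6066, assuming spark.master.rest.enabled is on) and maps the state to a rerun decision; the host name, driver ID, and the regex-based field extraction are assumptions for illustration, not part of the reported setup.

{code:scala}
import scala.io.Source
import scala.util.Try

// Illustrative sketch only: query the Standalone Master's REST submission API
// for a driver's state and decide whether to resubmit. The host/port and the
// naive regex extraction are assumptions, not the reporter's actual tooling.
object DriverRerunCheck {
  private val StatePattern = "\"driverState\"\\s*:\\s*\"(\\w+)\"".r

  def driverState(masterHost: String, driverId: String): Option[String] = {
    val url = s"http://$masterHost:6066/v1/submissions/status/$driverId"
    Try(Source.fromURL(url).mkString).toOption
      .flatMap(body => StatePattern.findFirstMatchIn(body).map(_.group(1)))
  }

  // The policy described in this ticket: FAILED means a non-transient error
  // (do not rerun); KILLED or ERROR means a transient error (rerun).
  def shouldRerun(state: String): Boolean = state match {
    case "KILLED" | "ERROR" => true
    case _                  => false
  }

  def main(args: Array[String]): Unit = {
    val state = driverState("spark-master.example.com", "driver-20200930160855-0316")
    state.foreach(s => println(s"driverState=$s rerun=${shouldRerun(s)}"))
  }
}
{code}

With wrapper logic like this around submission, the behaviour reported here (FAILED where KILLED would be expected after a spot instance termination) would incorrectly suppress the rerun, which is the crux of the issue.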



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org