Posted to issues@spark.apache.org by "Atul Anand (JIRA)" <ji...@apache.org> on 2019/05/27 09:16:00 UTC

[jira] [Commented] (SPARK-13182) Spark Executor retries infinitely

    [ https://issues.apache.org/jira/browse/SPARK-13182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16848749#comment-16848749 ] 

Atul Anand commented on SPARK-13182:
------------------------------------

[~srowen] The issue here is that Spark does not count these exits as failures, and so keeps retrying.

I have hit this infinite retry in a valid scenario; see [here|https://stackoverflow.com/questions/56236216/spark-keeps-relaunching-executors-after-yarn-kills-them].

Basically, YARN preempted the Spark containers because they were running in a lower-priority queue.

But Spark restarted the containers right away, and YARN killed them again.

Spark should have hit the max-failures count after a few kills, but it does not count these exits as failures.
{noformat}
2019-05-20 03:40:07 [dispatcher-event-loop-0] INFO TaskSetManager :54 Task 95 failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.{noformat}
Hence Spark keeps relaunching containers while YARN keeps killing them.
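The loop described above can be sketched with a small illustrative model. This is pseudologic for exposition only, not Spark's actual TaskSetManager code; the names ExitReason, should_count_failure, and retries_exhausted are invented for the example:

```python
# Illustrative sketch (not Spark source): why externally-caused executor
# exits, such as YARN preemption, never exhaust the task retry budget.
from enum import Enum, auto

class ExitReason(Enum):
    TASK_FAILURE = auto()   # the task itself failed
    EXTERNAL_KILL = auto()  # e.g. YARN preempted the container

def should_count_failure(reason: ExitReason) -> bool:
    # Spark counts only failures attributable to the task; exits for
    # "a reason unrelated to the task" (per the log line above) are
    # excluded from the count.
    return reason is ExitReason.TASK_FAILURE

def retries_exhausted(exit_reasons, max_failures=4) -> bool:
    counted = sum(should_count_failure(r) for r in exit_reasons)
    return counted >= max_failures

# Ten preemption kills in a row: the counted failures stay at zero,
# so the scheduler keeps relaunching executors indefinitely.
print(retries_exhausted([ExitReason.EXTERNAL_KILL] * 10))  # False
```

With this counting rule, repeated preemptions never trip the max-failures limit, which matches the relaunch loop reported above.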

> Spark Executor retries infinitely
> ---------------------------------
>
>                 Key: SPARK-13182
>                 URL: https://issues.apache.org/jira/browse/SPARK-13182
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.5.2
>            Reporter: Prabhu Joseph
>            Priority: Minor
>
>   When a Spark job (Spark-1.5.2) is submitted with a single executor and the user passes invalid JVM arguments via spark.executor.extraJavaOptions, the first executor fails. But the job keeps retrying, creating a new executor and failing every time, until CTRL-C is pressed. 
> ./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077"  --conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16" /SPARK/SimpleApp.jar
> Here the user submits the job with ConcGCThreads=16, which is greater than ParallelGCThreads, so the JVM crashes. But the job does not exit; it keeps creating executors and retrying.
> ..........
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now RUNNING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor app-20160201065319-0014/2846 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 2846
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558 (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2847 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor app-20160201065319-0014/2847 removed: Command exited with code 1
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove non-existent executor 2847
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added: app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558 (10.10.72.145:36558) with 12 cores
> 16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores, 2.0 GB RAM
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now LOADING
> 16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: app-20160201065319-0014/2848 is now RUNNING
> Spark should not fall into a trap on this kind of user error on a production cluster.
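Per the report above, the crash comes from passing a ConcGCThreads value larger than ParallelGCThreads in spark.executor.extraJavaOptions. A pre-submit sanity check could catch that before the driver enters the relaunch loop. The helper below (check_gc_thread_flags) is hypothetical, written for illustration only:

```python
import re

def check_gc_thread_flags(extra_java_options: str) -> bool:
    """Return True if ConcGCThreads does not exceed ParallelGCThreads.

    Hypothetical pre-submit sanity check. If ParallelGCThreads is not
    set explicitly the JVM derives it from the CPU count, so only the
    case where both flags are given and inconsistent is flagged here.
    """
    def flag(name):
        m = re.search(r"-XX:%s=(\d+)" % name, extra_java_options)
        return int(m.group(1)) if m else None

    conc = flag("ConcGCThreads")
    par = flag("ParallelGCThreads")
    if conc is None or par is None:
        return True  # cannot decide without both values
    return conc <= par

# ConcGCThreads=16 with an explicit ParallelGCThreads=12 (an assumed
# value for illustration) is the inconsistent combination the JVM
# rejects, per the report above.
opts = ("-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 "
        "-XX:ConcGCThreads=16 -XX:ParallelGCThreads=12")
print(check_gc_thread_flags(opts))  # False
```

Failing fast at submit time would avoid the executor-relaunch loop entirely for this class of misconfiguration.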



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
