Posted to user@spark.apache.org by Prabhu Joseph <pr...@gmail.com> on 2016/02/01 13:16:26 UTC

Spark Executor retries infinitely

Hi All,

  When a Spark job (Spark 1.5.2) is submitted with a single executor and the
user passes invalid JVM arguments via spark.executor.extraJavaOptions,
the first executor fails. But the job keeps retrying, creating a new
executor and failing every time, until CTRL-C is pressed. Is there a
configuration to limit the retry attempts?

*Example:*

./spark-submit --class SimpleApp --master "spark://10.10.72.145:7077"
--conf "spark.executor.extraJavaOptions=-XX:+PrintGCDetails
-XX:+PrintGCTimeStamps -XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35
-XX:ConcGCThreads=16" /SPARK/SimpleApp.jar

Executor fails with

Error occurred during initialization of VM
Can't have more ConcGCThreads than ParallelGCThreads.

But the job does not exit; it keeps creating executors and retrying:
..........
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20160201065319-0014/2846 on hostPort 10.10.72.145:36558 with 12 cores,
2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
app-20160201065319-0014/2846 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
app-20160201065319-0014/2846 is now RUNNING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor
app-20160201065319-0014/2846 removed: Command exited with code 1
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove
non-existent executor 2846
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added:
app-20160201065319-0014/2847 on worker-20160131230345-10.10.72.145-36558
(10.10.72.145:36558) with 12 cores
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20160201065319-0014/2847 on hostPort 10.10.72.145:36558 with 12 cores,
2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
app-20160201065319-0014/2847 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Executor
app-20160201065319-0014/2847 removed: Command exited with code 1
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Asked to remove
non-existent executor 2847
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor added:
app-20160201065319-0014/2848 on worker-20160131230345-10.10.72.145-36558
(10.10.72.145:36558) with 12 cores
16/02/01 06:54:28 INFO SparkDeploySchedulerBackend: Granted executor ID
app-20160201065319-0014/2848 on hostPort 10.10.72.145:36558 with 12 cores,
2.0 GB RAM
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
app-20160201065319-0014/2848 is now LOADING
16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated:
app-20160201065319-0014/2848 is now RUNNING
............



Thanks,
Prabhu Joseph

Re: Spark Executor retries infinitely

Posted by Prabhu Joseph <pr...@gmail.com>.

Thanks Ted. My concern is how to avoid this kind of user error on a
production cluster. It would be better if Spark handled this itself instead of
creating an executor every second, having it fail, and overloading the Spark
Master. Shall I file a Spark JIRA for this?
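
Until something like that exists in Spark itself, one possible stopgap is to watch the driver output and stop the job after repeated executor failures. The sketch below is only an illustration: the function names and the max_failures threshold are hypothetical, and it assumes each log record arrives as a single unwrapped line in the format shown in the driver log above.

```python
import re

# Matches the executor-exit record from the driver log, for example:
#   ... Executor updated: app-.../2846 is now EXITED (Command exited with code 1)
EXIT_LINE = re.compile(
    r"Executor updated: (\S+) is now EXITED \(Command exited with code (\d+)\)"
)


def failed_executors(log_lines):
    """Return the IDs of executors that exited with a non-zero code."""
    failed = []
    for line in log_lines:
        match = EXIT_LINE.search(line)
        if match and int(match.group(2)) != 0:
            failed.append(match.group(1))
    return failed


def should_abort(log_lines, max_failures=3):
    """True once at least max_failures executors have failed (hypothetical policy)."""
    return len(failed_executors(log_lines)) >= max_failures


# Two failing executors taken from the log in the original message.
sample = [
    "16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: "
    "app-20160201065319-0014/2846 is now RUNNING",
    "16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: "
    "app-20160201065319-0014/2846 is now EXITED (Command exited with code 1)",
    "16/02/01 06:54:28 INFO AppClient$ClientEndpoint: Executor updated: "
    "app-20160201065319-0014/2847 is now EXITED (Command exited with code 1)",
]

print(failed_executors(sample))
print(should_abort(sample, max_failures=2))
```

A wrapper around spark-submit could feed its output through should_abort and kill the process when it returns True; that wrapper is not shown here and would need tuning for jobs where occasional executor loss is normal.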


Thanks,
Prabhu Joseph


On Mon, Feb 1, 2016 at 9:09 PM, Ted Yu <yu...@gmail.com> wrote:

> I haven't found a config knob for controlling the retry count after a brief
> search.
>
> According to
> http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html , the
> default value for -XX:ParallelGCThreads seems to be 8.
> Your -XX:ConcGCThreads=16 exceeds that, which would explain the VM
> initialization error.
>
> FYI
>

Re: Spark Executor retries infinitely

Posted by Ted Yu <yu...@gmail.com>.

I haven't found a config knob for controlling the retry count after a brief
search.

According to
http://www.oracle.com/technetwork/articles/java/g1gc-1984535.html , the default
value for -XX:ParallelGCThreads seems to be 8.
Your -XX:ConcGCThreads=16 exceeds that, which would explain the VM
initialization error.
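
As a rough illustration of that constraint, the options string could be checked before submitting. This is a minimal sketch, not anything Spark or the JVM provides: the function name is made up, and the default of 8 is just the value read from the article above (the real default may differ by JVM version and CPU count, so it is left as a parameter).

```python
import re


def gc_thread_conflict(java_opts, default_parallel=8):
    """Return a message if ConcGCThreads exceeds ParallelGCThreads, else None.

    default_parallel is an assumed fallback used when the options do not
    set -XX:ParallelGCThreads explicitly.
    """
    def flag(name):
        match = re.search(r"-XX:" + name + r"=(\d+)", java_opts)
        return int(match.group(1)) if match else None

    conc = flag("ConcGCThreads")
    parallel = flag("ParallelGCThreads")
    if parallel is None:
        parallel = default_parallel
    if conc is not None and conc > parallel:
        return "ConcGCThreads=%d exceeds ParallelGCThreads=%d" % (conc, parallel)
    return None


# The options from the spark-submit command in the original message.
opts = ("-XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC "
        "-XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=16")

print(gc_thread_conflict(opts))  # ConcGCThreads=16 exceeds ParallelGCThreads=8
```

Under the same assumption, adding -XX:ParallelGCThreads=16 to the options or lowering ConcGCThreads makes the check pass. It only covers this one rule; other invalid JVM flags would still go through.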

FYI
