Posted to issues@spark.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2016/06/27 18:58:51 UTC

[jira] [Created] (SPARK-16230) Executors self-killing after being assigned tasks while still in init

Tejas Patil created SPARK-16230:
-----------------------------------

             Summary: Executors self-killing after being assigned tasks while still in init
                 Key: SPARK-16230
                 URL: https://issues.apache.org/jira/browse/SPARK-16230
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
            Reporter: Tejas Patil
            Priority: Minor


I see this happening frequently in our prod clusters:

* EXECUTOR:   [CoarseGrainedExecutorBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L61] sends request to register itself to the driver.
* DRIVER: Registers executor and [replies|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L179]
* EXECUTOR:  ExecutorBackend receives ACK and [starts creating an Executor|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L81]
* DRIVER:  Tries to launch a task as it knows there is a new executor. Sends a [LaunchTask|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L268] to this new executor.
* EXECUTOR:  The Executor is not yet initialized (one reason I have seen: it was still trying to register with the local external shuffle service). Meanwhile, the backend receives a `LaunchTask` and [kills itself|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L90] because the Executor is not initialized.

The driver assumes that the Executor is ready to accept tasks as soon as it is registered, but that's not true.
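The race in the steps above can be illustrated with a minimal, self-contained Scala sketch. The names here only mirror the real CoarseGrainedExecutorBackend; this is simplified illustration code, not actual Spark code:

```scala
// Sketch of the race in SPARK-16230 (hypothetical simplified model,
// not the real Spark implementation).

sealed trait Message
case object RegisteredExecutor extends Message        // driver's registration ACK
case class LaunchTask(taskId: Long) extends Message   // driver schedules a task

class ExecutorBackendSketch {
  // Stays None until RegisteredExecutor has been fully processed.
  var executor: Option[String] = None
  var killedSelf = false

  def receive(msg: Message): Unit = msg match {
    case RegisteredExecutor =>
      // In the real backend, constructing the Executor can block for a
      // while, e.g. registering with the local external shuffle service.
      executor = Some("initialized")
    case LaunchTask(_) if executor.isEmpty =>
      // The bug: a task arrives before init finishes, so the backend
      // treats it as a fatal error and exits.
      killedSelf = true
    case LaunchTask(_) =>
      () // normal path: run the task
  }
}

object RaceDemo extends App {
  val backend = new ExecutorBackendSketch
  // The driver sends LaunchTask right after registering the executor,
  // before the executor has finished initializing:
  backend.receive(LaunchTask(0L))
  println(s"killedSelf = ${backend.killedSelf}")
}
```

With the messages in the other order (ACK processed first, then the task), the backend survives, which is exactly why the problem only shows up when the driver is fast enough to schedule a task mid-init.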

How this affects jobs / cluster:
* We waste time and resources on these executors, but they never do any meaningful computation.
* The driver thinks the executor has started running the task, but since the Executor has killed itself, it never reports back to the driver (BTW: that silent exit is arguably a separate issue that could be fixed on its own). The driver waits for 10 mins and then declares the executor dead, which adds to the latency of the job. On top of that, the failure count for the tasks gets bumped up even though the tasks never actually started; for unlucky tasks, this might cause job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
