Posted to issues@spark.apache.org by "Shixiong Zhu (JIRA)" <ji...@apache.org> on 2016/07/15 21:29:20 UTC

[jira] [Updated] (SPARK-16230) Executors self-killing after being assigned tasks while still in init

     [ https://issues.apache.org/jira/browse/SPARK-16230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu updated SPARK-16230:
---------------------------------
         Assignee: Tejas Patil
    Fix Version/s: 2.1.0
                   2.0.1

> Executors self-killing after being assigned tasks while still in init
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16230
>                 URL: https://issues.apache.org/jira/browse/SPARK-16230
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.0.1, 2.1.0
>
>
> I see this happening frequently in our prod clusters:
> * EXECUTOR: [CoarseGrainedExecutorBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L61] sends a request to register itself with the driver.
> * DRIVER: Registers the executor and [replies|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L179].
> * EXECUTOR: The ExecutorBackend receives the ACK and [starts creating an Executor|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L81].
> * DRIVER: Tries to launch a task since it knows there is a new executor. Sends a [LaunchTask|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L268] to this new executor.
> * EXECUTOR: The Executor is not init'ed yet (one reason I have seen: it was still trying to register with the local external shuffle service). Meanwhile, it receives a `LaunchTask` and [kills itself|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L90] because the Executor is not init'ed.
> The driver assumes that the Executor is ready to accept tasks as soon as it is registered, but that's not true; a minimal sketch of the race is below.
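> A minimal, self-contained Scala sketch of this race (my own illustration; class and method names are simplified and only approximate the linked CoarseGrainedExecutorBackend code):
> {code:scala}
> // Simplified model of the executor backend's message handling.
> // `executor` stays null until registration/init has fully completed.
> object ExecutorInitRaceSketch {
>   sealed trait Message
>   case object RegisteredExecutor extends Message
>   final case class LaunchTask(taskId: Long) extends Message
>
>   final class Executor {
>     def launchTask(taskId: Long): Unit = println(s"running task $taskId")
>   }
>
>   @volatile private var executor: Executor = null
>
>   def receive(msg: Message): Unit = msg match {
>     case RegisteredExecutor =>
>       // In the real backend this step can be slow or fail (e.g. while
>       // registering with the local external shuffle service), leaving a
>       // window where the driver already considers the executor usable.
>       executor = new Executor
>
>     case LaunchTask(taskId) =>
>       if (executor == null) {
>         // Mirrors the self-kill described above: the process exits instead
>         // of waiting for init, and never reports the task back to the driver.
>         System.err.println(s"Received LaunchTask $taskId but executor was null; exiting")
>         sys.exit(1)
>       } else {
>         executor.launchTask(taskId)
>       }
>   }
>
>   def main(args: Array[String]): Unit = {
>     // Simulate the bad interleaving: the task arrives before init completes.
>     receive(LaunchTask(0L))
>     receive(RegisteredExecutor)
>   }
> }
> {code}
> With that ordering, the process exits with status 1 and the driver is left waiting on a task that was never started.
>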
> How this affects jobs / cluster:
> * We waste time and resources on these executors, but they never do any meaningful computation.
> * The driver thinks that the executor has started running the task, but since the Executor has killed itself, it never tells the driver (BTW: that is another issue which I think could be fixed separately). The driver waits for 10 mins and then declares the executor dead. This adds to the latency of the job. Plus, the failure attempt count gets bumped up for those tasks even though they were never started. For unlucky tasks, this might cause job failure.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org