Posted to issues@spark.apache.org by "Shixiong Zhu (JIRA)" <ji...@apache.org> on 2016/10/21 21:45:58 UTC

[jira] [Resolved] (SPARK-17929) Deadlock when AM restart and send RemoveExecutor on reset

     [ https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shixiong Zhu resolved SPARK-17929.
----------------------------------
       Resolution: Fixed
         Assignee: Fei Wang
    Fix Version/s: 2.1.0
                   2.0.2

> Deadlock when AM restart and send RemoveExecutor on reset
> ---------------------------------------------------------
>
>                 Key: SPARK-17929
>                 URL: https://issues.apache.org/jira/browse/SPARK-17929
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.0.0
>            Reporter: Weizhong
>            Assignee: Fei Wang
>            Priority: Minor
>             Fix For: 2.0.2, 2.1.0
>
>
> We fixed SPARK-10582 and added a reset() method in CoarseGrainedSchedulerBackend.scala:
> {code}
>   protected def reset(): Unit = synchronized {
>     numPendingExecutors = 0
>     executorsPendingToRemove.clear()
>     // Remove all the lingering executors that should be removed but not yet. The reason might be
>     // because (1) disconnected event is not yet received; (2) executors die silently.
>     executorDataMap.toMap.foreach { case (eid, _) =>
>       driverEndpoint.askWithRetry[Boolean](
>         RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
>     }
>   }
> {code}
> but removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized", which reset() already holds, so this deadlocks: the RemoveExecutor RPC fails and the reset fails (see the sketch after the quoted code)
> {code}
>     private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
>       logDebug(s"Asked to remove executor $executorId with reason $reason")
>       executorDataMap.get(executorId) match {
>         case Some(executorInfo) =>
>           // This must be synchronized because variables mutated
>           // in this block are read when requesting executors
>           val killed = CoarseGrainedSchedulerBackend.this.synchronized {
>             addressToExecutorId -= executorInfo.executorAddress
>             executorDataMap -= executorId
>             executorsPendingLossReason -= executorId
>             executorsPendingToRemove.remove(executorId).getOrElse(false)
>           }
>      ...
> {code}
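The cycle above comes from sending a blocking RPC while holding the backend lock. A minimal, self-contained sketch of one way to break it (not the committed Spark patch; class and field names only loosely mirror CoarseGrainedSchedulerBackend) is to snapshot the executor ids while holding the lock and send the removals only after the lock is released:

{code}
// Standalone Scala sketch, not Spark source: shows the "snapshot under the
// lock, send outside the lock" pattern that avoids the reset()/removeExecutor
// deadlock described above.
import scala.collection.mutable

class SchedulerBackendSketch {
  private val executorDataMap = mutable.HashMap[String, AnyRef]()
  private val executorsPendingToRemove = mutable.HashMap[String, Boolean]()
  private var numPendingExecutors = 0

  // Stand-in for the driver endpoint's RemoveExecutor handler: like the real
  // handler, it takes the same backend lock to mutate shared state.
  private def removeExecutor(executorId: String, reason: String): Unit = {
    val killed = synchronized {
      executorDataMap -= executorId
      executorsPendingToRemove.remove(executorId).getOrElse(false)
    }
    // ... notify listeners, revive offers, etc.
  }

  // Deadlock-free reset: the lock is held only long enough to clear the
  // pending state and copy the executor ids; the removal calls happen after
  // the lock is released, so they can re-acquire it without blocking.
  protected def reset(): Unit = {
    val staleExecutors: Set[String] = synchronized {
      numPendingExecutors = 0
      executorsPendingToRemove.clear()
      executorDataMap.keys.toSet
    }
    staleExecutors.foreach { eid =>
      removeExecutor(eid, "Stale executor after cluster manager re-registered.")
    }
  }
}
{code}

The key design point is lock ordering: never block on a call that itself needs the same monitor while you are inside a synchronized block.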



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org