Posted to issues@spark.apache.org by "Weizhong (JIRA)" <ji...@apache.org> on 2016/10/14 02:57:20 UTC

[jira] [Created] (SPARK-17929) Deadlock when AM restart send RemoveExecutor

Weizhong created SPARK-17929:
--------------------------------

             Summary: Deadlock when AM restart send RemoveExecutor
                 Key: SPARK-17929
                 URL: https://issues.apache.org/jira/browse/SPARK-17929
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.0.0
            Reporter: Weizhong
            Priority: Minor


The fix for SPARK-10582 added a reset() method to CoarseGrainedSchedulerBackend.scala:
{code}
  protected def reset(): Unit = synchronized {
    numPendingExecutors = 0
    executorsPendingToRemove.clear()

    // Remove all the lingering executors that should be removed but not yet. The reason might be
    // because (1) disconnected event is not yet received; (2) executors die silently.
    executorDataMap.toMap.foreach { case (eid, _) =>
      driverEndpoint.askWithRetry[Boolean](
        RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
    }
  }
{code}
However, removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized". Because reset() already holds that lock while it waits for the RemoveExecutor reply, this causes a deadlock: the RPC fails, and so does reset().
{code}
    private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
      logDebug(s"Asked to remove executor $executorId with reason $reason")
      executorDataMap.get(executorId) match {
        case Some(executorInfo) =>
          // This must be synchronized because variables mutated
          // in this block are read when requesting executors
          val killed = CoarseGrainedSchedulerBackend.this.synchronized {
            addressToExecutorId -= executorInfo.executorAddress
            executorDataMap -= executorId
            executorsPendingLossReason -= executorId
            executorsPendingToRemove.remove(executorId).getOrElse(false)
          }
     ...
{code}
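
The lock interaction described above can be reproduced outside Spark. The following is a minimal sketch, not Spark code: ToyBackend, askRemove, resetHoldingLock and resetOutsideLock are hypothetical names, a single-threaded executor stands in for the driver endpoint, and Await.result stands in for the blocking askWithRetry call. It contrasts blocking on the reply while holding the lock with one possible alternative, which is to snapshot the executor IDs under the lock and send the requests after releasing it.

```scala
import java.util.concurrent.{Executors, TimeoutException}
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical stand-in for the scheduler backend; not Spark code.
class ToyBackend {
  private val executorDataMap = mutable.HashMap("1" -> "host-a", "2" -> "host-b")

  // A single thread standing in for the driver endpoint's message loop.
  private val endpointThread = Executors.newSingleThreadExecutor()
  private implicit val endpointContext: ExecutionContext =
    ExecutionContext.fromExecutor(endpointThread)

  // Runs on the endpoint thread and, like removeExecutor, needs the backend lock.
  private def askRemove(eid: String): Future[Boolean] = Future {
    ToyBackend.this.synchronized { executorDataMap.remove(eid).isDefined }
  }

  // Mirrors the reported pattern: waits for each reply while holding the lock.
  // The endpoint thread can never enter its synchronized block, so the wait times out.
  def resetHoldingLock(timeout: FiniteDuration): Boolean = synchronized {
    try {
      executorDataMap.keys.toList.foreach(eid => Await.result(askRemove(eid), timeout))
      true
    } catch {
      case _: TimeoutException => false
    }
  }

  // Alternative: take a snapshot of the IDs under the lock, then ask without holding it.
  def resetOutsideLock(timeout: FiniteDuration): Boolean = {
    val ids = synchronized { executorDataMap.keys.toList }
    try {
      ids.foreach(eid => Await.result(askRemove(eid), timeout))
      true
    } catch {
      case _: TimeoutException => false
    }
  }

  def remaining: Int = synchronized { executorDataMap.size }

  def shutdown(): Unit = endpointThread.shutdown()
}

object DeadlockSketch {
  def main(args: Array[String]): Unit = {
    val backend = new ToyBackend
    println(s"reset while holding lock succeeded: ${backend.resetHoldingLock(500.millis)}")
    println(s"reset outside lock succeeded: ${backend.resetOutsideLock(5.seconds)}")
    println(s"executors remaining: ${backend.remaining}")
    backend.shutdown()
  }
}
```

In this toy model the first call cannot complete until its timeout expires, because the thread that would answer is waiting for the lock the caller holds. Whether the same restructuring is appropriate for reset() itself depends on which state must stay consistent under the lock, which this sketch does not attempt to capture.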



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
