Posted to issues@spark.apache.org by "Weizhong (JIRA)" <ji...@apache.org> on 2016/10/14 03:16:21 UTC
[jira] [Updated] (SPARK-17929) Deadlock when AM restart and send
RemoveExecutor on reset
[ https://issues.apache.org/jira/browse/SPARK-17929?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Weizhong updated SPARK-17929:
-----------------------------
Summary: Deadlock when AM restart and send RemoveExecutor on reset (was: Deadlock when AM restart send RemoveExecutor)
> Deadlock when AM restart and send RemoveExecutor on reset
> ---------------------------------------------------------
>
> Key: SPARK-17929
> URL: https://issues.apache.org/jira/browse/SPARK-17929
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.0.0
> Reporter: Weizhong
> Priority: Minor
>
> The fix for SPARK-10582 added a reset() method to CoarseGrainedSchedulerBackend.scala:
> {code}
> protected def reset(): Unit = synchronized {
>   numPendingExecutors = 0
>   executorsPendingToRemove.clear()
>   // Remove all the lingering executors that should be removed but not yet. The reason might be
>   // because (1) disconnected event is not yet received; (2) executors die silently.
>   executorDataMap.toMap.foreach { case (eid, _) =>
>     driverEndpoint.askWithRetry[Boolean](
>       RemoveExecutor(eid, SlaveLost("Stale executor after cluster manager re-registered.")))
>   }
> }
> {code}
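> However, removeExecutor also needs the lock "CoarseGrainedSchedulerBackend.this.synchronized". Because reset() holds that lock while it waits for the RemoveExecutor reply, the two block each other: this deadlock causes the RPC to fail, and reset() fails with it.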
> {code}
> private def removeExecutor(executorId: String, reason: ExecutorLossReason): Unit = {
>   logDebug(s"Asked to remove executor $executorId with reason $reason")
>   executorDataMap.get(executorId) match {
>     case Some(executorInfo) =>
>       // This must be synchronized because variables mutated
>       // in this block are read when requesting executors
>       val killed = CoarseGrainedSchedulerBackend.this.synchronized {
>         addressToExecutorId -= executorInfo.executorAddress
>         executorDataMap -= executorId
>         executorsPendingLossReason -= executorId
>         executorsPendingToRemove.remove(executorId).getOrElse(false)
>       }
>       ...
> {code}
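
The locking pattern described above can be reproduced outside Spark. The following is a minimal sketch, not Spark code: ToyBackend, askRemove, resetHoldingLock and resetSnapshotFirst are hypothetical names, a single-threaded pool stands in for the driver endpoint, and a timeout stands in for the failing RPC so that the demo terminates instead of hanging. The second variant shows one possible way to avoid the problem (copy the executor ids under the lock, then ask outside it); it is an assumption for illustration, not necessarily the change made for this issue.

```scala
import java.util.concurrent.{Executors, TimeoutException}
import scala.collection.mutable
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Hypothetical toy model of the scheduler backend; none of this is Spark's API.
class ToyBackend {
  private val executorDataMap = mutable.HashMap("1" -> "host-a", "2" -> "host-b")

  // One thread stands in for the driver endpoint's message loop.
  private val endpointPool = Executors.newSingleThreadExecutor()
  private implicit val endpoint: ExecutionContext =
    ExecutionContext.fromExecutorService(endpointPool)

  // Like removeExecutor in the issue: it needs the backend's monitor.
  private def removeExecutor(id: String): Boolean = this.synchronized {
    executorDataMap.remove(id).isDefined
  }

  // Blocking "ask": run the removal on the endpoint thread and wait for the reply.
  // A timeout is reported as false so the demo terminates instead of hanging.
  private def askRemove(id: String, timeout: FiniteDuration): Boolean =
    try Await.result(Future(removeExecutor(id)), timeout)
    catch { case _: TimeoutException => false }

  // Mirrors the reported pattern: wait for each reply while holding the monitor.
  def resetHoldingLock(timeout: FiniteDuration): Int = this.synchronized {
    executorDataMap.keys.toList.count(id => askRemove(id, timeout))
  }

  // One possible alternative: snapshot the ids under the lock, then ask outside it.
  def resetSnapshotFirst(timeout: FiniteDuration): Int = {
    val ids = this.synchronized { executorDataMap.keys.toList }
    ids.count(id => askRemove(id, timeout))
  }

  // Lets queued removals finish, then stops the endpoint thread.
  def shutdown(): Unit = endpointPool.shutdown()
}

object DeadlockDemo {
  def main(args: Array[String]): Unit = {
    val stuck = new ToyBackend
    println(s"holding lock: ${stuck.resetHoldingLock(200.millis)} removed in time")
    stuck.shutdown()

    val ok = new ToyBackend
    println(s"snapshot first: ${ok.resetSnapshotFirst(2.seconds)} removed in time")
    ok.shutdown()
  }
}
```

In the first variant the endpoint thread cannot enter removeExecutor until the caller leaves the synchronized block, so every ask times out and no executor is removed within the timeout. In the second variant the lock is released before the first ask, so both removals should complete.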