Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/03/12 17:09:15 UTC

[GitHub] [spark] squito commented on a change in pull request #24035: [SPARK-27112] : Spark Scheduler encounters two independent Deadlocks …

squito commented on a change in pull request #24035: [SPARK-27112] : Spark Scheduler encounters two independent Deadlocks …
URL: https://github.com/apache/spark/pull/24035#discussion_r264787223
 
 

 ##########
 File path: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
 ##########
 @@ -622,67 +633,107 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
    * @param countFailures if there are tasks running on the executors when they are killed, whether
    *                      those failures be counted to task failure limits?
    * @param force whether to force kill busy executors, default false
+   * @param blacklistingOnTaskCompletion whether the executors are being killed due to
+   *                                     blacklisting triggered by the task completion event
    * @return the ids of the executors acknowledged by the cluster manager to be removed.
    */
   final override def killExecutors(
       executorIds: Seq[String],
       adjustTargetNumExecutors: Boolean,
       countFailures: Boolean,
-      force: Boolean): Seq[String] = {
+      force: Boolean,
+    blacklistingOnTaskCompletion: Boolean): Seq[String] = {
     logInfo(s"Requesting to kill executor(s) ${executorIds.mkString(", ")}")
 
-    val response = synchronized {
-      val (knownExecutors, unknownExecutors) = executorIds.partition(executorDataMap.contains)
-      unknownExecutors.foreach { id =>
-        logWarning(s"Executor to kill $id does not exist!")
-      }
-
-      // If an executor is already pending to be removed, do not kill it again (SPARK-9795)
-      // If this executor is busy, do not kill it unless we are told to force kill it (SPARK-9552)
-      val executorsToKill = knownExecutors
-        .filter { id => !executorsPendingToRemove.contains(id) }
-        .filter { id => force || !scheduler.isExecutorBusy(id) }
-      executorsToKill.foreach { id => executorsPendingToRemove(id) = !countFailures }
-
-      logInfo(s"Actual list of executor(s) to be killed is ${executorsToKill.mkString(", ")}")
-
-      // If we do not wish to replace the executors we kill, sync the target number of executors
-      // with the cluster manager to avoid allocating new ones. When computing the new target,
-      // take into account executors that are pending to be added or removed.
-      val adjustTotalExecutors =
-        if (adjustTargetNumExecutors) {
-          requestedTotalExecutors = math.max(requestedTotalExecutors - executorsToKill.size, 0)
-          if (requestedTotalExecutors !=
-              (numExistingExecutors + numPendingExecutors - executorsPendingToRemove.size)) {
-            logDebug(
-              s"""killExecutors($executorIds, $adjustTargetNumExecutors, $countFailures, $force):
-                 |Executor counts do not match:
-                 |requestedTotalExecutors  = $requestedTotalExecutors
-                 |numExistingExecutors     = $numExistingExecutors
-                 |numPendingExecutors      = $numPendingExecutors
-                 |executorsPendingToRemove = ${executorsPendingToRemove.size}""".stripMargin)
-          }
-          doRequestTotalExecutors(requestedTotalExecutors)
-        } else {
-          numPendingExecutors += executorsToKill.size
-          Future.successful(true)
+    var response: Future[Seq[String]] = null
+    val idleExecutorIds = executorIds.filter { id => !scheduler.isExecutorBusy(id) }
+    if (!blacklistingOnTaskCompletion) {
 
 Review comment:
   I don't think that `makeOffersLock` solves the deadlock here.  You won't get a deadlock between the same two locks, but now it can be with `makeOffersLock` instead.  Consider this sequence (a simplification of the full call stack, but showing the important locks at least); a stripped-down sketch of the interleaving follows the list.
   
   1) taskResultGetter: handleFailedTask --> lock on TaskSchedulerImpl
   2) taskResultGetter: BlacklistTracker.killExecutor
   
   3) dispatcher: receive --> lock on CoarseGrainedSchedulerBackend
   4) dispatcher: makeOffers --> lock on makeOffersLock
   5) dispatcher: blocked on TaskSchedulerImpl lock
   
   6) taskResultGetter: makeOffers, but blocked on makeOffersLock
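   
   For illustration only, here is a stand-in for that interleaving in plain Scala. The `Object` monitors and thread bodies below are placeholders for the real locks and call paths, not the actual Spark code, and the `Thread.sleep` calls are only there to force the problematic timing:
   
   ```scala
   object DeadlockSketch {
     // Plain monitors standing in for the real ones; the names are illustrative.
     private val taskSchedulerLock = new Object // TaskSchedulerImpl's monitor
     private val backendLock       = new Object // CoarseGrainedSchedulerBackend's monitor
     private val makeOffersLock    = new Object // the proposed makeOffersLock
   
     def main(args: Array[String]): Unit = {
       // Thread A mimics the task-result-getter path (steps 1, 2 and 6).
       val resultGetter = new Thread(() => {
         taskSchedulerLock.synchronized {     // handleFailedTask holds the scheduler lock
           Thread.sleep(100)                  // let the dispatcher thread get ahead
           makeOffersLock.synchronized {      // makeOffers via killExecutor: blocks here
             println("result getter made offers")
           }
         }
       })
   
       // Thread B mimics the dispatcher path (steps 3, 4 and 5).
       val dispatcher = new Thread(() => {
         backendLock.synchronized {           // receive holds the backend lock
           makeOffersLock.synchronized {      // makeOffers takes makeOffersLock
             Thread.sleep(100)
             taskSchedulerLock.synchronized { // needs the scheduler lock: blocks here
               println("dispatcher made offers")
             }
           }
         }
       })
   
       resultGetter.start(); dispatcher.start()
       resultGetter.join(); dispatcher.join() // typically never returns: an ABBA deadlock
     }
   }
   ```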
   
   As Attila suggested, I would consider creating an ordering between the TaskSchedulerImpl lock and the CoarseGrainedSchedulerBackend lock, so that we always get the TaskSchedulerImpl lock first.  Of course that comes with a performance penalty, and we will have to audit all other uses of the CoarseGrainedSchedulerBackend lock too.
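   
   A rough sketch of what that ordering discipline could look like; the names, wrapper, and call site below are placeholders rather than the actual Spark objects or any existing API:
   
   ```scala
   object LockOrderingSketch {
     private val schedulerLock = new Object // stands in for the TaskSchedulerImpl monitor
     private val backendLock   = new Object // stands in for the CoarseGrainedSchedulerBackend monitor
   
     // Rule: any path that needs both monitors takes the scheduler's first.
     def withBothLocks[T](body: => T): T =
       schedulerLock.synchronized {   // outer lock, always acquired first
         backendLock.synchronized {   // inner lock, never taken before the outer one
           body
         }
       }
   
     // Hypothetical call site: checking busy state and marking executors
     // pending-to-remove would happen inside the ordered critical section.
     def killIdleExecutors(ids: Seq[String]): Seq[String] =
       withBothLocks {
         ids // the real code would filter on busy state here
       }
   }
   ```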
   
   Still thinking about any other options ...
