You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/10/13 12:01:15 UTC

[GitHub] [spark] Ngone51 opened a new pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Ngone51 opened a new pull request #29817:
URL: https://github.com/apache/spark/pull/29817


   <!--
   Thanks for sending a pull request!  Here are some tips for you:
     1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
     2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
     3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
     4. Be sure to keep the PR description updated to reflect all changes.
     5. Please write your PR title to summarize what this PR proposes.
     6. If possible, provide a concise example to reproduce the issue for a faster review.
     7. If you want to add a new configuration, please read the guideline first for naming configurations in
        'core/src/main/scala/org/apache/spark/internal/config/ConfigEntry.scala'.
   -->
   
   ### What changes were proposed in this pull request?
   <!--
   Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue. 
   If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
     1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
     2. If you fix some SQL features, you can provide some references of other DBMSes.
     3. If there is design documentation, please add the link.
     4. If there is a discussion in the mailing list, please add the link.
   -->
   
   
   This PR cleans up the RPC message flow among the multiple decommission use cases, it includes changes:
   
   * Keep `Worker`'s decommission status be consistent between the case where decommission starts from `Worker` and the case where decommission starts from the `MasterWebUI`: sending `DecommissionWorker` from `Master` to `Worker` in the latter case.
   
   * Change from two-way communication to one-way communication when notifying decommission between driver and executor: it's obviously unnecessary for the executor to acknowledge the decommission status to the driver since the decommission request is from the driver. And it's same in reverse.
   
   * Only send one message instead of two(`DecommissionSelf`/`DecommissionBlockManager`) when decommission the executor: executor and `BlockManager` are in the same JVM.
   
   * Clean up codes around here.
   
   
   
   ### Why are the changes needed?
   <!--
   Please clarify why the changes are needed. For instance,
     1. If you propose a new API, clarify the use case for a new API.
     2. If you fix a bug, you can clarify why it is a bug.
   -->
   
   Before:
   
   <img width="1948" alt="WeChat56c00cc34d9785a67a544dca036d49da" src="https://user-images.githubusercontent.com/16397174/92850308-dc461c80-f41e-11ea-8ac0-287825f4e0c4.png">
   
   After:
   <img width="1968" alt="WeChat05f7afb017e3f0132394c5e54245e49e" src="https://user-images.githubusercontent.com/16397174/93189571-de88dd80-f774-11ea-9300-1943920aa27d.png">
   
   
   
   
   (Note the diagrams only counts those RPC calls that needed to go through the network. Local RPC calls are not counted here.)
   
   After this change, We reduced 6 original RPC calls and added one more RPC call for keeping the consistent decommission status for the Worker. And the RPC flow becomes more clear.
   
   
   ### Does this PR introduce _any_ user-facing change?
   <!--
   Note that it means *any* user-facing change including all aspects such as the documentation fix.
   If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
   If possible, please also clarify if this is a user-facing change compared to the released Spark versions or within the unreleased branches such as master.
   If no, write 'No'.
   -->
   
   No.
   
   
   ### How was this patch tested?
   <!--
   If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
   If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
   If tests were not added, please describe why they were not added and/or why it was difficult to add.
   -->
   
   Updated existing tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699965753


   **[Test build #129182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129182/testReport)** for PR 29817 at commit [`e3ca2af`](https://github.com/apache/spark/commit/e3ca2aff082ecaf2850030cab56fe083d28142f2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r509902290



##########
File path: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##########
@@ -465,72 +464,50 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
    * @param executorsAndDecomInfo Identifiers of executors & decommission info.
    * @param adjustTargetNumExecutors whether the target number of executors will be adjusted down
    *                                 after these executors have been decommissioned.
+   * @param triggeredByExecutor whether the decommission is triggered at executor.
    * @return the ids of the executors acknowledged by the cluster manager to be removed.
    */
   override def decommissionExecutors(
       executorsAndDecomInfo: Array[(String, ExecutorDecommissionInfo)],
-      adjustTargetNumExecutors: Boolean): Seq[String] = {
-
+      adjustTargetNumExecutors: Boolean,
+      triggeredByExecutor: Boolean): Seq[String] = withLock {
     // Do not change this code without running the K8s integration suites
-    val executorsToDecommission = executorsAndDecomInfo.filter { case (executorId, decomInfo) =>
-      CoarseGrainedSchedulerBackend.this.synchronized {

Review comment:
       Because now we not only change the mutable status but also call `scheduler.executorDecommission`. And `scheduler.executorDecommission` requires the lock on `TaskSchedulerImpl`. To avoid potential deadlock, we should use `withLock`.
   
   More context: Previously,  change the mutable status and call `scheduler.executorDecommission` happens in two separate functions. In this PR, we combined these two functions into one function. Therefore, those two operations now are within the same code block and need the `withLock` protection.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714544707






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696137003


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33556/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699882714


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33787/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696162528


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/33557/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696106376


   @holdenk, seems like the test failure wasn't caused by this PR. Dose your -1 at [here](https://issues.apache.org/jira/browse/SPARK-32937?focusedCommentId=17198867&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17198867) still stand?
   
   cc @dongjoon-hyun as well per #29751.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r493940745



##########
File path: core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
##########
@@ -61,13 +61,34 @@ private[deploy] object DeployMessages {
   }
 
   /**
+   * An internal message that used by Master itself, in order to handle the
+   * `DecommissionWorkersOnHosts` request from `MasterWebUI` asynchronously.
+   * @param ids A collection of Worker ids, which should be decommissioned.
+   */
+  case class DecommissionWorkers(ids: Seq[String]) extends DeployMessage
+
+  /**
+   * A message that sent from Master to Worker to decommission the Worker.
+   * It's used for the case where decommission is triggered at MasterWebUI.
+   *
+   * Note that decommission a Worker will cause all the executors on that Worker
+   * to be decommissioned as well.
+   */
+  object DecommissionWorker extends DeployMessage
+
+  /**
+   * A message that sent to the Worker itself when it receives PWR signal,

Review comment:
       Do you mean "sent by the worker to itself"?

##########
File path: core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
##########
@@ -61,13 +61,34 @@ private[deploy] object DeployMessages {
   }
 
   /**
+   * An internal message that used by Master itself, in order to handle the
+   * `DecommissionWorkersOnHosts` request from `MasterWebUI` asynchronously.
+   * @param ids A collection of Worker ids, which should be decommissioned.
+   */
+  case class DecommissionWorkers(ids: Seq[String]) extends DeployMessage
+
+  /**
+   * A message that sent from Master to Worker to decommission the Worker.
+   * It's used for the case where decommission is triggered at MasterWebUI.
+   *
+   * Note that decommission a Worker will cause all the executors on that Worker
+   * to be decommissioned as well.
+   */
+  object DecommissionWorker extends DeployMessage
+
+  /**
+   * A message that sent to the Worker itself when it receives PWR signal,
+   * indicating the Worker starts to decommission.

Review comment:
       I'm not sure about this parts meaning

##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -166,17 +171,6 @@ private[spark] class CoarseGrainedExecutorBackend(
       if (executor == null) {
         exitExecutor(1, "Received LaunchTask command but executor was null")
       } else {
-        if (decommissioned) {
-          val msg = "Asked to launch a task while decommissioned."
-          logError(msg)
-          driver match {
-            case Some(endpoint) =>
-              logInfo("Sending DecommissionExecutor to driver.")
-              endpoint.send(DecommissionExecutor(executorId, ExecutorDecommissionInfo(msg)))
-            case _ =>
-              logError("No registered driver to send Decommission to.")
-          }
-        }

Review comment:
       I don't think we should just take this out. async sends could fail, re-sending the message if we receive a request which indicates the master hasn't received our notification indicates we should resend.

##########
File path: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##########
@@ -272,10 +268,14 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
         removeWorker(workerId, host, message)
         context.reply(true)
 
-      case DecommissionExecutor(executorId, decommissionInfo) =>
-        logError(s"Received decommission executor message ${executorId}: ${decommissionInfo}.")
-        context.reply(decommissionExecutor(executorId, decommissionInfo,
-          adjustTargetNumExecutors = false))
+      case ExecutorDecommissioning(executorId) =>
+        logWarning(s"Received executor $executorId decommissioned message")

Review comment:
       Here might be where you break the test suite last time, so double check it.

##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -213,9 +207,17 @@ private[spark] class CoarseGrainedExecutorBackend(
       logInfo(s"Received tokens of ${tokenBytes.length} bytes")
       SparkHadoopUtil.get.addDelegationTokens(tokenBytes, env.conf)
 
-    case DecommissionSelf =>
-      logInfo("Received decommission self")
+    case DecommissionExecutor =>
       decommissionSelf()
+
+    case ExecutorSigPWRReceived =>
+      decommissionSelf()
+      if (driver.nonEmpty) {

Review comment:
       Doing this after decommissionSelf introduces a race condition, not too important one but given that it wasn't in the original flow I'd suggest swapping the order.

##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -70,7 +70,10 @@ private[deploy] class Worker(
   if (conf.get(config.DECOMMISSION_ENABLED)) {
     logInfo("Registering SIGPWR handler to trigger decommissioning.")
     SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-      "disabling worker decommission feature.")(decommissionSelf)
+      "disabling worker decommission feature.") {
+       self.send(WorkerSigPWRReceived)

Review comment:
       I think this might mean we return from handling the signal right away rather than waiting for decommissionSelf to be finished. Is this an intentional change?
   
   Also this will no longer report a decommissioning failure by signal return value, so may block pod deletion or other cleanup task longer than needed.

##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -668,8 +672,13 @@ private[deploy] class Worker(
       finishedApps += id
       maybeCleanupApplication(id)
 
-    case WorkerDecommission(_, _) =>
+    case DecommissionWorker =>
+      decommissionSelf()
+
+    case WorkerSigPWRReceived =>
       decommissionSelf()
+      // Tell master we starts decommissioning so it stops trying to launch executor/driver on us

Review comment:
       May I suggest "Tell the master that we are starting decommissioning so it stops trying to launch executor/driver on us"

##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -1809,7 +1809,9 @@ private[spark] class BlockManager(
     blocksToRemove.size
   }
 
-  def decommissionBlockManager(): Unit = synchronized {
+  def decommissionBlockManager(): Unit = storageEndpoint.ask(DecommissionBlockManager)

Review comment:
       Why?

##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -213,9 +207,17 @@ private[spark] class CoarseGrainedExecutorBackend(
       logInfo(s"Received tokens of ${tokenBytes.length} bytes")
       SparkHadoopUtil.get.addDelegationTokens(tokenBytes, env.conf)
 
-    case DecommissionSelf =>
-      logInfo("Received decommission self")
+    case DecommissionExecutor =>
       decommissionSelf()
+
+    case ExecutorSigPWRReceived =>
+      decommissionSelf()
+      if (driver.nonEmpty) {
+        // Tell driver we starts decommissioning so it stops trying to schedule us

Review comment:
       s/we starts/we are starting/
   ?

##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -79,12 +79,17 @@ private[spark] class CoarseGrainedExecutorBackend(
    */
   private[executor] val taskResources = new mutable.HashMap[Long, Map[String, ResourceInformation]]
 
-  @volatile private var decommissioned = false
+  private var decommissioned = false
 
   override def onStart(): Unit = {
-    logInfo("Registering PWR handler.")
-    SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-      "disabling decommission feature.")(decommissionSelf)
+    if (env.conf.get(DECOMMISSION_ENABLED)) {
+      logInfo("Registering PWR handler to trigger decommissioning.")
+      SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
+      "disabling executor decommission feature.") {
+        self.send(ExecutorSigPWRReceived)
+        true
+      }
+    }

Review comment:
       Similar comments as in Worker.scala




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700361928


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33813/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714611969






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714562853






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714292042


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34737/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714340734


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34747/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696469625


   > I think we should not commit this with the K8s test being broken.
   
   We don't. That's also why I added `[K8S]` tag in the PR title. And feel free to leave comments, I can address them in followups.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696205240


   **[Test build #128934 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128934/testReport)** for PR 29817 at commit [`5ca0fe8`](https://github.com/apache/spark/commit/5ca0fe88f5b651d4ead5abf157ec70cbd77fac13).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696100181


   Bring this back since it isn't the original commit that breaks the K8S test. As the PR https://github.com/apache/spark/pull/29751 merged before this already failed the K8S tests


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698064374


   If it helps I called out the part of the PR I believe most likely responsible for that during my code review already.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714361195


   **[Test build #130148 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130148/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714495723


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714279682


   **[Test build #130130 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130130/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).
    * This patch **fails due to an unknown error code, -9**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494435828



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -213,9 +207,17 @@ private[spark] class CoarseGrainedExecutorBackend(
       logInfo(s"Received tokens of ${tokenBytes.length} bytes")
       SparkHadoopUtil.get.addDelegationTokens(tokenBytes, env.conf)
 
-    case DecommissionSelf =>
-      logInfo("Received decommission self")
+    case DecommissionExecutor =>
       decommissionSelf()
+
+    case ExecutorSigPWRReceived =>
+      decommissionSelf()
+      if (driver.nonEmpty) {

Review comment:
       So we don’t ask the driver to stop scheduling jobs on us first, and the driver could ask us to run a job while we are part way through decommissioning. This won’t result in a failure because well accept the job but it will slow down the decommissioning. So swap the order of these two.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696492521


   @holdenk, why don't you take a look for the test failure since it blocks all changes in decommission in k8s, and you were involved mainly in the development there?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714292085


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34737/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698071835


   > This PR in its original broke another one of the conditions.
   
   Except for the breaking of the log(like you just fixed), what other conditions this PR breaks?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706821615


   **[Test build #129652 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129652/testReport)** for PR 29817 at commit [`17bbbd9`](https://github.com/apache/spark/commit/17bbbd9461a898c898b87635e4167ef5e01360d2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700367536


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699989257






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r502913421



##########
File path: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##########
@@ -465,72 +464,50 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
    * @param executorsAndDecomInfo Identifiers of executors & decommission info.
    * @param adjustTargetNumExecutors whether the target number of executors will be adjusted down
    *                                 after these executors have been decommissioned.
+   * @param triggeredByExecutor whether the decommission is triggered at executor.
    * @return the ids of the executors acknowledged by the cluster manager to be removed.
    */
   override def decommissionExecutors(
       executorsAndDecomInfo: Array[(String, ExecutorDecommissionInfo)],
-      adjustTargetNumExecutors: Boolean): Seq[String] = {
-
+      adjustTargetNumExecutors: Boolean,
+      triggeredByExecutor: Boolean): Seq[String] = withLock {

Review comment:
       > Not 100% sure, but I do remember that the locking mechanism withLock did manage to deadlock before
   
   Yeah. We need to avoid the deadlock now because we need both CoarseGrainedSchedulerBackend's lock and TaskSchedulerImpl's lock within the same code block. 
   
   
   > it might be good to add a logging statement in here that verifies this block is called in the integration test.
   
   Updated the logs and the integration test.
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706838186


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34256/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706722260






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696106376


   @holdenk, seems like the test failure wasn't caused by this PR. Dose your -1 at [here](https://issues.apache.org/jira/browse/SPARK-32937?focusedCommentId=17198867&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17198867) still stand?
   
   cc @dongjoon-hyun as well per #29408.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-701460997


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33894/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699965753


   **[Test build #129182 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129182/testReport)** for PR 29817 at commit [`e3ca2af`](https://github.com/apache/spark/commit/e3ca2aff082ecaf2850030cab56fe083d28142f2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714424285


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34755/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699610093


   @Ngone51, shall we add a test https://github.com/apache/spark/pull/29817#issuecomment-698438316 and fix the test as requested? Seems like otherwise good to go.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714446595






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r509853800



##########
File path: core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
##########
@@ -61,13 +61,34 @@ private[deploy] object DeployMessages {
   }
 
   /**
+   * An internal message that used by Master itself, in order to handle the
+   * `DecommissionWorkersOnHosts` request from `MasterWebUI` asynchronously.
+   * @param ids A collection of Worker ids, which should be decommissioned.
+   */
+  case class DecommissionWorkers(ids: Seq[String]) extends DeployMessage
+
+  /**
+   * A message that sent from Master to Worker to decommission the Worker.
+   * It's used for the case where decommission is triggered at MasterWebUI.
+   *
+   * Note that decommission a Worker will cause all the executors on that Worker
+   * to be decommissioned as well.
+   */
+  object DecommissionWorker extends DeployMessage
+
+  /**
+   * A message that sent by the Worker to itself when it receives PWR signal,
+   * indicating the Worker starts to decommission.
+   */
+  object WorkerSigPWRReceived extends DeployMessage
+
+  /**
+   * A message sent from Worker to Master to tell Master that the Worker has started
+   * decommissioning. It's used for the case where decommission is triggered at Worker.
+   *
    * @param id the worker id
-   * @param worker the worker endpoint ref

Review comment:
       shall we keep the param doc for `workerRef`?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706702375


   **[Test build #129640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129640/testReport)** for PR 29817 at commit [`0064ba8`](https://github.com/apache/spark/commit/0064ba862196539726cfb07ab9243ec3c149b871).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 edited a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 edited a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696100181


   Bring this back since it isn't the original commit that breaks the K8S test. As the PR https://github.com/apache/spark/pull/29751 merged before this(https://github.com/apache/spark/pull/29722) already failed the K8S tests


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714627622






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706721962


   **[Test build #129640 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129640/testReport)** for PR 29817 at commit [`0064ba8`](https://github.com/apache/spark/commit/0064ba862196539726cfb07ab9243ec3c149b871).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696162518


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 edited a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 edited a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696100181


   Bring this back since it isn't the original commit that breaks the K8S test. As the PR https://github.com/apache/spark/pull/29751 merged before this(https://github.com/apache/spark/pull/29722) already failed the K8S tests


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494003455



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -213,9 +207,17 @@ private[spark] class CoarseGrainedExecutorBackend(
       logInfo(s"Received tokens of ${tokenBytes.length} bytes")
       SparkHadoopUtil.get.addDelegationTokens(tokenBytes, env.conf)
 
-    case DecommissionSelf =>
-      logInfo("Received decommission self")
+    case DecommissionExecutor =>
       decommissionSelf()
+
+    case ExecutorSigPWRReceived =>
+      decommissionSelf()
+      if (driver.nonEmpty) {

Review comment:
       What's race condition?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700403260


   **[Test build #129198 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129198/testReport)** for PR 29817 at commit [`e3ca2af`](https://github.com/apache/spark/commit/e3ca2aff082ecaf2850030cab56fe083d28142f2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714286400


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699978422


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33795/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714497664


   **[Test build #130159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130159/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706821615


   **[Test build #129652 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129652/testReport)** for PR 29817 at commit [`17bbbd9`](https://github.com/apache/spark/commit/17bbbd9461a898c898b87635e4167ef5e01360d2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706718702


   cc @holdenk for taking another look. Thanks!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696255949






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] MaxGekk commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
MaxGekk commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r555211907



##########
File path: core/src/test/scala/org/apache/spark/storage/BlockManagerDecommissionIntegrationSuite.scala
##########
@@ -40,6 +40,46 @@ class BlockManagerDecommissionIntegrationSuite extends SparkFunSuite with LocalS
   val TaskEnded = "TASK_ENDED"
   val JobEnded = "JOB_ENDED"
 
+  Seq(false, true).foreach { isEnabled =>
+    test(s"SPARK-32850: BlockManager decommission should respect the configuration " +
+      s"(enabled=${isEnabled})") {
+      val conf = new SparkConf()
+        .setAppName("test-blockmanager-decommissioner")
+        .setMaster("local-cluster[2, 1, 1024]")
+        .set(config.DECOMMISSION_ENABLED, true)
+        .set(config.STORAGE_DECOMMISSION_ENABLED, isEnabled)
+      sc = new SparkContext(conf)
+      TestUtils.waitUntilExecutorsUp(sc, 2, 6000)

Review comment:
       I got test failure in my PR #31131 (the PR is not related to the test I believe):
   ```
   [info] BlockManagerDecommissionIntegrationSuite:
   [info] - SPARK-32850: BlockManager decommission should respect the configuration (enabled=false) *** FAILED *** (6 seconds, 165 milliseconds)
   [info]   java.util.concurrent.TimeoutException: Can't find 2 executors before 6000 milliseconds elapsed
   [info]   at org.apache.spark.TestUtils$.waitUntilExecutorsUp(TestUtils.scala:374)
   [info]   at org.apache.spark.storage.BlockManagerDecommissionIntegrationSuite.$anonfun$new$2(BlockManagerDecommissionIntegrationSuite.scala:52)
   [info]   at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   ```




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494003110



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -166,17 +171,6 @@ private[spark] class CoarseGrainedExecutorBackend(
       if (executor == null) {
         exitExecutor(1, "Received LaunchTask command but executor was null")
       } else {
-        if (decommissioned) {
-          val msg = "Asked to launch a task while decommissioned."
-          logError(msg)
-          driver match {
-            case Some(endpoint) =>
-              logInfo("Sending DecommissionExecutor to driver.")
-              endpoint.send(DecommissionExecutor(executorId, ExecutorDecommissionInfo(msg)))
-            case _ =>
-              logError("No registered driver to send Decommission to.")
-          }
-        }

Review comment:
       First, we use `askSync` to send decommission notice to the driver whenever it needs(see `ExecutorSigPWRReceived`). Second, even if driver receives the decommission notice successfully, there still could be LaunchTask request due to the async between LaunchTask and decommission notice. Third, this part also uses async send,  so we still can not ensure the decommission notice is received by driver successfully.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494003752



##########
File path: core/src/main/scala/org/apache/spark/deploy/DeployMessage.scala
##########
@@ -61,13 +61,34 @@ private[deploy] object DeployMessages {
   }
 
   /**
+   * An internal message that used by Master itself, in order to handle the
+   * `DecommissionWorkersOnHosts` request from `MasterWebUI` asynchronously.
+   * @param ids A collection of Worker ids, which should be decommissioned.
+   */
+  case class DecommissionWorkers(ids: Seq[String]) extends DeployMessage
+
+  /**
+   * A message that sent from Master to Worker to decommission the Worker.
+   * It's used for the case where decommission is triggered at MasterWebUI.
+   *
+   * Note that decommission a Worker will cause all the executors on that Worker
+   * to be decommissioned as well.
+   */
+  object DecommissionWorker extends DeployMessage
+
+  /**
+   * A message that sent to the Worker itself when it receives PWR signal,

Review comment:
       yeah




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r496321331



##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -70,7 +70,10 @@ private[deploy] class Worker(
   if (conf.get(config.DECOMMISSION_ENABLED)) {
     logInfo("Registering SIGPWR handler to trigger decommissioning.")
     SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-      "disabling worker decommission feature.")(decommissionSelf)
+      "disabling worker decommission feature.") {
+       self.send(WorkerSigPWRReceived)

Review comment:
       I do, I think if the signal is unhandled then the process will be killed immediately. If we think of decommissioning/graceful shut down I believe that behavior is desirable, since if we can't shut down gracefully the least we can do is exit quickly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700353978


   **[Test build #129198 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129198/testReport)** for PR 29817 at commit [`e3ca2af`](https://github.com/apache/spark/commit/e3ca2aff082ecaf2850030cab56fe083d28142f2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706875853






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696100181






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706874710


   **[Test build #129652 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129652/testReport)** for PR 29817 at commit [`17bbbd9`](https://github.com/apache/spark/commit/17bbbd9461a898c898b87635e4167ef5e01360d2).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714445728


   **[Test build #130148 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130148/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696137031






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706838194






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714407276


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34755/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r502913401



##########
File path: resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala
##########
@@ -41,9 +41,8 @@ private[spark] trait DecommissionSuite { k8sSuite: KubernetesSuite =>
       mainClass = "",
       expectedLogOnCompletion = Seq(
         "Finished waiting, stopping Spark",
-        "Received decommission executor message",
-        "Acknowledged decommissioning block manager",
-        ": Executor decommission.",
+        "Asking BlockManagers", "to decommissioning",
+        "Asking executor", "to decommissioning.",

Review comment:
       Updated to use only one "decommissioning".




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696125882


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33556/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714544707


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696136507


   Got it about -1 but what about we pushing this as is, and work on the test as followups? It's a bit odds that we reverted it for the reason this PR didn't cause, and ask more things before merging it in.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699914545


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/33787/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699936668






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699936668






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700349606


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696469625


   > I think we should not commit this with the K8s test being broken.
   
   We don't. That's also why I added `[K8S]` tag in the PR title. And feel free to leave comments, I can address them in followups.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700367347


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33813/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r502912206



##########
File path: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##########
@@ -465,72 +464,50 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
    * @param executorsAndDecomInfo Identifiers of executors & decommission info.
    * @param adjustTargetNumExecutors whether the target number of executors will be adjusted down
    *                                 after these executors have been decommissioned.
+   * @param triggeredByExecutor whether the decommission is triggered at executor.
    * @return the ids of the executors acknowledged by the cluster manager to be removed.
    */
   override def decommissionExecutors(
       executorsAndDecomInfo: Array[(String, ExecutorDecommissionInfo)],
-      adjustTargetNumExecutors: Boolean): Seq[String] = {
-
+      adjustTargetNumExecutors: Boolean,
+      triggeredByExecutor: Boolean): Seq[String] = withLock {
     // Do not change this code without running the K8s integration suites
-    val executorsToDecommission = executorsAndDecomInfo.filter { case (executorId, decomInfo) =>
-      CoarseGrainedSchedulerBackend.this.synchronized {
-        // Only bother decommissioning executors which are alive.
-        if (isExecutorActive(executorId)) {
-          executorsPendingDecommission(executorId) = decomInfo.workerHost
-          true
-        } else {
-          false
-        }
+    val executorsToDecommission = executorsAndDecomInfo.flatMap { case (executorId, decomInfo) =>
+      // Only bother decommissioning executors which are alive.
+      if (isExecutorActive(executorId)) {
+        scheduler.executorDecommission(executorId, decomInfo)
+        executorsPendingDecommission(executorId) = decomInfo.workerHost
+        Some(executorId)
+      } else {
+        None
       }
     }
 
     // If we don't want to replace the executors we are decommissioning
     if (adjustTargetNumExecutors) {
-      adjustExecutors(executorsToDecommission.map(_._1))
+      adjustExecutors(executorsToDecommission)
     }
 
-    executorsToDecommission.filter { case (executorId, decomInfo) =>
-      doDecommission(executorId, decomInfo)
-    }.map(_._1)
-  }
-
-  // Do not change this code without running the K8s integration suites
-  private def doDecommission(executorId: String,
-      decomInfo: ExecutorDecommissionInfo): Boolean = {
-
-    logInfo(s"Asking executor $executorId to decommissioning.")
-    scheduler.executorDecommission(executorId, decomInfo)
-    // Send decommission message to the executor (it could have originated on the executor
-    // but not necessarily).
-    CoarseGrainedSchedulerBackend.this.synchronized {
-      executorDataMap.get(executorId) match {
-        case Some(executorInfo) =>
-          executorInfo.executorEndpoint.send(DecommissionSelf)
-        case None =>
-          // Ignoring the executor since it is not registered.
-          logWarning(s"Attempted to decommission unknown executor $executorId.")
-          return false
-      }
+    // Mark those corresponding BlockManagers as decommissioned first before we sending
+    // decommission notification to executors. So, it's less likely to lead to the race
+    // condition where `getPeer` request from the decommissioned executor comes first
+    // before the BlockManagers are marked as decommissioned.
+    if (conf.get(STORAGE_DECOMMISSION_ENABLED)) {
+      logInfo(s"Asking BlockManagers on executors (${executorsToDecommission.mkString(", ")}) " +
+        s"to decommissioning.")
+      scheduler.sc.env.blockManager.master.decommissionBlockManagers(executorsToDecommission)
     }
-    logInfo(s"Asked executor $executorId to decommission.")
 
-    if (conf.get(STORAGE_DECOMMISSION_ENABLED)) {
-      try {
-        logInfo(s"Asking block manager corresponding to executor $executorId to decommission.")
-        scheduler.sc.env.blockManager.master.decommissionBlockManagers(Seq(executorId))
-      } catch {
-        case e: Exception =>
-          logError("Unexpected error during block manager " +
-            s"decommissioning for executor $executorId: ${e.toString}", e)
-          return false
+    if (!triggeredByExecutor) {

Review comment:
       We don't really migrate blocks when decommissioning BlockManagers above. We only mark them as being decommissioning at driver side. So I think the problem you mentioned won't exist.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700367539


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/33813/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-711607829


   @holdenk take a look?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696590670


   Can we find out which commit caused the test failure in the first place? We should either revert that commit, or fix it soon, as the test failure blocks others.
   
   Since this PR is resubmitted (although the revert is not necessary now given the test failure was already there), I think it's a good chance for @holdenk to take a closer look before re-merging. And I agree with @holdenk that we can't merge a PR when the related test is already broken. We should fix that first. @holdenk can you give some hints about it? I took a look at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33556/ , but I don't even see how the test failed. The output is very different from normal Spark tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700404808






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706875853






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696169281


   I think we should not commit this with the K8s test being broken. It’s in the same chunk of code and changes the logging string (although another PR also changed that string first too?). I do not believe this PR was appropriately tested when first merger given it changed decommissioning messages and did not run the decommission tests.
   
   For clarity: if you want to fix the tests in a separate PR that’s ok with me, but I would prefer not to commit this without passing integration testing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714181079


   @holdenk ?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714290360


   **[Test build #130140 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130140/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714192264






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696100375


   **[Test build #128934 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128934/testReport)** for PR 29817 at commit [`5ca0fe8`](https://github.com/apache/spark/commit/5ca0fe88f5b651d4ead5abf157ec70cbd77fac13).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706838194






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494439130



##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -70,7 +70,10 @@ private[deploy] class Worker(
   if (conf.get(config.DECOMMISSION_ENABLED)) {
     logInfo("Registering SIGPWR handler to trigger decommissioning.")
     SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-      "disabling worker decommission feature.")(decommissionSelf)
+      "disabling worker decommission feature.") {
+       self.send(WorkerSigPWRReceived)

Review comment:
       Can you look into what the difference of this behavior might cause at the system level and then tell me if that’s a desired change? I’m ok with us making changes here, I just want us to be intentional and know if we need to test the change and it seems like this change was incidental.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714292069






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714497664


   **[Test build #130159 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130159/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700046661


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 closed pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 closed pull request #29817:
URL: https://github.com/apache/spark/pull/29817


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r497546446



##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -70,7 +70,10 @@ private[deploy] class Worker(
   if (conf.get(config.DECOMMISSION_ENABLED)) {
     logInfo("Registering SIGPWR handler to trigger decommissioning.")
     SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-      "disabling worker decommission feature.")(decommissionSelf)
+      "disabling worker decommission feature.") {
+       self.send(WorkerSigPWRReceived)

Review comment:
       Updated, but please note that I only updated for the executor's case since Worker's case always returns `true` before.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714528189


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34766/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714279909






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696255949






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-701461036


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/33894/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-708379134


   kindly ping @holdenk 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714610489


   **[Test build #130159 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130159/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706702375


   **[Test build #129640 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129640/testReport)** for PR 29817 at commit [`0064ba8`](https://github.com/apache/spark/commit/0064ba862196539726cfb07ab9243ec3c149b871).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698438316


   I’m mostly concerned with the change around how the storage decommissioning is being done now, I’d like to see some tests that the flow from the master to the worker results in storage decommissioning.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700367536






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-701421787


   **[Test build #129277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129277/testReport)** for PR 29817 at commit [`af9dbdd`](https://github.com/apache/spark/commit/af9dbdd4dc96e5e3f2ed2e9e21bb803081337df9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700404808






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700045129


   **[Test build #129182 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129182/testReport)** for PR 29817 at commit [`e3ca2af`](https://github.com/apache/spark/commit/e3ca2aff082ecaf2850030cab56fe083d28142f2).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494434701



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -166,17 +171,6 @@ private[spark] class CoarseGrainedExecutorBackend(
       if (executor == null) {
         exitExecutor(1, "Received LaunchTask command but executor was null")
       } else {
-        if (decommissioned) {
-          val msg = "Asked to launch a task while decommissioned."
-          logError(msg)
-          driver match {
-            case Some(endpoint) =>
-              logInfo("Sending DecommissionExecutor to driver.")
-              endpoint.send(DecommissionExecutor(executorId, ExecutorDecommissionInfo(msg)))
-            case _ =>
-              logError("No registered driver to send Decommission to.")
-          }
-        }

Review comment:
       Right, so we should resend the notice then right?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699868395


   @holdenk Thanks for the review! I've addressed your most comments. And tests are updated and added in fa04b493324ea52698bed7ce15795530052b37be and
   9d0f36d73e5750267ec2f33e7d964fc09778767f . 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699914518


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33787/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696106376


   @holdenk, seems like the test failure wasn't caused by this PR. Dose your -1 at [here](https://issues.apache.org/jira/browse/SPARK-32937?focusedCommentId=17198867&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17198867) still stand?
   
   cc @dongjoon-hyun as well per #29751.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696137031


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714446603


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130148/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714354893


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34747/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696137042


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/33556/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696254503


   **[Test build #128935 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128935/testReport)** for PR 29817 at commit [`15f6085`](https://github.com/apache/spark/commit/15f6085d24bd26fc803831a91a424c94e3bf10f6).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds the following public classes _(experimental)_:
     * `  case class DecommissionWorkers(ids: Seq[String]) extends DeployMessage`
     * `  case class WorkerDecommissioning(id: String, workerRef: RpcEndpointRef) extends DeployMessage`
     * `  case class ExecutorDecommissioning(executorId: String) extends CoarseGrainedClusterMessage`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714562823


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34770/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714424328


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34755/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698062962


   > but that's only half true, this PR just broke the test some more.
   
   @holdenk, you're kidding right? There was only one test failure that was not caused by this PR in K8S tests. That test was just fixed by you. How come this PR broke more tests? Can you be more explicit on that? Which tests were more broken, and how did you test?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696100375






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696100375






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714562853






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494005398



##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -70,7 +70,10 @@ private[deploy] class Worker(
   if (conf.get(config.DECOMMISSION_ENABLED)) {
     logInfo("Registering SIGPWR handler to trigger decommissioning.")
     SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-      "disabling worker decommission feature.")(decommissionSelf)
+      "disabling worker decommission feature.") {
+       self.send(WorkerSigPWRReceived)

Review comment:
       This's Worker, I guess you cares more about the executor? In Worker, `decommissionSelf` always returns true. and In exeutor, there's a change to return false to fail the decommissionSelf but seems rarely happen. If you would insist on returning the value, I think we can use askSync instead.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696492521


   @holdenk, why don't you take a look for a test failure since it blocks all changes in decommission in k8s, and you were involved mainly in the development there?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714354919






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494434701



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -166,17 +171,6 @@ private[spark] class CoarseGrainedExecutorBackend(
       if (executor == null) {
         exitExecutor(1, "Received LaunchTask command but executor was null")
       } else {
-        if (decommissioned) {
-          val msg = "Asked to launch a task while decommissioned."
-          logError(msg)
-          driver match {
-            case Some(endpoint) =>
-              logInfo("Sending DecommissionExecutor to driver.")
-              endpoint.send(DecommissionExecutor(executorId, ExecutorDecommissionInfo(msg)))
-            case _ =>
-              logError("No registered driver to send Decommission to.")
-          }
-        }

Review comment:
       Right, so we should resend the notice then right?

##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -213,9 +207,17 @@ private[spark] class CoarseGrainedExecutorBackend(
       logInfo(s"Received tokens of ${tokenBytes.length} bytes")
       SparkHadoopUtil.get.addDelegationTokens(tokenBytes, env.conf)
 
-    case DecommissionSelf =>
-      logInfo("Received decommission self")
+    case DecommissionExecutor =>
       decommissionSelf()
+
+    case ExecutorSigPWRReceived =>
+      decommissionSelf()
+      if (driver.nonEmpty) {

Review comment:
       So we don’t ask the driver to stop scheduling jobs on us first, and the driver could ask us to run a job while we are part way through decommissioning. This won’t result in a failure because well accept the job but it will slow down the decommissioning. So swap the order of these two.

##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -1809,7 +1809,9 @@ private[spark] class BlockManager(
     blocksToRemove.size
   }
 
-  def decommissionBlockManager(): Unit = synchronized {
+  def decommissionBlockManager(): Unit = storageEndpoint.ask(DecommissionBlockManager)

Review comment:
       Why did you make this change?

##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -70,7 +70,10 @@ private[deploy] class Worker(
   if (conf.get(config.DECOMMISSION_ENABLED)) {
     logInfo("Registering SIGPWR handler to trigger decommissioning.")
     SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-      "disabling worker decommission feature.")(decommissionSelf)
+      "disabling worker decommission feature.") {
+       self.send(WorkerSigPWRReceived)

Review comment:
       Can you look into what the difference of this behavior might cause at the system level and then tell me if that’s a desired change? I’m ok with us making changes here, I just want us to be intentional and know if we need to test the change and it seems like this change was incidental.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699914533


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-701537342






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r509858066



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -213,11 +204,31 @@ private[spark] class CoarseGrainedExecutorBackend(
       logInfo(s"Received tokens of ${tokenBytes.length} bytes")
       SparkHadoopUtil.get.addDelegationTokens(tokenBytes, env.conf)
 
-    case DecommissionSelf =>
-      logInfo("Received decommission self")
+    case DecommissionExecutor =>
       decommissionSelf()
   }
 
+  override def receiveAndReply(context: RpcCallContext): PartialFunction[Any, Unit] = {
+    case ExecutorSigPWRReceived =>
+      var driverNotified = false
+      try {
+        driver.foreach { driverRef =>
+          // Tell driver that we are starting decommissioning so it stops trying to schedule us
+          driverNotified = driverRef.askSync[Boolean](ExecutorDecommissioning(executorId))
+          if (driverNotified) decommissionSelf()

Review comment:
       This part is new. I think it makes sense to skip executor decommissioning if the driver side rejects to do so.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714415856


   **[Test build #130140 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130140/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706709955






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700368043


   Tests are still failing @Ngone51 happy to take another look once the K8s integration tests are passing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714544670


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34766/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714256362


   **[Test build #130130 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130130/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714192264


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494005838



##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -1809,7 +1809,9 @@ private[spark] class BlockManager(
     blocksToRemove.size
   }
 
-  def decommissionBlockManager(): Unit = synchronized {
+  def decommissionBlockManager(): Unit = storageEndpoint.ask(DecommissionBlockManager)

Review comment:
       ?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696133813


   **[Test build #128935 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128935/testReport)** for PR 29817 at commit [`15f6085`](https://github.com/apache/spark/commit/15f6085d24bd26fc803831a91a424c94e3bf10f6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696100375


   **[Test build #128934 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128934/testReport)** for PR 29817 at commit [`5ca0fe8`](https://github.com/apache/spark/commit/5ca0fe88f5b651d4ead5abf157ec70cbd77fac13).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698080541


   > That being said I still have concerns this PR is not sufficiently tested, can you add some more tests for the new flows you've introduced?
   
   There's only one new flow that is from Master to Worker. I can update the existing test by verifying Worker's decommission status... What kind of other concerns do you have? Could you elaborate more? So I can improve the PR accordingly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696162518






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714279909






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714359495


   retest this please


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706235848


   @holdenk Sorry for the delay. I had taken some days off recently. I'm addressing your comments right now and debugging the integration test(the decommissioned executor doesn't migrate shuffle blocks as expected, so I'm still investigating it.)
   
   Thank you for the further review :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714278352


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34737/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714185475


   **[Test build #130119 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130119/testReport)** for PR 29817 at commit [`17bbbd9`](https://github.com/apache/spark/commit/17bbbd9461a898c898b87635e4167ef5e01360d2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699989280


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/33795/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714424315






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714233903


   **[Test build #130119 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130119/testReport)** for PR 29817 at commit [`17bbbd9`](https://github.com/apache/spark/commit/17bbbd9461a898c898b87635e4167ef5e01360d2).
    * This patch **fails Spark unit tests**.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-701537342






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] dongjoon-hyun commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714184178


   Retest this please.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699857383


   **[Test build #129172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129172/testReport)** for PR 29817 at commit [`23bfdf5`](https://github.com/apache/spark/commit/23bfdf5a5bcc05b4c8066593aa1607dbe7dbe5b0).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700046661






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698024268


   I really like the idea of simplifying the RPC message flow, thanks for taking this on @Ngone51 and I'm sorry the code here is so brittle to these types of changes (the K8s integration tests are kind of limited).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706709955






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r495777455



##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -1809,7 +1809,9 @@ private[spark] class BlockManager(
     blocksToRemove.size
   }
 
-  def decommissionBlockManager(): Unit = synchronized {
+  def decommissionBlockManager(): Unit = storageEndpoint.ask(DecommissionBlockManager)

Review comment:
       If I unsterstand your question correctly:
   
   We didn't really change the `decommissionBlockManager`. The original `decommissionBlockManager` has been renamed to `decommissionSelf` to avoid the naming collision.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r495769784



##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -70,7 +70,10 @@ private[deploy] class Worker(
   if (conf.get(config.DECOMMISSION_ENABLED)) {
     logInfo("Registering SIGPWR handler to trigger decommissioning.")
     SignalUtils.register("PWR", "Failed to register SIGPWR handler - " +
-      "disabling worker decommission feature.")(decommissionSelf)
+      "disabling worker decommission feature.") {
+       self.send(WorkerSigPWRReceived)

Review comment:
       The return value of the signal handling decides whether we should forward the signal to the other handlers. If `true`, no other handlers will handle the `PWR` signal except ourselves. If `false`, we will handle it (for decommission) and other handlers will handle it too.  Do you expect other handlers to continue handling the `SIGPWR` when the system isn't really experiencing a power failure?
   




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r509857454



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -166,17 +171,6 @@ private[spark] class CoarseGrainedExecutorBackend(
       if (executor == null) {
         exitExecutor(1, "Received LaunchTask command but executor was null")
       } else {
-        if (decommissioned) {
-          val msg = "Asked to launch a task while decommissioned."
-          logError(msg)
-          driver match {
-            case Some(endpoint) =>
-              logInfo("Sending DecommissionExecutor to driver.")
-              endpoint.send(DecommissionExecutor(executorId, ExecutorDecommissionInfo(msg)))
-            case _ =>
-              logError("No registered driver to send Decommission to.")
-          }
-        }

Review comment:
       Yea I don't see the point of resending the notice to the driver, especially in this race condition. If we want to make sure the driver is noticed, we should design a mechanism for it, instead of doing it here randomly.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699989257


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714417601






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714361195


   **[Test build #130148 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130148/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696133122






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696137031






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699988761


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33795/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon closed pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon closed pull request #29817:
URL: https://github.com/apache/spark/pull/29817


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714290360


   **[Test build #130140 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130140/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r495763742



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -213,9 +207,17 @@ private[spark] class CoarseGrainedExecutorBackend(
       logInfo(s"Received tokens of ${tokenBytes.length} bytes")
       SparkHadoopUtil.get.addDelegationTokens(tokenBytes, env.conf)
 
-    case DecommissionSelf =>
-      logInfo("Received decommission self")
+    case DecommissionExecutor =>
       decommissionSelf()
+
+    case ExecutorSigPWRReceived =>
+      decommissionSelf()
+      if (driver.nonEmpty) {

Review comment:
       Addressed




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698064084


   I’m not joking. The test has multiple conditions. Another PR broke one of the conditions. This PR in its original broke another one of the conditions. It was reported in the same test because K8s integration tests focus on the integration so they cover multiple pieces. If you want I’m happy to do a video call (and I can be flex on my timezone if that’s a constraint) and screen share so we can discuss the details.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714417601






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696590670


   Can we find out which commit caused the test failure in the first place? We should either revert that commit, or fix it soon, as the test failure blocks others.
   
   Since this PR is resubmitted (although the revert is not necessary now given the test failure was already there), I think it's a good chance for @holdenk to take a closer look before re-merging. And I agree with @holdenk that we can't merge a PR when the related test is already broken. We should fix that first. @holdenk can you give some hints about it? I took a look at https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33556/ , but I don't even see how the test failed. The output is very different from normal Spark tests.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706254628


   Taking days off is healthy and great! We're not in a rush on this so please take your time. My offer of help with debugging tests was not intended to make you feel like you needed to work on this right away, just I've been in the situation where I'm stuck on a test failure I can't figure out in the past. Let me know when you want me to take another look :) 


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696206713






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699857383


   **[Test build #129172 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129172/testReport)** for PR 29817 at commit [`23bfdf5`](https://github.com/apache/spark/commit/23bfdf5a5bcc05b4c8066593aa1607dbe7dbe5b0).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-701448365


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/33894/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714911468


   Merged to master.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714922479


   thanks all!!


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714517755


   **[Test build #130163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130163/testReport)** for PR 29817 at commit [`3c1e033`](https://github.com/apache/spark/commit/3c1e033089e392066425139d7ccb52cd501dcc31).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700353978


   **[Test build #129198 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129198/testReport)** for PR 29817 at commit [`e3ca2af`](https://github.com/apache/spark/commit/e3ca2aff082ecaf2850030cab56fe083d28142f2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon edited a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon edited a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696492521


   @holdenk, why don't you take a look for the test failure since it blocks all changes in decommission in k8s, and you were involved mainly in the development there?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706709950


   Kubernetes integration test status success
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34244/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714234553


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/130119/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696492521


   @holdenk, why don't you take a look for a test failure since it blocks all changes in decommission in k8s, and you were involved mainly in the development there?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r502911491



##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -668,8 +672,14 @@ private[deploy] class Worker(
       finishedApps += id
       maybeCleanupApplication(id)
 
-    case WorkerDecommission(_, _) =>
+    case DecommissionWorker =>
+      decommissionSelf()
+
+    case WorkerSigPWRReceived =>
       decommissionSelf()
+      // Tell the Master that we are starting decommissioning
+      // so it stops trying to launch executor/driver on us
+      sendToMaster(WorkerDecommissioning(workerId, self))

Review comment:
       I added the param previously but removed it later in order to keep `decommissionSelf` unchanged according to the comment: https://github.com/apache/spark/pull/29722#discussion_r487855223




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706832739


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34256/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696106376






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714192268






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714354919






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699935766


   **[Test build #129172 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129172/testReport)** for PR 29817 at commit [`23bfdf5`](https://github.com/apache/spark/commit/23bfdf5a5bcc05b4c8066593aa1607dbe7dbe5b0).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714517755


   **[Test build #130163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130163/testReport)** for PR 29817 at commit [`3c1e033`](https://github.com/apache/spark/commit/3c1e033089e392066425139d7ccb52cd501dcc31).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698438316






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714446595


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714234549






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714544725


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/34766/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714626368


   **[Test build #130163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130163/testReport)** for PR 29817 at commit [`3c1e033`](https://github.com/apache/spark/commit/3c1e033089e392066425139d7ccb52cd501dcc31).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-701421787


   **[Test build #129277 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129277/testReport)** for PR 29817 at commit [`af9dbdd`](https://github.com/apache/spark/commit/af9dbdd4dc96e5e3f2ed2e9e21bb803081337df9).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-701536315


   **[Test build #129277 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129277/testReport)** for PR 29817 at commit [`af9dbdd`](https://github.com/apache/spark/commit/af9dbdd4dc96e5e3f2ed2e9e21bb803081337df9).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696133813


   **[Test build #128935 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/128935/testReport)** for PR 29817 at commit [`15f6085`](https://github.com/apache/spark/commit/15f6085d24bd26fc803831a91a424c94e3bf10f6).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-700046669


   Test FAILed.
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/129182/
   Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 edited a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 edited a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698071835


   > This PR in its original broke another one of the conditions.
   
   Except for the breaking of the log(like you just fixed), what other conditions does this PR break?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] HyukjinKwon commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-711656428


   Let me merge this in few days if there are no more comments assuming lazy consensus.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714627622






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-701461022






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714611969






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698080541


   > That being said I still have concerns this PR is not sufficiently tested, can you add some more tests for the new flows you've introduced?
   
   There's only one new flow that is from Master to Worker. I can update the existing test by verifying Worker's decommission status... What kind of other concerns do you have? Could you elaborate more? So I can improve the PR accordingly.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r494436106



##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -1809,7 +1809,9 @@ private[spark] class BlockManager(
     blocksToRemove.size
   }
 
-  def decommissionBlockManager(): Unit = synchronized {
+  def decommissionBlockManager(): Unit = storageEndpoint.ask(DecommissionBlockManager)

Review comment:
       Why did you make this change?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r502911793



##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
##########
@@ -163,8 +163,7 @@ class BlockManagerMasterEndpoint(
       context.reply(true)
 
     case DecommissionBlockManagers(executorIds) =>
-      val bmIds = executorIds.flatMap(blockManagerIdByExecutor.get)
-      decommissionBlockManagers(bmIds)
+      decommissioningBlockManagerSet ++= executorIds.flatMap(blockManagerIdByExecutor.get)

Review comment:
       Added.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696133122


   The -1 is only on the PR that I mentioned it on (the refactoring with many sub classes) which still stands. I would like to review this PR more though since I think we probably need better test coverage of this change than it originally had. Sound good? Thanks for checking in about that :)


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714424315


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-698438844


   Also there are multiple log conditions, this PR broke one of them. Another PR had broken another one.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714256362


   **[Test build #130130 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130130/testReport)** for PR 29817 at commit [`e625051`](https://github.com/apache/spark/commit/e625051c1ae1d9b7878710aaa4d039b8f577a2c1).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714185475


   **[Test build #130119 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130119/testReport)** for PR 29817 at commit [`17bbbd9`](https://github.com/apache/spark/commit/17bbbd9461a898c898b87635e4167ef5e01360d2).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-714541703


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34770/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r509853964



##########
File path: core/src/main/scala/org/apache/spark/deploy/master/Master.scala
##########
@@ -245,15 +245,27 @@ private[deploy] class Master(
       logError("Leadership has been revoked -- master shutting down.")
       System.exit(0)
 
-    case WorkerDecommission(id, workerRef) =>
-      logInfo("Recording worker %s decommissioning".format(id))
+    case WorkerDecommissioning(id, workerRef) =>
       if (state == RecoveryState.STANDBY) {
         workerRef.send(MasterInStandby)
       } else {
         // We use foreach since get gives us an option and we can skip the failures.
-        idToWorker.get(id).foreach(decommissionWorker)
+        idToWorker.get(id).foreach(w => decommissionWorker(w))

Review comment:
       unnecessary change?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696206713






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-699914533






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] Ngone51 commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r495773071



##########
File path: core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala
##########
@@ -166,17 +171,6 @@ private[spark] class CoarseGrainedExecutorBackend(
       if (executor == null) {
         exitExecutor(1, "Received LaunchTask command but executor was null")
       } else {
-        if (decommissioned) {
-          val msg = "Asked to launch a task while decommissioned."
-          logError(msg)
-          driver match {
-            case Some(endpoint) =>
-              logInfo("Sending DecommissionExecutor to driver.")
-              endpoint.send(DecommissionExecutor(executorId, ExecutorDecommissionInfo(msg)))
-            case _ =>
-              logError("No registered driver to send Decommission to.")
-          }
-        }

Review comment:
       No. Sorry if I didn't explain it clearly.
   
   We should already send decommission notice to the driver when `decommissioned = true`, which using `askSync`. If `askSync` still fail, I wouldn't expect another send would succeed. 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] cloud-fan commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r509859829



##########
File path: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##########
@@ -465,72 +464,50 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
    * @param executorsAndDecomInfo Identifiers of executors & decommission info.
    * @param adjustTargetNumExecutors whether the target number of executors will be adjusted down
    *                                 after these executors have been decommissioned.
+   * @param triggeredByExecutor whether the decommission is triggered at executor.
    * @return the ids of the executors acknowledged by the cluster manager to be removed.
    */
   override def decommissionExecutors(
       executorsAndDecomInfo: Array[(String, ExecutorDecommissionInfo)],
-      adjustTargetNumExecutors: Boolean): Seq[String] = {
-
+      adjustTargetNumExecutors: Boolean,
+      triggeredByExecutor: Boolean): Seq[String] = withLock {
     // Do not change this code without running the K8s integration suites
-    val executorsToDecommission = executorsAndDecomInfo.filter { case (executorId, decomInfo) =>
-      CoarseGrainedSchedulerBackend.this.synchronized {

Review comment:
       I have a concern about the scope of the lock. Previously we only hold the lock when we change the mutable status, but now the entire method is locked. Do we have a good reason?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-696137031






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706722260






----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] SparkQA commented on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-706707629


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34244/
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AmplabJenkins removed a comment on pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #29817:
URL: https://github.com/apache/spark/pull/29817#issuecomment-701461022


   Merged build finished. Test FAILed.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] holdenk commented on a change in pull request #29817: [SPARK-32850][CORE][K8S] Simplify the RPC message flow of decommission

Posted by GitBox <gi...@apache.org>.
holdenk commented on a change in pull request #29817:
URL: https://github.com/apache/spark/pull/29817#discussion_r500612097



##########
File path: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
##########
@@ -668,8 +672,14 @@ private[deploy] class Worker(
       finishedApps += id
       maybeCleanupApplication(id)
 
-    case WorkerDecommission(_, _) =>
+    case DecommissionWorker =>
+      decommissionSelf()
+
+    case WorkerSigPWRReceived =>
       decommissionSelf()
+      // Tell the Master that we are starting decommissioning
+      // so it stops trying to launch executor/driver on us
+      sendToMaster(WorkerDecommissioning(workerId, self))

Review comment:
       To make this code testable I'd move it down to inside of decommissionSelf and give decommissionSelf a param for if the decom originated locally.

##########
File path: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##########
@@ -465,72 +464,50 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
    * @param executorsAndDecomInfo Identifiers of executors & decommission info.
    * @param adjustTargetNumExecutors whether the target number of executors will be adjusted down
    *                                 after these executors have been decommissioned.
+   * @param triggeredByExecutor whether the decommission is triggered at executor.
    * @return the ids of the executors acknowledged by the cluster manager to be removed.
    */
   override def decommissionExecutors(
       executorsAndDecomInfo: Array[(String, ExecutorDecommissionInfo)],
-      adjustTargetNumExecutors: Boolean): Seq[String] = {
-
+      adjustTargetNumExecutors: Boolean,
+      triggeredByExecutor: Boolean): Seq[String] = withLock {
     // Do not change this code without running the K8s integration suites
-    val executorsToDecommission = executorsAndDecomInfo.filter { case (executorId, decomInfo) =>
-      CoarseGrainedSchedulerBackend.this.synchronized {
-        // Only bother decommissioning executors which are alive.
-        if (isExecutorActive(executorId)) {
-          executorsPendingDecommission(executorId) = decomInfo.workerHost
-          true
-        } else {
-          false
-        }
+    val executorsToDecommission = executorsAndDecomInfo.flatMap { case (executorId, decomInfo) =>
+      // Only bother decommissioning executors which are alive.
+      if (isExecutorActive(executorId)) {
+        scheduler.executorDecommission(executorId, decomInfo)
+        executorsPendingDecommission(executorId) = decomInfo.workerHost
+        Some(executorId)
+      } else {
+        None
       }
     }
 
     // If we don't want to replace the executors we are decommissioning
     if (adjustTargetNumExecutors) {
-      adjustExecutors(executorsToDecommission.map(_._1))
+      adjustExecutors(executorsToDecommission)
     }
 
-    executorsToDecommission.filter { case (executorId, decomInfo) =>
-      doDecommission(executorId, decomInfo)
-    }.map(_._1)
-  }
-
-  // Do not change this code without running the K8s integration suites
-  private def doDecommission(executorId: String,
-      decomInfo: ExecutorDecommissionInfo): Boolean = {
-
-    logInfo(s"Asking executor $executorId to decommissioning.")
-    scheduler.executorDecommission(executorId, decomInfo)
-    // Send decommission message to the executor (it could have originated on the executor
-    // but not necessarily).
-    CoarseGrainedSchedulerBackend.this.synchronized {
-      executorDataMap.get(executorId) match {
-        case Some(executorInfo) =>
-          executorInfo.executorEndpoint.send(DecommissionSelf)
-        case None =>
-          // Ignoring the executor since it is not registered.
-          logWarning(s"Attempted to decommission unknown executor $executorId.")
-          return false
-      }
+    // Mark those corresponding BlockManagers as decommissioned first before we sending
+    // decommission notification to executors. So, it's less likely to lead to the race
+    // condition where `getPeer` request from the decommissioned executor comes first
+    // before the BlockManagers are marked as decommissioned.
+    if (conf.get(STORAGE_DECOMMISSION_ENABLED)) {
+      logInfo(s"Asking BlockManagers on executors (${executorsToDecommission.mkString(", ")}) " +
+        s"to decommissioning.")
+      scheduler.sc.env.blockManager.master.decommissionBlockManagers(executorsToDecommission)
     }
-    logInfo(s"Asked executor $executorId to decommission.")
 
-    if (conf.get(STORAGE_DECOMMISSION_ENABLED)) {
-      try {
-        logInfo(s"Asking block manager corresponding to executor $executorId to decommission.")
-        scheduler.sc.env.blockManager.master.decommissionBlockManagers(Seq(executorId))
-      } catch {
-        case e: Exception =>
-          logError("Unexpected error during block manager " +
-            s"decommissioning for executor $executorId: ${e.toString}", e)
-          return false
+    if (!triggeredByExecutor) {

Review comment:
       I think we might want to move this up because doing the storage decommissioning before the executor it's self is marked for decommissioning could mean that local blocks get stored in a manner where we might not know to migrate them. If you've dug through the code here and you don't believe that happens that's fine.

##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManagerMasterEndpoint.scala
##########
@@ -163,8 +163,7 @@ class BlockManagerMasterEndpoint(
       context.reply(true)
 
     case DecommissionBlockManagers(executorIds) =>
-      val bmIds = executorIds.flatMap(blockManagerIdByExecutor.get)
-      decommissionBlockManagers(bmIds)
+      decommissioningBlockManagerSet ++= executorIds.flatMap(blockManagerIdByExecutor.get)

Review comment:
       Maybe we can add a comment similar to the previous one of decommissionBlockManagers stating why we are doing this and it's intended effect.

##########
File path: core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala
##########
@@ -465,72 +464,50 @@ class CoarseGrainedSchedulerBackend(scheduler: TaskSchedulerImpl, val rpcEnv: Rp
    * @param executorsAndDecomInfo Identifiers of executors & decommission info.
    * @param adjustTargetNumExecutors whether the target number of executors will be adjusted down
    *                                 after these executors have been decommissioned.
+   * @param triggeredByExecutor whether the decommission is triggered at executor.
    * @return the ids of the executors acknowledged by the cluster manager to be removed.
    */
   override def decommissionExecutors(
       executorsAndDecomInfo: Array[(String, ExecutorDecommissionInfo)],
-      adjustTargetNumExecutors: Boolean): Seq[String] = {
-
+      adjustTargetNumExecutors: Boolean,
+      triggeredByExecutor: Boolean): Seq[String] = withLock {

Review comment:
       Not 100% sure, but I do remember that the locking mechanism withLock did manage to deadlock before, it might be good to add a logging statement in here that verifies this block is called in the integration test.

##########
File path: core/src/test/scala/org/apache/spark/deploy/client/AppClientSuite.scala
##########
@@ -122,7 +122,10 @@ class AppClientSuite
 
       // Send a decommission self to all the workers
       // Note: normally the worker would send this on their own.
-      workers.foreach(worker => worker.decommissionSelf())
+      workers.foreach { worker =>
+       worker.decommissionSelf()
+       master.self.send(WorkerDecommissioning(worker.workerId, worker.self))

Review comment:
       left a comment up above about how to improve the testing, I think that part of the test should cover the message being sent from the worker so we probably don't want to manually send that message here.

##########
File path: core/src/main/scala/org/apache/spark/storage/BlockManager.scala
##########
@@ -1809,7 +1809,9 @@ private[spark] class BlockManager(
     blocksToRemove.size
   }
 
-  def decommissionBlockManager(): Unit = synchronized {
+  def decommissionBlockManager(): Unit = storageEndpoint.ask(DecommissionBlockManager)

Review comment:
       Makes sense, although maybe introducing a new name instead of changing the use of a previous function name would be easier for verifying.

##########
File path: resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DecommissionSuite.scala
##########
@@ -41,9 +41,8 @@ private[spark] trait DecommissionSuite { k8sSuite: KubernetesSuite =>
       mainClass = "",
       expectedLogOnCompletion = Seq(
         "Finished waiting, stopping Spark",
-        "Received decommission executor message",
-        "Acknowledged decommissioning block manager",
-        ": Executor decommission.",
+        "Asking BlockManagers", "to decommissioning",
+        "Asking executor", "to decommissioning.",

Review comment:
       nit: "to decommissioning." and "to decommissioning." are close enough you really only need the second to cover both cases since we're looking for the string anywhere. If you did want to you could switch this to regex and do more accurate matching. up to you.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org