Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/18 22:06:23 UTC

[GitHub] [spark] warrenzhu25 opened a new pull request, #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

warrenzhu25 opened a new pull request, #37924:
URL: https://github.com/apache/spark/pull/37924

   ### What changes were proposed in this pull request?
   Add a config `spark.stage.attempt.ignoreOnDecommissionFetchFailure` to control whether to ignore stage fetch failures caused by decommissioned executors when counting toward `spark.stage.maxConsecutiveAttempts`.
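
   A minimal usage sketch, assuming the config key as proposed here (the exact key name is still under discussion in the review below):
   ```scala
   import org.apache.spark.SparkConf

   // Opt in to not counting decommission-related fetch failures toward the
   // consecutive stage attempt limit (the new config defaults to false).
   val conf = new SparkConf()
     .set("spark.stage.attempt.ignoreOnDecommissionFetchFailure", "true")
     .set("spark.stage.maxConsecutiveAttempts", "4")
   ```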
   
   ### Why are the changes needed?
   When executor decommission is enabled, there can be many stage failures caused by `FetchFailed` from decommissioned executors, which can in turn fail the whole job. It would be better not to count such failures toward `spark.stage.maxConsecutiveAttempts`.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes
   
   ### How was this patch tested?
   Added a test in `DAGSchedulerSuite`.
   




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r977435531


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2159,6 +2176,24 @@ private[spark] class DAGScheduler(
     }
   }
 
+  /**
+   * Whether executor is decommissioned. Return true when executors are in below cases:
+   *  1. Waiting for decommission start
+   *  2. Under decommission process
+   *  3. Stopped or terminated after finishing decommission
+   *  4. Under decommission process, then removed by driver with other reasons
+   */
+  private[scheduler] def isExecutorDecommissioned(

Review Comment:
   The method name should be revised according to your new comments.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r977436492


##########
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##########
@@ -1140,6 +1147,43 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     assertDataStructuresEmpty()
   }
 
+  test("SPARK-40481: Multiple consecutive stage fetch failures from decommissioned executor" +
+    "should not fail job when ignoreOnDecommissionFetchFailure is enabled.") {

Review Comment:
   We need a space, `"should` -> `" should`.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r974893716


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -2221,6 +2221,14 @@ package object config {
       .checkValue(_ >= 0, "needs to be a non-negative value")
       .createWithDefault(5)
 
+  private[spark] val STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE =
+    ConfigBuilder("spark.stage.ignoreOnDecommissionFetchFailure")

Review Comment:
   Shall we remove the redundant `On`? It would be better because the variable name already has no `On`.
   ```
   - ignoreOnDecommissionFetchFailure
   + ignoreDecommissionFetchFailure
   ```





[GitHub] [spark] weixiuli commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
weixiuli commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r973818049


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -1853,15 +1856,31 @@ private[spark] class DAGScheduler(
       case FetchFailed(bmAddress, shuffleId, _, mapIndex, reduceId, failureMessage) =>
         val failedStage = stageIdToStage(task.stageId)
         val mapStage = shuffleIdToMapStage(shuffleId)
+        val isExecutorDecommissioned =

Review Comment:
   Should we use a method like `def isExecutorDecommissioned(bmAddress: BlockManagerId): Boolean = {...}` instead of this variable?
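
   As a rough sketch only (not the PR's final code, and assuming the `TaskScheduler.getExecutorDecommissionState` API discussed later in this thread), such a helper could look like:
   ```scala
   // Sketch: treat the fetch failure as decommission-related when the TaskScheduler
   // still tracks a decommission state for this executor; bmAddress may be null.
   private def isExecutorDecommissioned(bmAddress: BlockManagerId): Boolean = {
     bmAddress != null &&
       taskScheduler.getExecutorDecommissionState(bmAddress.executorId).isDefined
   }
   ```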





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r974491025


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2159,6 +2171,16 @@ private[spark] class DAGScheduler(
     }
   }
 
+  private def isExecutorDecommissioned(bmAddress: BlockManagerId) = {

Review Comment:
   If you don't mind, please add `taskScheduler` as a parameter here and use it inside the function body.





[GitHub] [spark] Ngone51 commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r990925620


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2159,6 +2176,26 @@ private[spark] class DAGScheduler(
     }
   }
 
+  /**
+   * Whether executor is decommissioning or decommissioned.
+   * Return true when:
+   *  1. Waiting for decommission start
+   *  2. Under decommission process
+   * Return false when:
+   *  1. Stopped or terminated after finishing decommission
+   *  2. Under decommission process, then removed by driver with other reasons
+   */
+  private[scheduler] def isExecutorDecommissioningOrDecommissioned(
+      taskScheduler: TaskScheduler, bmAddress: BlockManagerId): Boolean = {
+    if (bmAddress != null) {
+      taskScheduler
+        .getExecutorDecommissionState(bmAddress.executorId)

Review Comment:
   Hi @warrenzhu25, the general idea looks good to me. I just have one concern here: I think `getExecutorDecommissionState` could return empty (which means the executor decommission has finished) in most cases where we hit shuffle fetch failures. The shuffle data is not cleaned up during executor decommission but just duplicated to another executor. So, normally, the reduce stage should still fetch shuffle data from the decommissioning executor successfully. Fetch failures should only happen when the executor exits on its own, e.g. due to a network connection interruption. In that case, the executor could already have been removed from `TaskSchedulerImpl.executorsPendingDecommission` on receiving the executor lost event, which would result in `getExecutorDecommissionState` returning empty.





[GitHub] [spark] Ngone51 commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
Ngone51 commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r991057401


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2159,6 +2176,26 @@ private[spark] class DAGScheduler(
     }
   }
 
+  /**
+   * Whether executor is decommissioning or decommissioned.
+   * Return true when:
+   *  1. Waiting for decommission start
+   *  2. Under decommission process
+   * Return false when:
+   *  1. Stopped or terminated after finishing decommission
+   *  2. Under decommission process, then removed by driver with other reasons
+   */
+  private[scheduler] def isExecutorDecommissioningOrDecommissioned(
+      taskScheduler: TaskScheduler, bmAddress: BlockManagerId): Boolean = {
+    if (bmAddress != null) {
+      taskScheduler
+        .getExecutorDecommissionState(bmAddress.executorId)

Review Comment:
   I saw the comments, but IIUC that's not what we wanted here. I think what we want is to return true as long as the executor is being decommissioned or has finished decommissioning.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r973808911


##########
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##########
@@ -1140,6 +1146,38 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     assertDataStructuresEmpty()
   }
 
+  test("SPARK-40481: Multiple consecutive stage fetch failures should not fail job " +
+    "when ignoreOnDecommissionFetchFailure is enabled.") {

Review Comment:
   Could you revise the test case name? We are not ignoring fetch failures from normal executors.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r977436837


##########
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##########
@@ -1140,6 +1147,43 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     assertDataStructuresEmpty()
   }
 
+  test("SPARK-40481: Multiple consecutive stage fetch failures from decommissioned executor" +
+    "should not fail job when ignoreOnDecommissionFetchFailure is enabled.") {

Review Comment:
   `ignoreOnDecommissionFetchFailure` -> `ignoreDecommissionFetchFailure`






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r973808705


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -1853,15 +1856,31 @@ private[spark] class DAGScheduler(
       case FetchFailed(bmAddress, shuffleId, _, mapIndex, reduceId, failureMessage) =>
         val failedStage = stageIdToStage(task.stageId)
         val mapStage = shuffleIdToMapStage(shuffleId)
+        val isExecutorDecommissioned =

Review Comment:
   We should not have this overhead here. Could you move this into the `else` branch, which needs this variable?





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r977433871


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2159,6 +2176,24 @@ private[spark] class DAGScheduler(
     }
   }
 
+  /**
+   * Whether executor is decommissioned. Return true when executors are in below cases:
+   *  1. Waiting for decommission start
+   *  2. Under decommission process
+   *  3. Stopped or terminated after finishing decommission
+   *  4. Under decommission process, then removed by driver with other reasons

Review Comment:
   Thank you for the updates, but what I meant was that this as-is method is not aware of Case (4), @warrenzhu25.





[GitHub] [spark] warrenzhu25 commented on pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #37924:
URL: https://github.com/apache/spark/pull/37924#issuecomment-1250426355

   > The following claim seems too general to be true. Could you be more specific, @warrenzhu25?
   > 
   > > When executor decommission is enabled, there can be many stage failures caused by FetchFailed from decommissioned executors,
   > 
   > cc @holdenk too
   
   Updated description.




[GitHub] [spark] AmplabJenkins commented on pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37924:
URL: https://github.com/apache/spark/pull/37924#issuecomment-1251551174

   Can one of the admins verify this patch?




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r990930867


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2159,6 +2176,26 @@ private[spark] class DAGScheduler(
     }
   }
 
+  /**
+   * Whether executor is decommissioning or decommissioned.
+   * Return true when:
+   *  1. Waiting for decommission start
+   *  2. Under decommission process
+   * Return false when:
+   *  1. Stopped or terminated after finishing decommission
+   *  2. Under decommission process, then removed by driver with other reasons
+   */
+  private[scheduler] def isExecutorDecommissioningOrDecommissioned(
+      taskScheduler: TaskScheduler, bmAddress: BlockManagerId): Boolean = {
+    if (bmAddress != null) {
+      taskScheduler
+        .getExecutorDecommissionState(bmAddress.executorId)

Review Comment:
   Hi, @Ngone51. It's included in the function description, isn't it?
   ```
   * Return false when:
   *  1. Stopped or terminated after finishing decommission
   *  2. Under decommission process, then removed by driver with other reasons
   */
   ```





[GitHub] [spark] warrenzhu25 commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r978254789


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2159,6 +2176,24 @@ private[spark] class DAGScheduler(
     }
   }
 
+  /**
+   * Whether executor is decommissioned. Return true when executors are in below cases:
+   *  1. Waiting for decommission start
+   *  2. Under decommission process
+   *  3. Stopped or terminated after finishing decommission
+   *  4. Under decommission process, then removed by driver with other reasons

Review Comment:
   I got the point. Thanks for the explanation. Updated.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r974490653


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -1860,8 +1863,17 @@ private[spark] class DAGScheduler(
             s"(attempt ${failedStage.latestInfo.attemptNumber}) running")
         } else {
           failedStage.failedAttemptIds.add(task.stageAttemptId)
+          val ignoreStageFailure = ignoreDecommissionFetchFailure &&
+            isExecutorDecommissioned(bmAddress)
+          if (ignoreStageFailure) {
+            logInfo("Ignoring fetch failure from $task of $failedStage attempt" +

Review Comment:
   This log message might be misleading or wrong when `disallowStageRetryForTest` is true.
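
   One possible way to avoid that, sketched here only as an illustration (the guard on `disallowStageRetryForTest` is an assumption, not the PR's code):
   ```scala
   // Sketch: only log the "ignoring" message when the failure will actually be
   // ignored, i.e. when stage retry is not disabled for tests (otherwise the
   // stage is aborted regardless of the decommission check).
   if (ignoreStageFailure && !disallowStageRetryForTest) {
     logInfo(s"Ignoring fetch failure from $task of $failedStage attempt " +
       s"${task.stageAttemptId} as executor ${bmAddress.executorId} is decommissioned and " +
       s"${config.STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE.key}=true")
   }
   ```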





[GitHub] [spark] warrenzhu25 commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r974793372


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -1860,8 +1867,18 @@ private[spark] class DAGScheduler(
             s"(attempt ${failedStage.latestInfo.attemptNumber}) running")
         } else {
           failedStage.failedAttemptIds.add(task.stageAttemptId)
+          val ignoreStageFailure = ignoreDecommissionFetchFailure &&
+            isExecutorDecommissioned(taskScheduler, bmAddress)
+          if (ignoreStageFailure) {
+            logInfo("Ignoring fetch failure from $task of $failedStage attempt " +
+              s"${task.stageAttemptId} when count spark.stage.maxConsecutiveAttempts " +
+              "as executor ${bmAddress.executorId} is decommissioned and " +
+              s" ${config.STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE.key}=true")
+          }
+
           val shouldAbortStage =
-            failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts ||
+            (!ignoreStageFailure &&
+              failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts) ||
             disallowStageRetryForTest

Review Comment:
   Good catch. Updated.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r973809002


##########
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##########
@@ -1140,6 +1146,38 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti
     assertDataStructuresEmpty()
   }
 
+  test("SPARK-40481: Multiple consecutive stage fetch failures should not fail job " +
+    "when ignoreOnDecommissionFetchFailure is enabled.") {
+    conf.set(config.STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE.key, "true")
+
+    setupStageAbortTest(sc)
+    val parts = 2
+    val shuffleMapRdd = new MyRDD(sc, parts, Nil)
+    val shuffleDep = new ShuffleDependency(shuffleMapRdd, new HashPartitioner(parts))
+    val reduceRdd = new MyRDD(sc, parts, List(shuffleDep), tracker = mapOutputTracker)
+    submit(reduceRdd, (0 until parts).toArray)
+
+    for (attempt <- 0 until scheduler.maxConsecutiveStageAttempts) {
+      // Complete all the tasks for the current attempt of stage 0 successfully
+      completeShuffleMapStageSuccessfully(0, attempt, numShufflePartitions = parts,
+        Seq("hostA", "hostB"))
+
+      taskScheduler.executorDecommission("hostA-exec", ExecutorDecommissionInfo(""))
+      // Now we should have a new taskSet, for a new attempt of stage 1.
+      // Fail all these tasks with FetchFailure
+      completeNextStageWithFetchFailure(1, attempt, shuffleDep)
+
+      // this will trigger a resubmission of stage 0, since we've lost some of its
+      // map output, for the next iteration through the loop
+      scheduler.resubmitFailedStages()
+   }
+
+    // Confirm job finished successfully
+    sc.listenerBus.waitUntilEmpty()
+    assert(scheduler.runningStages.nonEmpty)
+    assert(!ended)
+

Review Comment:
   Shall we clean up this redundant empty line?





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r973807828


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -2221,6 +2221,14 @@ package object config {
       .checkValue(_ >= 0, "needs to be a non-negative value")
       .createWithDefault(5)
 
+  private[spark] val STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE =
+    ConfigBuilder("spark.stage.attempt.ignoreOnDecommissionFetchFailure")
+      .doc("Whether ignore stage fetch failure caused by executor decommission when " +
+      "count spark.stage.maxConsecutiveAttempts")
+      .version("3.4.0")
+      .booleanConf
+      .createWithDefault(false)

Review Comment:
   Thank you for starting this with `false` first. We can enable this after one release cycle.





[GitHub] [spark] dongjoon-hyun commented on pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #37924:
URL: https://github.com/apache/spark/pull/37924#issuecomment-1250433410

   Thank you for the swift updates, @warrenzhu25 .




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r974483075


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -1860,8 +1863,17 @@ private[spark] class DAGScheduler(
             s"(attempt ${failedStage.latestInfo.attemptNumber}) running")
         } else {
           failedStage.failedAttemptIds.add(task.stageAttemptId)
+          val ignoreStageFailure = ignoreDecommissionFetchFailure &&
+            isExecutorDecommissioned(bmAddress)
+          if (ignoreStageFailure) {
+            logInfo("Ignoring fetch failure from $task of $failedStage attempt" +
+              s"${task.stageAttemptId} as executor ${bmAddress.executorId} is decommissioned and " +
+              s" ${config.STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE.key}=true")
+          }
+
           val shouldAbortStage =
-            failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts ||
+            (!ignoreStageFailure &&
+              failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts)||

Review Comment:
   `)||` -> `) ||`?





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r977433871


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2159,6 +2176,24 @@ private[spark] class DAGScheduler(
     }
   }
 
+  /**
+   * Whether executor is decommissioned. Return true when executors are in below cases:
+   *  1. Waiting for decommission start
+   *  2. Under decommission process
+   *  3. Stopped or terminated after finishing decommission
+   *  4. Under decommission process, then removed by driver with other reasons

Review Comment:
   Thank you for the updates, but what I meant was that this as-is method is not aware of Cases (3) and (4), @warrenzhu25.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r974486551


##########
docs/configuration.md:
##########
@@ -2605,6 +2605,15 @@ Apart from these, the following properties are also available, and may be useful
   </td>
   <td>2.2.0</td>
 </tr>
+<tr>
+  <td><code>spark.stage.attempt.ignoreOnDecommissionFetchFailure</code></td>
+  <td>false</td>
+  <td>
+    Whether ignore stage fetch failure caused by executor decommission when 

Review Comment:
   Please remove the trailing space at the end of this line.





[GitHub] [spark] dongjoon-hyun commented on pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #37924:
URL: https://github.com/apache/spark/pull/37924#issuecomment-1250438346

   I made a few comments. BTW, thank you for working on this area, @warrenzhu25 .




[GitHub] [spark] dongjoon-hyun commented on pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on PR #37924:
URL: https://github.com/apache/spark/pull/37924#issuecomment-1259896526

   Sorry, I was too busy with the 3.3.1 preparation. I will resume the review of this new improvement.




[GitHub] [spark] warrenzhu25 commented on pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
warrenzhu25 commented on PR #37924:
URL: https://github.com/apache/spark/pull/37924#issuecomment-1259892522

   @dongjoon-hyun Any more comments?




[GitHub] [spark] dongjoon-hyun closed pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun closed pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor
URL: https://github.com/apache/spark/pull/37924




[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r973807726


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -2221,6 +2221,14 @@ package object config {
       .checkValue(_ >= 0, "needs to be a non-negative value")
       .createWithDefault(5)
 
+  private[spark] val STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE =
+    ConfigBuilder("spark.stage.attempt.ignoreOnDecommissionFetchFailure")
+      .doc("Whether ignore stage fetch failure caused by executor decommission when " +
+      "count spark.stage.maxConsecutiveAttempts")

Review Comment:
   indentation?





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r973808498


##########
core/src/main/scala/org/apache/spark/internal/config/package.scala:
##########
@@ -2221,6 +2221,14 @@ package object config {
       .checkValue(_ >= 0, "needs to be a non-negative value")
       .createWithDefault(5)
 
+  private[spark] val STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE =
+    ConfigBuilder("spark.stage.attempt.ignoreOnDecommissionFetchFailure")

Review Comment:
   We need a proper new name. Please ignore the Amazon configuration name.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r974479715


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -230,6 +230,9 @@ private[spark] class DAGScheduler(
     sc.getConf.getInt("spark.stage.maxConsecutiveAttempts",
       DAGScheduler.DEFAULT_MAX_CONSECUTIVE_STAGE_ATTEMPTS)
 
+  private[scheduler] val ignoreDecommissionFetchFailure =

Review Comment:
   Could you add a comment like the following?
   https://github.com/apache/spark/blob/2d6d5e2c44861745e4a7f5a4c15f74337f651b03/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L226-L228





[GitHub] [spark] mridulm commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
mridulm commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r974600104


##########
docs/configuration.md:
##########
@@ -2605,6 +2605,15 @@ Apart from these, the following properties are also available, and may be useful
   </td>
   <td>2.2.0</td>
 </tr>
+<tr>
+  <td><code>spark.stage.attempt.ignoreOnDecommissionFetchFailure</code></td>

Review Comment:
   `spark.stage.attempt.ignoreOnDecommissionFetchFailure` -> `spark.stage.ignoreOnDecommissionFetchFailure`



##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -1860,8 +1867,18 @@ private[spark] class DAGScheduler(
             s"(attempt ${failedStage.latestInfo.attemptNumber}) running")
         } else {
           failedStage.failedAttemptIds.add(task.stageAttemptId)
+          val ignoreStageFailure = ignoreDecommissionFetchFailure &&
+            isExecutorDecommissioned(taskScheduler, bmAddress)
+          if (ignoreStageFailure) {
+            logInfo("Ignoring fetch failure from $task of $failedStage attempt " +
+              s"${task.stageAttemptId} when count spark.stage.maxConsecutiveAttempts " +
+              "as executor ${bmAddress.executorId} is decommissioned and " +
+              s" ${config.STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE.key}=true")
+          }
+
           val shouldAbortStage =
-            failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts ||
+            (!ignoreStageFailure &&
+              failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts) ||
             disallowStageRetryForTest

Review Comment:
   QQ: We are preventing the immediate failure from aborting the stage, but might we be effectively reducing the number of stage failures that can be tolerated?
   
   For example:
   attempt 0, attempt 1, attempt 2 failed due to decommission
   attempt 3 failed for other reasons -> job failed (assuming maxConsecutiveStageAttempts = 4)
   
   Is this the behavior we will now exhibit?
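
   One way to keep the full failure budget, sketched here as an illustration only (not the PR's code), would be to skip recording the attempt when the failure is ignored:
   ```scala
   // Sketch: do not count a decommission-related fetch failure toward
   // spark.stage.maxConsecutiveAttempts, so later unrelated failures still get
   // the full number of tolerated attempts.
   if (!ignoreStageFailure) {
     failedStage.failedAttemptIds.add(task.stageAttemptId)
   }
   val shouldAbortStage =
     failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts ||
       disallowStageRetryForTest
   ```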






[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

Posted by GitBox <gi...@apache.org>.
dongjoon-hyun commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r977435182


##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -2159,6 +2176,24 @@ private[spark] class DAGScheduler(
     }
   }
 
+  /**
+   * Whether executor is decommissioned. Return true when executors are in below cases:
+   *  1. Waiting for decommission start
+   *  2. Under decommission process
+   *  3. Stopped or terminated after finishing decommission
+   *  4. Under decommission process, then removed by driver with other reasons

Review Comment:
   I hope we can add a comment that this is a best-effort approach and that there is a corner case like (4) which we cannot ignore.


