Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/09/19 19:37:12 UTC

[GitHub] [spark] mridulm commented on a diff in pull request #37924: [SPARK-40481][CORE] Ignore stage fetch failure caused by decommissioned executor

mridulm commented on code in PR #37924:
URL: https://github.com/apache/spark/pull/37924#discussion_r974600104


##########
docs/configuration.md:
##########
@@ -2605,6 +2605,15 @@ Apart from these, the following properties are also available, and may be useful
   </td>
   <td>2.2.0</td>
 </tr>
+<tr>
+  <td><code>spark.stage.attempt.ignoreOnDecommissionFetchFailure</code></td>

Review Comment:
   `spark.stage.attempt.ignoreOnDecommissionFetchFailure` -> `spark.stage.ignoreOnDecommissionFetchFailure`
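
   A minimal sketch of how a config definition with the shorter key suggested above might look. The constant name `STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE` is taken from the diff below; the description, version, and default value are assumptions for illustration, not the PR's actual definition.

   ```scala
   // Hypothetical sketch in the org.apache.spark.internal.config style;
   // only the key reflects the suggestion above, everything else is assumed.
   private[spark] val STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE =
     ConfigBuilder("spark.stage.ignoreOnDecommissionFetchFailure")
       .doc("Whether to ignore stage fetch failures caused by a decommissioned " +
         "executor when counting spark.stage.maxConsecutiveAttempts.")
       .version("3.4.0")  // assumed version
       .booleanConf
       .createWithDefault(false)  // assumed default
   ```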



##########
core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala:
##########
@@ -1860,8 +1867,18 @@ private[spark] class DAGScheduler(
             s"(attempt ${failedStage.latestInfo.attemptNumber}) running")
         } else {
           failedStage.failedAttemptIds.add(task.stageAttemptId)
+          val ignoreStageFailure = ignoreDecommissionFetchFailure &&
+            isExecutorDecommissioned(taskScheduler, bmAddress)
+          if (ignoreStageFailure) {
+            logInfo(s"Ignoring fetch failure from $task of $failedStage attempt " +
+              s"${task.stageAttemptId} when counting spark.stage.maxConsecutiveAttempts " +
+              s"as executor ${bmAddress.executorId} is decommissioned and " +
+              s"${config.STAGE_IGNORE_DECOMMISSION_FETCH_FAILURE.key}=true")
+          }
+
           val shouldAbortStage =
-            failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts ||
+            (!ignoreStageFailure &&
+              failedStage.failedAttemptIds.size >= maxConsecutiveStageAttempts) ||
             disallowStageRetryForTest

Review Comment:
   QQ: We are preventing the immediate failure from aborting the stage, but might we effectively be reducing the number of stage failures that can be tolerated?
   
   For example (assuming maxConsecutiveStageAttempts = 4):
   attempt 0, attempt 1, attempt 2 fail due to decommission;
   attempt 3 fails for other reasons -> job failed.
   
   Is this the behavior we will now exhibit?
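
   A minimal sketch (not the actual DAGScheduler code) of the scenario above, assuming maxConsecutiveStageAttempts = 4 and that, as in the diff, failedAttemptIds still records decommission-related attempts even though the abort check is skipped for them:

   ```scala
   import scala.collection.mutable

   val maxConsecutiveStageAttempts = 4
   val failedAttemptIds = mutable.HashSet[Int]()

   // Attempts 0-2 fail due to decommissioned executors: the abort check is
   // skipped for each of them, but the attempt ids are still recorded.
   (0 to 2).foreach(id => failedAttemptIds.add(id))

   // Attempt 3 fails for an unrelated reason: ignoreStageFailure is false, and
   // failedAttemptIds.size (4) >= maxConsecutiveStageAttempts (4), so the stage
   // would be aborted after a single non-decommission failure.
   failedAttemptIds.add(3)
   val shouldAbortStage = failedAttemptIds.size >= maxConsecutiveStageAttempts
   println(shouldAbortStage)  // true
   ```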



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

