Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/07/20 13:24:10 UTC

[GitHub] [spark] attilapiros commented on a change in pull request #29014: [SPARK-32199][SPARK-32198] Reduce job failures during decommissioning

attilapiros commented on a change in pull request #29014:
URL: https://github.com/apache/spark/pull/29014#discussion_r457378657



##########
File path: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
##########
@@ -1767,10 +1767,18 @@ private[spark] class DAGScheduler(
 
           // TODO: mark the executor as failed only if there were lots of fetch failures on it
           if (bmAddress != null) {
-            val hostToUnregisterOutputs = if (env.blockManager.externalShuffleServiceEnabled &&
-              unRegisterOutputOnHostOnFetchFailure) {
-              // We had a fetch failure with the external shuffle service, so we
-              // assume all shuffle data on the node is bad.
+            val externalShuffleServiceEnabled = env.blockManager.externalShuffleServiceEnabled
+            val isHostDecommissioned = taskScheduler
+              .getExecutorDecommissionInfo(bmAddress.executorId)
+              .exists(_.isHostDecommissioned)
+            // Host shuffle data is considered lost if:
+            // - If we know that the host was decommissioned
+            // - Or when `unRegisterOutputOnHostOnFetchFailure` is enabled and we had
+            //   a fetch failure with the external shuffle service, so we assume all
+            //   shuffle data on the node is bad.
+            val hostLost = isHostDecommissioned || (externalShuffleServiceEnabled &&

Review comment:
       I have two issues:
   
   1) `unRegisterOutputOnHostOnFetchFailure` is the feature flag for this unregistering, with the following description:
   
   https://github.com/apache/spark/blob/a4ca355af8556e8c5948e492ef70ef0b48416dc4/core/src/main/scala/org/apache/spark/internal/config/package.scala#L794-L801
   
   With your change, the feature flag is not considered at all when the host is decommissioned, so the unregistering will happen regardless of its value. I see the benefit of unregistering in that case, but if you leave this condition as it is now, please update the feature flag description too.
   
   2) Can we find a different name for `hostLost`? The host can be lost even when `unRegisterOutputOnHostOnFetchFailure` is false. Alternatively, just inline this val, since the comment above it should be sufficient.
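
To make both points concrete, here is a minimal standalone sketch of the condition as I read it from the diff. It is not the actual `DAGScheduler` code: the object `UnregisterDecision` and the method `shouldUnregisterOutputsOnHost` are hypothetical names used only for illustration, and the three parameters stand in for the values computed in the diff above.

```scala
object UnregisterDecision {
  // Hypothetical helper (not part of Spark): decides whether all shuffle
  // outputs on a host should be unregistered after a fetch failure.
  // The name describes the action taken rather than claiming the host is lost.
  def shouldUnregisterOutputsOnHost(
      isHostDecommissioned: Boolean,
      externalShuffleServiceEnabled: Boolean,
      unRegisterOutputOnHostOnFetchFailure: Boolean): Boolean = {
    // A decommissioned host bypasses the feature flag entirely;
    // otherwise both the external shuffle service and the flag are required.
    isHostDecommissioned ||
      (externalShuffleServiceEnabled && unRegisterOutputOnHostOnFetchFailure)
  }
}
```

With this shape, a decommissioned host yields `true` even when the flag is `false`, which is the behavior the flag description currently does not mention.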



