Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2018/12/20 15:27:59 UTC

[GitHub] tgravescs commented on a change in pull request #23223: [SPARK-26269][YARN] YarnAllocator should have same blacklist behaviour as YARN to maximize use of cluster resources

tgravescs commented on a change in pull request #23223: [SPARK-26269][YARN] YarnAllocator should have same blacklist behaviour as YARN to maximize use of cluster resources
URL: https://github.com/apache/spark/pull/23223#discussion_r243312174
 
 

 ##########
 File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala
 ##########
 @@ -612,13 +612,23 @@ private[yarn] class YarnAllocator(
             val message = "Container killed by YARN for exceeding physical memory limits. " +
               s"$diag Consider boosting ${EXECUTOR_MEMORY_OVERHEAD.key}."
             (true, message)
-          case _ =>
-            // all the failures which not covered above, like:
-            // disk failure, kill by app master or resource manager, ...
-            allocatorBlacklistTracker.handleResourceAllocationFailure(hostOpt)
-            (true, "Container marked as failed: " + containerId + onHostStr +
-              ". Exit status: " + completedContainer.getExitStatus +
-              ". Diagnostics: " + completedContainer.getDiagnostics)
+          case other_exit_status =>
+            // SPARK-26269: follow YARN's blacklisting behaviour(see https://github
+            // .com/apache/hadoop/blob/228156cfd1b474988bc4fedfbf7edddc87db41e3/had
+            // oop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/ap
+            // ache/hadoop/yarn/util/Apps.java#L273 for details)
+            if (NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS.contains(other_exit_status)) {
+              (true, s"Container marked as failed: $containerId$onHostStr" +
 
 Review comment:
   I think we want to return false here for exitCausedByApp, since these don't seem to be issues with the app. From the comment in the YARN code:

       // Neither the app's fault nor the system's fault. This happens by design,
       // so no need for skipping nodes

   If we mark it as true then it counts against our task failures; if it's false it doesn't. It seems like these shouldn't count against our failures.
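
   For illustration only, a minimal standalone Scala sketch of the behaviour suggested above (this is not the PR's patch; the set membership, the classify helper, and the message strings are assumptions made for the sketch):

import org.apache.hadoop.yarn.api.records.ContainerExitStatus

object ExitStatusSketch {

  // Illustrative stand-in for the PR's NOT_APP_AND_SYSTEM_FAULT_EXIT_STATUS;
  // the exact membership in the real patch may differ.
  private val notAppAndSystemFault: Set[Int] = Set(
    ContainerExitStatus.KILLED_BY_RESOURCEMANAGER,
    ContainerExitStatus.KILLED_BY_APPMASTER,
    ContainerExitStatus.KILLED_AFTER_APP_COMPLETION,
    ContainerExitStatus.ABORTED)

  /** Returns (exitCausedByApp, message) for a completed container. */
  def classify(exitStatus: Int, containerId: String, onHostStr: String): (Boolean, String) =
    exitStatus match {
      case status if notAppAndSystemFault.contains(status) =>
        // Neither the app's fault nor the system's fault: report false so this
        // exit does not count against the application's failure limits.
        (false, s"Container marked as failed: $containerId$onHostStr. " +
          s"Exit status: $status. Not counted as an application failure.")
      case status =>
        (true, s"Container marked as failed: $containerId$onHostStr. Exit status: $status.")
    }
}

   The idea in the comment above is that with exitCausedByApp = false these exits do not add to the application's executor-failure count.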
