Posted to commits@spark.apache.org by sr...@apache.org on 2020/06/01 14:47:30 UTC
[spark] branch master updated: [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new e70df2c [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
e70df2c is described below
commit e70df2cea46f71461d8d401a420e946f999862c1
Author: Yuexin Zhang <za...@gmail.com>
AuthorDate: Mon Jun 1 09:46:18 2020 -0500
[SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
### What changes were proposed in this pull request?
Improve the check logic for whether all node managers are really blacklisted.
### Why are the changes needed?
I observed that when the AM is out of sync with the ResourceManager, or the RM has trouble reporting back the current number of available NMs, something like the following happens:
...
20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately.
...
20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing.
...
then the Spark job suddenly runs into the AllNodeBlacklisted state:
...
20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)
...
but there are actually no blacklisted nodes in currentBlacklistedYarnNodes, and I do not see any blacklisting message from:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119
We should only return isAllNodeBlacklisted = true when numClusterNodes > 0 AND 'currentBlacklistedYarnNodes.size >= numClusterNodes'.
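The proposed guard can be sketched as a small standalone function (a simplified illustration only; the real method lives on YarnAllocatorBlacklistTracker and takes its inputs from the tracker's state rather than as parameters):

```scala
// Minimal sketch of the proposed check. Parameter names are illustrative.
object BlacklistCheckSketch {
  def isAllNodeBlacklisted(numClusterNodes: Int, numBlacklisted: Int): Boolean =
    if (numClusterNodes <= 0) {
      // The RM reported no available nodes (e.g. while the AM is resyncing):
      // treat this as "not all blacklisted" instead of failing the app.
      false
    } else {
      numBlacklisted >= numClusterNodes
    }
}
```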
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a minor logic change; no tests were added or changed.
Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue.
Authored-by: Yuexin Zhang <za...@gmail.com>
Signed-off-by: Sean Owen <sr...@gmail.com>
---
.../apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
index fa8c961..339d371 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
@@ -103,7 +103,14 @@ private[spark] class YarnAllocatorBlacklistTracker(
refreshBlacklistedNodes()
}
- def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
+ def isAllNodeBlacklisted: Boolean = {
+ if (numClusterNodes <= 0) {
+ logWarning("No available nodes reported, please check Resource Manager.")
+ false
+ } else {
+ currentBlacklistedYarnNodes.size >= numClusterNodes
+ }
+ }
private def refreshBlacklistedNodes(): Unit = {
removeExpiredYarnBlacklistedNodes()
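To see why the one-line guard matters, compare the old and new conditions in isolation (simplified sketch; names are illustrative, and the real fix also logs a warning):

```scala
// Old check: with numClusterNodes == 0 during an RM hiccup,
// 0 >= 0 is true, so the app fails with "all nodes are blacklisted".
object OldVsNewCheck {
  def oldCheck(blacklisted: Int, total: Int): Boolean =
    blacklisted >= total

  // New check: a non-positive node count can never mean "all blacklisted".
  def newCheck(blacklisted: Int, total: Int): Boolean =
    total > 0 && blacklisted >= total
}

// OldVsNewCheck.oldCheck(0, 0)  == true  -> spurious exit code 11
// OldVsNewCheck.newCheck(0, 0)  == false -> app keeps running
```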
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org