Posted to commits@spark.apache.org by sr...@apache.org on 2020/06/01 14:47:30 UTC
[spark] branch master updated: [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
This is an automated email from the ASF dual-hosted git repository.
srowen pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/spark.git
The following commit(s) were added to refs/heads/master by this push:
new e70df2c [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
e70df2c is described below
commit e70df2cea46f71461d8d401a420e946f999862c1
Author: Yuexin Zhang <za...@gmail.com>
AuthorDate: Mon Jun 1 09:46:18 2020 -0500
[SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
### What changes were proposed in this pull request?
Improve the check logic for whether all node managers are really blacklisted.
### Why are the changes needed?
I observed that when the AM is out of sync with the ResourceManager, or the RM has trouble reporting back the current number of available NMs, something like the following happens:
...
20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see: http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately.
...
20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing.
...
then the Spark job suddenly runs into the AllNodeBlacklisted state:
...
20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)
...
but there are actually no blacklisted nodes in currentBlacklistedYarnNodes, and I do not see any blacklisting message from:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119
We should only return isAllNodeBlacklisted = true when numClusterNodes > 0 AND 'currentBlacklistedYarnNodes.size >= numClusterNodes'.
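The proposed guard can be sketched as a small standalone function (a simplified illustration only; the real method lives on YarnAllocatorBlacklistTracker and takes its inputs from the tracker's state rather than as parameters):

```scala
// Minimal sketch of the proposed check. Parameter names are illustrative.
object BlacklistCheckSketch {
  def isAllNodeBlacklisted(numClusterNodes: Int, numBlacklisted: Int): Boolean =
    if (numClusterNodes <= 0) {
      // The RM reported no available nodes (e.g. while the AM is resyncing):
      // treat this as "not all blacklisted" instead of failing the app.
      false
    } else {
      numBlacklisted >= numClusterNodes
    }
}
```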
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a minor logic change; no tests were added or changed.
Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue.
Authored-by: Yuexin Zhang <za...@gmail.com>
Signed-off-by: Sean Owen <sr...@gmail.com>
---
.../apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
index fa8c961..339d371 100644
--- a/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
+++ b/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala
@@ -103,7 +103,14 @@ private[spark] class YarnAllocatorBlacklistTracker(
refreshBlacklistedNodes()
}
- def isAllNodeBlacklisted: Boolean = currentBlacklistedYarnNodes.size >= numClusterNodes
+ def isAllNodeBlacklisted: Boolean = {
+ if (numClusterNodes <= 0) {
+ logWarning("No available nodes reported, please check Resource Manager.")
+ false
+ } else {
+ currentBlacklistedYarnNodes.size >= numClusterNodes
+ }
+ }
private def refreshBlacklistedNodes(): Unit = {
removeExpiredYarnBlacklistedNodes()
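To see why the one-line guard matters, compare the old and new conditions in isolation (simplified sketch; names are illustrative, and the real fix also logs a warning):

```scala
// Old check: with numClusterNodes == 0 during an RM hiccup,
// 0 >= 0 is true, so the app fails with "all nodes are blacklisted".
object OldVsNewCheck {
  def oldCheck(blacklisted: Int, total: Int): Boolean =
    blacklisted >= total

  // New check: a non-positive node count can never mean "all blacklisted".
  def newCheck(blacklisted: Int, total: Int): Boolean =
    total > 0 && blacklisted >= total
}

// OldVsNewCheck.oldCheck(0, 0)  == true  -> spurious exit code 11
// OldVsNewCheck.newCheck(0, 0)  == false -> app keeps running
```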
---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@spark.apache.org
For additional commands, e-mail: commits-help@spark.apache.org