Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/10/22 09:37:42 UTC

[GitHub] [spark] AngersZhuuuu opened a new pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

AngersZhuuuu opened a new pull request #34366:
URL: https://github.com/apache/spark/pull/34366


   ### What changes were proposed in this pull request?
   
   Some yarn-cluster applications fail with an exception like the following:
   ```
   21/10/20 03:31:55 ERROR Client: Application diagnostics message: Application application_1632999510150_2163647 failed 1 times (global limit =8; local limit is =1) due to AM Container for appattempt_1632999510150_2163647_000001 exited with  exitCode: 0
   Failing this attempt.Diagnostics: For more detailed output, check the application tracking page: http://ip-xx-xx-xx-xx.idata-server.shopee.io:8088/cluster/app/application_1632999510150_2163647 Then click on links to logs of each attempt.
   . Failing the application.
   Exception in thread "main" org.apache.spark.SparkException: Application application_1632999510150_2163647 finished with failed status
   ```
   
   It's caused by the following sequence of events:
   1. In yarn-cluster mode, the application's user code finishes and the AM shutdown hook is triggered.
   2. The AM calls unregister on the RM, but the call times out. Because the AM shutdown hook wraps the call in a try/catch, no exception is thrown, so the AM container exits with code 0 (the application user code ran successfully).
   3. Since the RM lost the connection to the AM, it treats the attempt's final status as failed.
   4. The client then receives an application report with a failed final status but an AM container exit code of 0. The client treats this as a failure and retries.


   This retry is unnecessary, and we can avoid it.
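   As a minimal illustration (a standalone sketch, not the exact patch; `AmExitCheck` and `amContainerExitedCleanly` are hypothetical names), the check this PR adds to `Client.scala` boils down to a string test on the RM diagnostics:

   ```scala
   // Minimal sketch of the proposed check: treat the failed attempt as a clean
   // AM exit when the RM diagnostics mention the AM container exiting with code 0.
   object AmExitCheck {
     def amContainerExitedCleanly(diagnostics: Option[String]): Boolean =
       diagnostics.exists(d => d.contains("AM Container") && d.contains("exitCode: 0"))

     def main(args: Array[String]): Unit = {
       val diag = "AM Container for appattempt_1632999510150_2163647_000001 " +
         "exited with  exitCode: 0"
       assert(amContainerExitedCleanly(Some(diag)))
       assert(!amContainerExitedCleanly(Some("exited with exitCode: 13")))
       assert(!amContainerExitedCleanly(None))
     }
   }
   ```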
   
   
   ### Why are the changes needed?
   Avoid unnecessary retry
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   
   Manually tested


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] AngersZhuuuu commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949458437


   ping @tgravescs @mridulm .
   
   One consideration: do we need to add a conf like `spark.yarn.retryOnlyAmContainerExitFailed`, so that the client skips the retry only when the AM container exits with 0 and this conf is true?




[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960417698


   **[Test build #144890 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144890/testReport)** for PR 34366 at commit [`edd8f2e`](https://github.com/apache/spark/commit/edd8f2e5157d6e916348431da2c40493c88033e7).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r742540317



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > > getDriverLogsLink
   > 
   > Nice suggestion! updated
   
   Tested in our prod; it throws this exception:
   ```
   Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ContainerNotFoundException): Container with id 'container_e240_1632999510150_3975819_02_000001' doesn't exist in RM.
   	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:532)
   	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
   	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:645)
   	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:622)
   	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:590)
   	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
   	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1129)
   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1056)
   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:984)
   	at java.base/java.security.AccessController.doPrivileged(Native Method)
   	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
   	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1972)
   	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3039)
   ```






[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r743349589



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > Thanks for the details @AngersZhuuuu, this is unfortunate. I wanted to avoid parsing from diagnostic, given it is potentially brittle - without timeline service, I am not sure how to extract this without parsing the string. Any thoughts @tgravescs ?
   
   Yeah, we could get this from the timeline service, but that's also awkward..

##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > yeah unfortunately that final application status is what YARN really uses and advertises and it really wants you to unregister.
   >
   > How often does this really happen? it seems like if you are timing out talking to RM very often you have other problems on your cluster. Or does this happen on rolling upgrade or something?
   
   This situation does not happen often, but it does happen from time to time and causes many users to complain. Our cluster is relatively large and the load on YARN is very high, so some requests occasionally respond very slowly, which triggers this issue.
   
   
   > Note, If we were to do this, then the YARN final status also wouldn't match because doesn't that show up as failed when it can't unregister? that could confuse the user.
   
   Users care more about the application's running status than about YARN's info. If a job succeeds, they won't check the application on the YARN UI. For YARN admins, the information for the job is still the same as before.
   
   
   We could also add a legacy configuration for this.
   






[GitHub] [spark] tgravescs commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
tgravescs commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r746690053



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       so I guess I'm kind of wondering whether we should make any change here.
   This doesn't fail the application; it may be unwanted behavior in your case that it retries sometimes, but if it's doing that too often it seems like a cluster issue to me.  The Hadoop client has built-in retries and timeouts that are supposed to be configured to account for some of these, so going beyond those should be an out-of-the-norm occurrence.  The Hadoop philosophy, IMHO, is to retry on failures, so if this occasionally happens it shouldn't be a big deal.  If it is, it's up to the application to handle.  The Hadoop API for properly checking application status is what we are using; this change tries to infer the status from a string, which could be brittle and just adds maintenance for Spark if we add yet another config.
   
   Note Hadoop also has a shutdown hook configuration, which I think is the 30 seconds you are referring to here: `hadoop.service.shutdown.timeout`.  You could increase that along with the Hadoop retries/timeouts.
   
   It would be nice to get more feedback on whether this is a problem for other users, and whether devs have opinions.
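   For context, the shutdown-hook knob mentioned above is a regular Hadoop configuration property; a minimal example (the value is illustrative, not a recommendation):

   ```xml
   <!-- core-site.xml: give service shutdown hooks (including the AM's) more time -->
   <property>
     <name>hadoop.service.shutdown.timeout</name>
     <value>60s</value>
   </property>
   ```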






[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960437772


   **[Test build #144891 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144891/testReport)** for PR 34366 at commit [`bbdcb1f`](https://github.com/apache/spark/commit/bbdcb1faca24c1ae897de3f68c6c1e3b1fe97749).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] mridulm commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
mridulm commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r742248359



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Instead of parsing this from diagnostics, we can use AM's getContainerExitStatus and check for that being 0 ?

##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Instead of parsing this from diagnostics, we can use AM's getContainerExitStatus and check for that being 0 ?
   Take a look at `getDriverLogsLink` on how to get to driver container

##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Instead of parsing this from diagnostics, we can use AM's getContainerExitStatus and check for that being 0 ?
   Take a look at `getDriverLogsLink` on how to get to AM container
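   For reference, the suggested lookup might be sketched roughly as below (assumed usage of the public YARN client API, not runnable outside a live cluster; `amContainerExitStatus` is a hypothetical helper, and as reported later in this thread, the RM may have already cleaned up the attempt, in which case the lookup fails):

   ```scala
   import org.apache.hadoop.yarn.api.records.ApplicationId
   import org.apache.hadoop.yarn.client.api.YarnClient

   object AmExitStatus {
     // Hypothetical helper: ask the RM for the AM container's exit status
     // instead of parsing the diagnostics string. Returns None when the RM
     // no longer has the report (e.g. ContainerNotFoundException).
     def amContainerExitStatus(yarnClient: YarnClient, appId: ApplicationId): Option[Int] =
       try {
         val attemptId = yarnClient.getApplicationReport(appId).getCurrentApplicationAttemptId
         val amContainerId = yarnClient.getApplicationAttemptReport(attemptId).getAMContainerId
         Some(yarnClient.getContainerReport(amContainerId).getContainerExitStatus)
       } catch {
         case _: Exception => None
       }
   }
   ```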






[GitHub] [spark] AngersZhuuuu removed a comment on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu removed a comment on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949458437


   ping @tgravescs @mridulm .
   
   One consideration: do we need to add a conf like `spark.yarn.retryOnlyAmContainerExitFailed`, so that the client skips the retry only when the AM container exits with 0 and this conf is true?




[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960424349


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49360/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960429631


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144890/
   




[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960447380


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49360/
   




[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949465053


   **[Test build #144534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144534/testReport)** for PR 34366 at commit [`beefd1f`](https://github.com/apache/spark/commit/beefd1fb4d0ccc51f1a250a9d727302f3891574c).




[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949484832


   **[Test build #144534 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144534/testReport)** for PR 34366 at commit [`beefd1f`](https://github.com/apache/spark/commit/beefd1fb4d0ccc51f1a250a9d727302f3891574c).
    * This patch passes all tests.
    * This patch merges cleanly.
    * This patch adds no public classes.




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949556008


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49005/
   




[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960431066


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49361/
   




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r744446115



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > Users complain because miss SLA or something? Retry should be an acceptable thing sometimes when failures happen. But I get it that its annoying.
   
   Yeah, really annoying, and it causes missed SLAs.
   
   
   > What is the timeout like in this case? Is it the YARN call timed out but would have answered within a few seconds or is it YARN isn't going to respond for 10's of minutes or hours?
   
   There is a shutdown hook timeout of 30s.
   
   > The other thing we could possibly do is look at adding optional retry logic in Spark on top of YARN client retry that is configurable where we say we really want to unregister, so wait and retry in addition to the built in Hadoop retry logic. That would delay the client knowing its done but would NOT retry and would keep YARN state consistent. Obviously would need others input on this as well. This is one reason why I'm asking about when this happens and exact circumstances so that we could possibly come up with alternate solutions.
   
   We would still need the retry. The best fix would be to handle this case on the YARN side: add optional logic so that when the AM container exits with code 0, the RM returns a SUCCESS final status.
   






[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r742569341



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > Instead of parsing this from diagnostics, we can use AM's getContainerExitStatus and check for that being 0 ? Take a look at `getDriverLogsLink` on how to get to AM container
   
   Tested in our prod environment and checked YARN's code:
   
   ![image](https://user-images.githubusercontent.com/46485123/140268608-c93b4f90-0d99-43be-97c0-d66ed4f49169.png)
   
   The RM directly deletes the application attempts once the application is done, so the suggested code throws an exception:
   ```
   Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.yarn.exceptions.ContainerNotFoundException): Container with id 'container_e240_1632999510150_3975819_02_000001' doesn't exist in RM.
   	at org.apache.hadoop.yarn.server.resourcemanager.ClientRMService.getContainerReport(ClientRMService.java:532)
   	at org.apache.hadoop.yarn.api.impl.pb.service.ApplicationClientProtocolPBServiceImpl.getContainerReport(ApplicationClientProtocolPBServiceImpl.java:466)
   	at org.apache.hadoop.yarn.proto.ApplicationClientProtocol$ApplicationClientProtocolService$2.callBlockingMethod(ApplicationClientProtocol.java:645)
   	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:622)
   	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:590)
   	at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:574)
   	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1129)
   	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1056)
   ```
   
   May need to extract from diagnostics...
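   Given the `ContainerNotFoundException` above, one defensive pattern (a sketch only — the exit-code fetch is abstracted as a function parameter rather than tied to any particular YARN client API) is to try the authoritative lookup first and fall back to diagnostics parsing only when it fails:

```scala
import scala.util.{Failure, Success, Try}

// Sketch: prefer an authoritative exit-code lookup; fall back to parsing the
// diagnostics string only when the lookup fails (e.g. the RM already deleted
// the attempt and throws ContainerNotFoundException).
def amExitedZero(fetchExitCode: () => Int, diagnostics: Option[String]): Boolean =
  Try(fetchExitCode()) match {
    case Success(code) => code == 0
    case Failure(_) =>
      diagnostics.exists(d => d.contains("AM Container") && d.contains("exitCode: 0"))
  }
```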






[GitHub] [spark] mridulm commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
mridulm commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r742248359




##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Instead of parsing this from diagnostics, we can use AM's getContainerExitStatus and check for that being 0 ?
   Take a look at `getDriverLogsLink` on how to get to AM container






[GitHub] [spark] mridulm commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
mridulm commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r742974924



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Thanks for the details @AngersZhuuuu, this is unfortunate.
   I wanted to avoid parsing this from the diagnostics, given it is potentially brittle - without the timeline service, I am not sure how to extract it without parsing the string.
   Any thoughts, @tgravescs?






[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r742495006



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > getDriverLogsLink
   
   Nice suggestion! updated






[GitHub] [spark] github-actions[bot] commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-1045409422


   We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
   If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!




[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949498911


   Kubernetes integration test starting
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49005/
   




[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949532487


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49005/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949508684


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144534/
   




[GitHub] [spark] AmplabJenkins commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960453844








[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r743349589



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > Thanks for the details @AngersZhuuuu, this is unfortunate. I wanted to avoid parsing from diagnostic, given it is potentially brittle - without timeline service, I am not sure how to extract this without parsing the string. Any thoughts @tgravescs ?
   
   Yeah, we could get this from the timeline service, but that also seems awkward.

##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > yeah unfortunately that final application status is what YARN really uses and advertises and it really wants you to unregister.
   >
   > How often does this really happen? it seems like if you are timing out talking to RM very often you have other problems on your cluster. Or does this happen on rolling upgrade or something?
   
   This situation does not happen often, but it does happen from time to time, and many users complain about it. Our cluster is relatively large and the load on YARN is very high, so some requests occasionally respond very slowly, which causes this issue.
   
   
   > Note, If we were to do this, then the YARN final status also wouldn't match because doesn't that show up as failed when it can't unregister? that could confuse the user.
   
   Users care more about the application's running status than about YARN's info. If the job succeeds, they won't check the application on the YARN UI. For the YARN admin, the information for this job is still the same as before.
   
   
   We could also add a legacy configuration to gate this behavior.
   






[GitHub] [spark] tgravescs commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
tgravescs commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r743858964



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
   Do users complain because they miss an SLA or something? A retry should be acceptable sometimes when failures happen, but I get that it's annoying.
   
   What is the timeout like in this case? Did the YARN call time out but would have answered within a few seconds, or is YARN not going to respond for tens of minutes or hours?
   
   The other thing we could possibly do is look at adding optional retry logic in Spark on top of the YARN client retry, configurable so that when we say we really want to unregister, we wait and retry in addition to the built-in Hadoop retry logic. That would delay the client knowing it's done, but would NOT retry and would keep YARN state consistent. Obviously we would need others' input on this as well. This is one reason why I'm asking about when this happens and the exact circumstances, so that we could possibly come up with alternate solutions.
   
   Like @mridulm  I'm not a fan of parsing the diagnostics.
   
   You can also increase your RPC timeout for talking to the ResourceManager.
   
   The other way is to handle this outside of Spark: the application has its own retry logic instead of Spark's, where it can check, for instance, whether it wrote the expected output; if it's there, don't retry, otherwise retry.
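   That last suggestion — retry outside Spark, skipping the retry when the job's effect is already visible — can be sketched generically (all names here are illustrative, not an existing API):

```scala
// Sketch: application-level retry that treats "expected output already exists"
// as success, so a false-negative FAILED status from YARN does not trigger a rerun.
def runWithRetry(submit: () => Boolean,
                 outputExists: () => Boolean,
                 maxAttempts: Int): Boolean = {
  var attempts = 0
  while (attempts < maxAttempts) {
    if (submit()) return true        // run reported success
    if (outputExists()) return true  // reported failure, but the work is done
    attempts += 1
  }
  false
}
```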
   
   






[GitHub] [spark] tgravescs commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
tgravescs commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r744864128



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > There is a shutdown hook timeout of 30s.
   
   This doesn't answer my question. How long does it take for YARN to actually respond in the cases where it is slow?
   
   The way the YARN API is supposed to work, the application master is meant to unregister; it is not meant to go off of exit codes. If you can't unregister, YARN doesn't really know what happened. Yes, perhaps YARN should expose container exit status better; feel free to propose a better API in YARN that we can use.
   
   Honestly, this is kind of a YARN cluster setup issue in my opinion: if your ResourceManager frequently stops responding for long periods of time, you need to fix that. It would be great if applications could work around it, but I don't want to put something into Spark that is brittle and could cause other issues.






[GitHub] [spark] github-actions[bot] closed pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
github-actions[bot] closed pull request #34366:
URL: https://github.com/apache/spark/pull/34366


   




[GitHub] [spark] AmplabJenkins commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949556008


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49005/
   




[GitHub] [spark] SparkQA removed a comment on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949465053


   **[Test build #144534 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144534/testReport)** for PR 34366 at commit [`beefd1f`](https://github.com/apache/spark/commit/beefd1fb4d0ccc51f1a250a9d727302f3891574c).




[GitHub] [spark] tgravescs commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
tgravescs commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r743114402



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       yeah unfortunately that final application status is what YARN really uses and advertises and it really wants you to unregister.
   
   How often does this really happen? It seems like if you are timing out talking to the RM very often, you have other problems on your cluster. Or does this happen on a rolling upgrade or something?
   
   Note, If we were to do this, then the YARN final status also wouldn't match because doesn't that show up as failed when it can't unregister?  that could confuse the user.
   
   






[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r743351517



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > yeah unfortunately that final application status is what YARN really uses and advertises and it really wants you to unregister.
   >
   > How often does this really happen? it seems like if you are timing out talking to RM very often you have other problems on your cluster. Or does this happen on rolling upgrade or something?
   
   This situation does not happen often, but it does happen from time to time, and many users complain about it. Our cluster is relatively large and the load on YARN is very high, so some requests occasionally respond very slowly, which causes this issue.
   
   
   > Note, If we were to do this, then the YARN final status also wouldn't match because doesn't that show up as failed when it can't unregister? that could confuse the user.
   
   Users generally care more about the application's running status than about YARN's info. If the job succeeds, they won't check the application on the YARN UI. For YARN admins, the information for this job is still the same as before.
   
   
   We could also add a legacy configuration for this.
   






[GitHub] [spark] mridulm commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
mridulm commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r742248359



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Instead of parsing this from the diagnostics, we could use the AM's getContainerExitStatus and check that it is 0?
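A minimal sketch of that idea, assuming the client can obtain the AM container's exit status (e.g. from a `ContainerReport`). The helper below is hypothetical and only shows the decision; the YARN lookup itself is omitted, and `ContainerExitSuccess` mirrors YARN's `ContainerExitStatus.SUCCESS` (0):

```scala
// Hypothetical decision helper: fail the client only when the application
// failed AND the AM container did not exit cleanly. Obtaining the exit
// status (via YarnClient / ContainerReport) is not shown here.
object AmExitCheck {
  val ContainerExitSuccess = 0 // mirrors YARN's ContainerExitStatus.SUCCESS

  def shouldThrow(appFailed: Boolean, amExitStatus: Int): Boolean =
    appFailed && amExitStatus != ContainerExitSuccess
}
```

With this shape, the string checks on the diagnostics would disappear entirely, which addresses the brittleness concern raised in this thread.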






[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960430164


   **[Test build #144891 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144891/testReport)** for PR 34366 at commit [`bbdcb1f`](https://github.com/apache/spark/commit/bbdcb1faca24c1ae897de3f68c6c1e3b1fe97749).




[GitHub] [spark] AmplabJenkins commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960429631


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144890/
   




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r743349589



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > Thanks for the details @AngersZhuuuu, this is unfortunate. I wanted to avoid parsing from diagnostic, given it is potentially brittle - without timeline service, I am not sure how to extract this without parsing the string. Any thoughts @tgravescs ?
   
   Yeah, we could get this from the timeline service, but that also feels awkward.






[GitHub] [spark] SparkQA removed a comment on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960397477


   **[Test build #144890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144890/testReport)** for PR 34366 at commit [`edd8f2e`](https://github.com/apache/spark/commit/edd8f2e5157d6e916348431da2c40493c88033e7).




[GitHub] [spark] AmplabJenkins commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960454747


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49361/
   




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960453844








[GitHub] [spark] tgravescs commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
tgravescs commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r743114402



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Yeah, unfortunately that final application status is what YARN really uses and advertises, and it really wants you to unregister.
   
   How often does this really happen? It seems like if you are timing out talking to the RM very often, you have other problems on your cluster. Or does this happen on rolling upgrade or something?
   
   Note: if we were to do this, the YARN final status also wouldn't match, because doesn't the application show up as failed when the AM can't unregister? That could confuse the user.
   
   

##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Users complain because they miss an SLA or something? A retry should be acceptable sometimes when failures happen, but I get that it's annoying.
   
   What is the timeout like in this case? Did the YARN call time out but would have answered within a few seconds, or is YARN not going to respond for tens of minutes or hours?
   
   The other thing we could possibly do is add configurable, optional retry logic in Spark on top of the YARN client retry, where we say we really want to unregister, so we wait and retry in addition to the built-in Hadoop retry logic. That would delay the client knowing it's done, but would NOT retry the application and would keep the YARN state consistent. Obviously this would need input from others as well. This is one reason I'm asking about when this happens and the exact circumstances, so that we can possibly come up with alternate solutions.
   
   Like @mridulm, I'm not a fan of parsing the diagnostics.
   
   You can also increase your RPC timeout for talking to the ResourceManager.
   
   The other way is to handle this outside of Spark: the application has the retry logic instead of Spark, where it can check whether, for instance, it wrote the expected output, and if it's there, don't retry; otherwise, retry.
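The "retry the unregister on top of the built-in client retry" idea could be sketched roughly as follows; the name, attempt count, and delay are illustrative, not actual Spark code:

```scala
// Illustrative retry wrapper: re-attempt an operation (e.g. the AM's
// unregister call) a few times with a pause before giving up. This is a
// sketch of the idea discussed above, not Spark's implementation.
def retryOp[T](maxAttempts: Int, delayMs: Long)(op: () => T): T = {
  var attempt = 1
  while (true) {
    try {
      return op() // success: hand the result back immediately
    } catch {
      case _: Exception if attempt < maxAttempts =>
        attempt += 1
        Thread.sleep(delayMs) // transient failure: pause, then try again
    }
  }
  throw new IllegalStateException("unreachable")
}
```

The AM would call something like `retryOp(3, 5000)(() => unregister(...))` in its shutdown path, trading a longer shutdown for a consistent final status in YARN.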
   
   







[GitHub] [spark] AngersZhuuuu commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-955970249


   ping @tgravescs @mridulm .
   
   One consideration: we could add a conf like `spark.yarn.retryOnlyAmContainerExitFailed`, so that only when the AM container exits with 0 and this conf is true do we skip the retry.
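If such a conf were added, the client-side gate might look roughly like this. Note that `spark.yarn.retryOnlyAmContainerExitFailed` is only the name proposed in this comment, not an existing Spark configuration, and the string check mirrors the PR diff (which reviewers note is brittle):

```scala
// Hypothetical client-side gate: skip the failure (and hence YARN's retry)
// only when the proposed conf is enabled AND the diagnostics indicate the
// AM container exited with code 0. The conf name is the one proposed in
// this thread, not an existing Spark setting.
def suppressClientFailure(retryOnlyAmContainerExitFailed: Boolean,
                          diagnostics: Option[String]): Boolean = {
  val amExitedCleanly = diagnostics.exists { d =>
    d.contains("AM Container") && d.contains("exitCode: 0")
  }
  retryOnlyAmContainerExitFailed && amExitedCleanly
}
```

Defaulting the conf to false would preserve today's behavior, which is the usual pattern for legacy/escape-hatch configs in Spark.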




[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960397477


   **[Test build #144890 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144890/testReport)** for PR 34366 at commit [`edd8f2e`](https://github.com/apache/spark/commit/edd8f2e5157d6e916348431da2c40493c88033e7).




[GitHub] [spark] mridulm commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
mridulm commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r742248359



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Instead of parsing this from diagnostics, we can use AM's getContainerExitStatus and check for that being 0 ?
   Take a look at `getDriverLogsLink` on how to get to driver container

##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Instead of parsing this from diagnostics, we can use AM's getContainerExitStatus and check for that being 0 ?
   Take a look at `getDriverLogsLink` on how to get to AM container






[GitHub] [spark] mridulm commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
mridulm commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r742974924



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       Thanks for the details @AngersZhuuuu, this is unfortunate.
   I wanted to avoid parsing the diagnostics, given it is potentially brittle - without the timeline service, I am not sure how to extract this without parsing the string.
   Any thoughts @tgravescs ?






[GitHub] [spark] SparkQA removed a comment on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA removed a comment on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960430164


   **[Test build #144891 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/144891/testReport)** for PR 34366 at commit [`bbdcb1f`](https://github.com/apache/spark/commit/bbdcb1faca24c1ae897de3f68c6c1e3b1fe97749).




[GitHub] [spark] AmplabJenkins removed a comment on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960454747


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder-K8s/49361/
   




[GitHub] [spark] SparkQA commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
SparkQA commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-960454737


   Kubernetes integration test status failure
   URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/49361/
   




[GitHub] [spark] AngersZhuuuu commented on a change in pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AngersZhuuuu commented on a change in pull request #34366:
URL: https://github.com/apache/spark/pull/34366#discussion_r745245862



##########
File path: resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
##########
@@ -1277,10 +1277,14 @@ private[spark] class Client(
     } else {
       val YarnAppReport(appState, finalState, diags) = monitorApplication(appId)
       if (appState == YarnApplicationState.FAILED || finalState == FinalApplicationStatus.FAILED) {
+        var amContainerSuccess = false
         diags.foreach { err =>
+          amContainerSuccess = err.contains("AM Container") && err.contains("exitCode: 0")
           logError(s"Application diagnostics message: $err")
         }
-        throw new SparkException(s"Application $appId finished with failed status")
+        if (!amContainerSuccess) {
+          throw new SparkException(s"Application $appId finished with failed status")
+        }

Review comment:
       > > There is a shutdown hook timeout of 30s.
   > 
   > This doesn't answer my question. How long does it take for YARN to actually respond in the cases this is slow?
   > 
   > The way YARN API is supposed to work, the application master is meant to unregister, it is not meant to go off of exit codes, if you can't unregister, YARN doesn't really know what happened. Yes perhaps YARN should expose container exit status better, feel free to go propose a better API in YARN that we can use.
   > 
   > honestly this is kind of a YARN cluster setup issue in my opinion, if your ResourceManager is not responding for long periods of times frequently, you need to fix that. It would be great if applications could work around that but I don't want to put something into Spark that is brittle and could cause other issues.
   
   This situation occurred during a previous YARN performance degradation. At that time we could check the RPC response time by enabling the debug log, but this time it was an intermittent issue, so it was impossible to reproduce and test. Previously there was a case where YARN accepted the unregister request but did not respond for more than 30 seconds.
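   For reference, the check in the posted diff assigns `amContainerSuccess` inside a `foreach`, so when there are multiple diagnostic lines the last line's result wins. A minimal standalone sketch of the same detection using `exists` instead (the `DiagnosticsCheck` object and method name here are hypothetical, for illustration only, not part of the PR):
   
   ```scala
   // Hypothetical sketch: detect the "AM container exited with code 0 but
   // application reported FAILED" pattern from YARN diagnostics text.
   // Using `exists` means a match on any diagnostic line is kept, rather
   // than being overwritten by later lines as in the foreach/var version.
   object DiagnosticsCheck {
     def amContainerExitedCleanly(diags: Seq[String]): Boolean =
       diags.exists { err =>
         err.contains("AM Container") && err.contains("exitCode: 0")
       }
   }
   ```
   
   With this shape, the client-side decision at the end of `monitorApplication` reduces to throwing the `SparkException` only when no diagnostic line shows a clean AM container exit.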






[GitHub] [spark] AmplabJenkins commented on pull request #34366: [SPARK-37097][YARN] yarn-cluster mode don't need to retry when AM container exit code 0 but application failed.

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on pull request #34366:
URL: https://github.com/apache/spark/pull/34366#issuecomment-949508684


   
   Refer to this link for build results (access rights to CI server needed): 
   https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/144534/
   

