Posted to issues@spark.apache.org by "Nikita Gorbachevski (JIRA)" <ji...@apache.org> on 2019/07/08 16:19:00 UTC

[jira] [Commented] (SPARK-22876) spark.yarn.am.attemptFailuresValidityInterval does not work correctly

    [ https://issues.apache.org/jira/browse/SPARK-22876?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16880508#comment-16880508 ] 

Nikita Gorbachevski commented on SPARK-22876:
---------------------------------------------

Hi, I believe we should reopen this issue because it still exists. At the very least, the documentation for spark.yarn.am.attemptFailuresValidityInterval is misleading and should be updated, because this parameter currently has no effect; most likely it should not be mentioned at all.

However, the absence of a working validity interval is a serious flaw for long-running Spark Streaming applications on YARN.

I propose fixing the documentation in the scope of this ticket and creating two more tickets, in the YARN and SPARK projects, in order to actually implement this feature.

[~jerryshao] [~lucasmf] [~hyukjin.kwon] what do you think?

> spark.yarn.am.attemptFailuresValidityInterval does not work correctly
> ---------------------------------------------------------------------
>
>                 Key: SPARK-22876
>                 URL: https://issues.apache.org/jira/browse/SPARK-22876
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 2.2.0
>         Environment: hadoop version 2.7.3
>            Reporter: Jinhan Zhong
>            Priority: Minor
>              Labels: bulk-closed
>
> I assume we can use spark.yarn.maxAppAttempts together with spark.yarn.am.attemptFailuresValidityInterval to make a long-running application avoid stopping after an acceptable number of failures.
> But after testing, I found that the application always stops after failing n times (where n is the minimum of spark.yarn.maxAppAttempts and yarn.resourcemanager.am.max-attempts from the client yarn-site.xml).
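> As a simplified Scala sketch (illustrative only, not the actual Spark source; the function name is made up), the effective limit appears to be derived like this:
>
>     // Effective number of AM attempts used for the check:
>     // spark.yarn.maxAppAttempts, if set, capped by yarn.resourcemanager.am.max-attempts;
>     // spark.yarn.am.attemptFailuresValidityInterval never enters the calculation.
>     def effectiveMaxAttempts(sparkMaxAppAttempts: Option[Int], yarnMaxAttempts: Int): Int =
>       sparkMaxAppAttempts.map(math.min(_, yarnMaxAttempts)).getOrElse(yarnMaxAttempts)
>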
> For example, the following setup will allow the application master to fail 20 times:
> * spark.yarn.am.attemptFailuresValidityInterval=1s
> * spark.yarn.maxAppAttempts=20
> * yarn client: yarn.resourcemanager.am.max-attempts=20
> * yarn resource manager: yarn.resourcemanager.am.max-attempts=3
> And after checking the source code, I found that in ApplicationMaster.scala (https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala#L293)
> there is a ShutdownHook that checks the attempt id against maxAppAttempts: if attempt id >= maxAppAttempts, it will try to unregister the application and the application will finish.
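> Paraphrased as a Scala sketch (not the exact ApplicationMaster code; the function name here is made up), the decision described above is roughly:
>
>     import org.apache.hadoop.yarn.api.records.FinalApplicationStatus
>
>     // Sketch of the shutdown-hook decision: unregister from the RM (which makes
>     // YARN treat the application as finished) only on success or on the last
>     // allowed attempt. The failure validity interval plays no part in it.
>     def shouldUnregister(finalStatus: FinalApplicationStatus,
>                          attemptId: Int,
>                          maxAppAttempts: Int): Boolean =
>       finalStatus == FinalApplicationStatus.SUCCEEDED || attemptId >= maxAppAttempts
>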
> Is this an expected design or a bug?



