You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Kevin Wikant (Jira)" <ji...@apache.org> on 2022/08/09 15:15:00 UTC

[jira] [Created] (YARN-11249) YARN Client Retry Policy inadvertently changed several years ago

Kevin Wikant created YARN-11249:
-----------------------------------

             Summary: YARN Client Retry Policy inadvertently changed several years ago
                 Key: YARN-11249
                 URL: https://issues.apache.org/jira/browse/YARN-11249
             Project: Hadoop YARN
          Issue Type: Bug
          Components: client
            Reporter: Kevin Wikant


A non-HA & non-Federation YARN client should be using [OtherThanRemoteAndSaslExceptionDependentRetry|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L303]] (previously named OtherThanRemoteExceptionDependentRetry).

An HA or Federation YARN client should be using [FailoverOnNetworkExceptionRetry|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L267]].

However, the logic surrounding which Retry Policy gets used (by YARN client) has been inconsistent/unexpected due to past changes in the code.

[This PR|[https://github.com/apache/hadoop/commit/bdfad4523f1a5a776e20773c371ce99d0c538ac1]] introduced what is arguably a bug where FailoverOnNetworkExceptionRetry would get used for non-HA & non-Federation YARN clusters because of the default yarn-site configuration ["yarn.federation.failover.enabled=true"|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L102]]. This was arguably a bug because "yarn.federation.failover.enabled=true" even if YARN federation is not enabled at all. The consequence of this change is that all over the world non-HA & non-Federation YARN clusters using a Hadoop release from after 2017-09 were unexpectedly using FailoverOnNetworkExceptionRetry when in reality they should be using OtherThanRemoteAndSaslExceptionDependentRetry. The only YARN clusters (using a Hadoop release from after 2017-09) that would not be impacted by the retry policy change would be those with a custom yarn-site configuration of "yarn.federation.failover.enabled=false"; I assume this is probably a fairly small fraction of YARN clusters around the world and so I suspect this retry policy change affected the majority of clusters around the world using an affected Hadoop release.

The code was recently corrected to change the Retry Policy used for non-HA & non-Federation YARN clusters; in 2022-07 this [change|[https://github.com/apache/hadoop/commit/213ea037589d8a69c34e1f98413ddbc06dfd42bf]] was committed which causes the correct Retry Policy of OtherThanRemoteAndSaslExceptionDependentRetry to be used for non-HA & non-Federation YARN clusters. The change was to return that [federation failover is not enabled unless federation is enabled|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api/src/main/java/org/apache/hadoop/yarn/conf/HAUtil.java#L70]].

Hadoop clusters based on Hadoop releases created between the following approximate time ranges will have a different "default" YARN client retry policy (i.e. "default" retry policy in the sense that you will get it with a default yarn-site file containing "yarn.federation.failover.enabled=true"):
 - before 2017-09 ; YARN uses OtherThanRemoteAndSaslExceptionDependentRetry by default
 - 2017-09 to 2022-07 ; YARN uses FailoverOnNetworkExceptionRetry by default
 - 2022-07 into future ; YARN uses OtherThanRemoteAndSaslExceptionDependentRetry by default

Part of the problem here is that [FailoverOnNetworkExceptionRetry|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L698]] is arguably a superior retry policy than [OtherThanRemoteAndSaslExceptionDependentRetry|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L610]] because:
 - FailoverOnNetworkExceptionRetry will not retry on certain non-retryable exceptions such as [InvalidToken|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L729]] & [AccessControlException|[https://github.com/apache/hadoop/blob/213ea037589d8a69c34e1f98413ddbc06dfd42bf/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/io/retry/RetryPolicies.java#L732]]
 - OtherThanRemoteAndSaslExceptionDependentRetry will retry for 15 minutes on certain non-retryable exceptions such as [InvalidToken & AccessControlException|[https://github.com/apache/hadoop/blob/trunk/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common/src/main/java/org/apache/hadoop/yarn/client/RMProxy.java#L291]]

My concern is that YARN users who were previously seeing their YARN client not retry on InvalidToken or AccessControlException will now start to see the YARN client retry for 15 minutes on these exceptions because the Retry Policy for non-HA & non-Federation YARN client has been reverted back to OtherThanRemoteAndSaslExceptionDependentRetry.

I see a few options to correct this:
 - remove OtherThanRemoteAndSaslExceptionDependentRetry entirely & always use FailoverOnNetworkExceptionRetry
 -- the benefit of this approach is that it reduces lines of code & helps avoid cases of exception handling code divergence by maintaining multiple different retry policies
 -- I don't see any major risk to doing this because all Hadoop releases since after 2017-09 have been using FailoverOnNetworkExceptionRetry without the sysadmins even being aware of it & without any issues reported
 - bring OtherThanRemoteAndSaslExceptionDependentRetry to parity with FailoverOnNetworkExceptionRetry in terms of not retrying on InvalidToken or AccessControlException
 -- downside is that the exception handling in these retry policies may diverge again in the future
 - refactor so that both retry policies share the same exception handling code but may differ in other areas such as the RetryAction being returned
 -- may be as simple as using FailoverOnNetworkExceptionRetry's exception handling logic in a parent class but to return different RetryAction depending on subclass (i.e. FAILOVER_AND_RETRY for HA/federation vs RETRY for non-HA/non-federation)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org