Posted to yarn-issues@hadoop.apache.org by "Prabhu Joseph (Jira)" <ji...@apache.org> on 2022/10/20 08:05:00 UTC
[jira] [Assigned] (YARN-11355) YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
[ https://issues.apache.org/jira/browse/YARN-11355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prabhu Joseph reassigned YARN-11355:
------------------------------------
Assignee: Vineeth Naroju (was: Prabhu Joseph)
> YARN Client Failovers immediately to rm2 but takes ~30000ms to rm3
> ------------------------------------------------------------------
>
> Key: YARN-11355
> URL: https://issues.apache.org/jira/browse/YARN-11355
> Project: Hadoop YARN
> Issue Type: Bug
> Components: client
> Affects Versions: 3.4.0
> Reporter: Prabhu Joseph
> Assignee: Vineeth Naroju
> Priority: Major
>
> The YARN client fails over to rm2 immediately but takes ~30000 ms to fail over to rm3 during the initial retry.
> *Repro:*
> {code:java}
> 1. YARN cluster with three master nodes: rm1, rm2, and rm3
> 2. rm3 is active
> 3. yarn node -list or any other YARN client call takes more than 30 seconds.
> {code}
> The initial failover to rm2 is immediate, but the failover to rm3 happens only after ~30000 ms. The current RetryPolicy does not take the number of master nodes into account. It should perform at least one immediate failover to every RM.
> {code:java}
> 2022-10-20 06:37:44,123 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm2
> 2022-10-20 06:37:44,129 INFO retry.RetryInvocationHandler: java.net.ConnectException: Call From local to remote:8032 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ApplicationClientProtocolPBClientImpl.getClusterNodes over rm2 after 1 failover attempts. Trying to failover after sleeping for 21139ms.
> {code}
>
> *Workaround:*
> Reduce yarn.resourcemanager.connect.retry-interval.ms from 30000 to a small value such as 100. This makes the failover to rm3 immediate, but it causes too many retries when there is no active ResourceManager.
>
>
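The behaviour requested in the description (at least one immediate failover to every RM before any sleep) could be sketched roughly as follows. This is a standalone illustration, not Hadoop's actual RetryPolicy code: the class FailoverDelaySketch and the method sleepBeforeFailoverMs are hypothetical names, and the fixed sleep after the first round is an assumption made for simplicity (the real policy may randomize or back off the delay).

```java
/**
 * Hypothetical sketch of a failover delay rule that takes the number of
 * configured ResourceManagers into account. Not part of Hadoop.
 */
final class FailoverDelaySketch {

    private FailoverDelaySketch() {
    }

    /**
     * @param failoversSoFar  failovers already performed by this call (0 before the first one)
     * @param rmCount         number of configured ResourceManagers (3 in the repro)
     * @param retryIntervalMs base interval, e.g. the value of
     *                        yarn.resourcemanager.connect.retry-interval.ms
     * @return milliseconds to sleep before the next failover
     */
    static long sleepBeforeFailoverMs(int failoversSoFar, int rmCount, long retryIntervalMs) {
        if (failoversSoFar < 0 || rmCount < 1 || retryIntervalMs < 0) {
            throw new IllegalArgumentException("arguments must be non-negative and rmCount >= 1");
        }
        // The first (rmCount - 1) failovers reach every other RM once, so do them immediately.
        if (failoversSoFar < rmCount - 1) {
            return 0L;
        }
        // Every RM has been tried once: start waiting between further failovers.
        // Assumption: a fixed interval; a real policy may add jitter or exponential backoff.
        return retryIntervalMs;
    }

    public static void main(String[] args) {
        int rmCount = 3;
        long retryIntervalMs = 30000L;
        for (int failover = 0; failover < 5; failover++) {
            System.out.println("failover #" + (failover + 1) + " sleeps "
                    + sleepBeforeFailoverMs(failover, rmCount, retryIntervalMs) + " ms");
        }
    }
}
```

With three RMs and rm3 active, the client in this sketch would move rm1 -> rm2 -> rm3 without sleeping, and the retry interval would only apply once all three have refused the connection. That would keep the default 30000 ms interval usable without the workaround above.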
--
This message was sent by Atlassian Jira
(v8.20.10#820010)