Posted to yarn-issues@hadoop.apache.org by "chaoli (Jira)" <ji...@apache.org> on 2019/09/09 14:00:00 UTC

[jira] [Commented] (YARN-9823) NodeManager cannot get right ResourceTrack address in Federation mode

    [ https://issues.apache.org/jira/browse/YARN-9823?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16925714#comment-16925714 ] 

chaoli commented on YARN-9823:
------------------------------

This is a bug: the RM does not deliver the resource tracker address in the federation heartbeat service.

There is also no code to handle the resource tracker address in *FederationRMFailoverProxyProvider.updateRMAddress*

!image-2019-09-09-21-58-09-017.png!
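
To make the gap described above concrete, here is a minimal, self-contained sketch. It is not the Hadoop source: the {{Protocol}} enum, the {{PublishedAddresses}} class and the host name {{active.rm.server}} are hypothetical, and the three handled branches are an assumption about which protocols the provider covers. Only the configuration key names and the {{standby.rm.server:8031}} value come from YARN and the log in this issue. It shows that when nothing handles the resource tracker protocol, the previously configured address is left untouched.

```java
import java.util.HashMap;
import java.util.Map;

final class UpdateRmAddressSketch {

    // Hypothetical protocol tags; the real provider works with protocol classes.
    enum Protocol { APPLICATION_CLIENT, APPLICATION_MASTER, RM_ADMIN, RESOURCE_TRACKER }

    // Hypothetical model of the addresses an RM publishes for its sub-cluster.
    // There is deliberately no resource tracker field, mirroring the comment.
    static final class PublishedAddresses {
        final String client;
        final String amrm;
        final String admin;

        PublishedAddresses(String client, String amrm, String admin) {
            this.client = client;
            this.amrm = amrm;
            this.admin = admin;
        }
    }

    // Overwrites the configured RM address, but only for protocols it knows about.
    static void updateRMAddress(Map<String, String> conf, Protocol protocol,
                                PublishedAddresses active) {
        switch (protocol) {
            case APPLICATION_CLIENT:
                conf.put("yarn.resourcemanager.address", active.client);
                break;
            case APPLICATION_MASTER:
                conf.put("yarn.resourcemanager.scheduler.address", active.amrm);
                break;
            case RM_ADMIN:
                conf.put("yarn.resourcemanager.admin.address", active.admin);
                break;
            default:
                // RESOURCE_TRACKER falls through here: nothing to copy,
                // so whatever address was configured before stays in place.
                break;
        }
    }

    // Scenario: the NM asks for a ResourceTracker proxy after a failover.
    static String resourceTrackerAddressAfterUpdate() {
        Map<String, String> conf = new HashMap<>();
        conf.put("yarn.resourcemanager.resource-tracker.address", "standby.rm.server:8031");
        PublishedAddresses active = new PublishedAddresses(
            "active.rm.server:8032", "active.rm.server:8030", "active.rm.server:8033");
        updateRMAddress(conf, Protocol.RESOURCE_TRACKER, active);
        return conf.get("yarn.resourcemanager.resource-tracker.address");
    }

    public static void main(String[] args) {
        System.out.println(resourceTrackerAddressAfterUpdate());
    }
}
```

In this model the NM would keep dialing {{standby.rm.server:8031}}, which matches the retry loop in the log quoted below. A fix would presumably need both sides: the RM publishing its resource tracker address and the provider copying it into the configuration.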

> NodeManager cannot get right ResourceTrack address in Federation mode
> ---------------------------------------------------------------------
>
>                 Key: YARN-9823
>                 URL: https://issues.apache.org/jira/browse/YARN-9823
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: federation, nodemanager
>    Affects Versions: 2.9.2
>         Environment: h2. Hadoop:
> Hadoop 2.9.2 (some line numbers may not be right because we have merged some 3.0+ patches)
> Security with Kerberos
> configured according to [https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/Federation.html]
> h2. Java:
> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
> Kerberos:
>  
>  
>            Reporter: qiwei huang
>            Priority: Major
>
> {{The NM retries indefinitely to connect to the wrong RM's resource tracker port:}}
> {quote}{{INFO [main:RetryInvocationHandler@411] - java.net.ConnectException: Call From standby.rm.server/10.122.138.139 to standby.rm.server:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused, while invoking ResourceTrackerPBClientImpl.registerNodeManager over dev1 after 19 failover attempts. Trying to failover after sleeping for 40497ms.}}
> {quote}
>  
> {{After changing *yarn.client.failover-proxy-provider* to *org.apache.hadoop.yarn.server.federation.failover.FederationRMFailoverProxyProvider*, the NodeManager cannot find the right ResourceTracker address:}}
> {quote}{{getRMHAId:233, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfKeyForRMInstance:294, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfValueForRMInstance:302, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getConfValueForRMInstance:314, HAUtil (org.apache.hadoop.yarn.conf)}}
> {{getSocketAddr:3341, YarnConfiguration (org.apache.hadoop.yarn.conf)}}
> {{getRMAddress:77, ServerRMProxy (org.apache.hadoop.yarn.server.api)}}
> {{run:144, FederationRMFailoverProxyProvider$1 (org.apache.hadoop.yarn.server.federation.failover)}}
> {{doPrivileged:-1, AccessController (java.security)}}
> {{doAs:422, Subject (javax.security.auth)}}
> {{doAs:1893, UserGroupInformation (org.apache.hadoop.security)}}
> {{getProxyInternal:141, FederationRMFailoverProxyProvider (org.apache.hadoop.yarn.server.federation.failover)}}
> {{performFailover:192, FederationRMFailoverProxyProvider (org.apache.hadoop.yarn.server.federation.failover)}}
> {{failover:217, RetryInvocationHandler$ProxyDescriptor (org.apache.hadoop.io.retry)}}
> {{processRetryInfo:149, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)}}
> {{processWaitTimeAndRetryInfo:142, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)}}
> {{invokeOnce:107, RetryInvocationHandler$Call (org.apache.hadoop.io.retry)}}
> {{invoke:359, RetryInvocationHandler (org.apache.hadoop.io.retry)}}
> {{registerNodeManager:-1, $Proxy85 (com.sun.proxy)}}
> {{registerWithRM:378, NodeStatusUpdaterImpl (org.apache.hadoop.yarn.server.nodemanager)}}
> {{serviceStart:252, NodeStatusUpdaterImpl (org.apache.hadoop.yarn.server.nodemanager)}}
> {{start:194, AbstractService (org.apache.hadoop.service)}}
> {{serviceStart:121, CompositeService (org.apache.hadoop.service)}}
> {{start:194, AbstractService (org.apache.hadoop.service)}}
> {{initAndStartNodeManager:864, NodeManager (org.apache.hadoop.yarn.server.nodemanager)}}
> {{main:931, NodeManager (org.apache.hadoop.yarn.server.nodemanager)}}
> {quote}
> {{The provider tries to find the active RM's address in }}*{{getRMHAId:233}}*{{, but it cannot find the right one because that method can only return the RM whose address is local:}}
> {quote}{{if (!s.isUnresolved() && NetUtils.isLocalAddress(s.getAddress())) {}}
> {{ currentRMId = rmId.trim();}}
> {{ found++;}}
> {{}}}
> {quote}
> {{If the NM and an RM are on the same node and that RM is in standby state, the NM will keep sending RPCs to that RM indefinitely.}}
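
The selection logic quoted in the description can be reproduced outside Hadoop with a small sketch. It is a simplified stand-in, not the real {{HAUtil.getRMHAId}}: the RM ids {{rm1}} and {{rm2}}, the second IP address, the injected locality check (standing in for {{NetUtils.isLocalAddress}}) and the choice to return null unless exactly one RM matches are all assumptions made for illustration. The point it demonstrates is that the RM id is chosen because its address is local, not because that RM is active.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Predicate;

final class LocalRmIdSketch {

    // Simplified stand-in for the quoted loop: an RM id is selected only
    // because its address is resolved and local, never because it is active.
    static String pickLocalRmId(Map<String, InetSocketAddress> rmAddresses,
                                Predicate<InetAddress> isLocalAddress) {
        String currentRMId = null;
        int found = 0;
        for (Map.Entry<String, InetSocketAddress> entry : rmAddresses.entrySet()) {
            InetSocketAddress s = entry.getValue();
            if (!s.isUnresolved() && isLocalAddress.test(s.getAddress())) {
                currentRMId = entry.getKey().trim();
                found++;
            }
        }
        // Assumption of this sketch: anything but exactly one match yields null.
        return found == 1 ? currentRMId : null;
    }

    // Scenario from the report: the NM shares a host with the standby RM.
    static String colocatedStandbyScenario() {
        try {
            // 10.122.138.139 is the NM/standby host seen in the log above.
            InetAddress nmHost =
                InetAddress.getByAddress(new byte[] {10, 122, (byte) 138, (byte) 139});
            // Hypothetical address for the host running the active RM.
            InetAddress otherHost =
                InetAddress.getByAddress(new byte[] {10, 122, (byte) 138, (byte) 140});

            Map<String, InetSocketAddress> rms = new LinkedHashMap<>();
            rms.put("rm1", new InetSocketAddress(otherHost, 8031)); // assumed active
            rms.put("rm2", new InetSocketAddress(nmHost, 8031));    // assumed standby

            // Locality check injected as a predicate so the sketch runs anywhere.
            return pickLocalRmId(rms, nmHost::equals);
        } catch (UnknownHostException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(colocatedStandbyScenario()); // prints "rm2"
    }
}
```

Under these assumptions the sketch selects {{rm2}}, the co-located standby, so every failover attempt resolves back to the same unreachable resource tracker port.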



--
This message was sent by Atlassian Jira
(v8.3.2#803003)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org