Posted to yarn-issues@hadoop.apache.org by "Takanobu Asanuma (JIRA)" <ji...@apache.org> on 2018/06/12 00:56:00 UTC

[jira] [Moved] (YARN-8416) YARN in HA not failing over to a new resource manager.

     [ https://issues.apache.org/jira/browse/YARN-8416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Takanobu Asanuma moved HDFS-13669 to YARN-8416:
-----------------------------------------------

    Affects Version/s:     (was: 2.7.1)
                       2.7.1
                  Key: YARN-8416  (was: HDFS-13669)
              Project: Hadoop YARN  (was: Hadoop HDFS)

> YARN in HA not failing over to a new resource manager.
> ------------------------------------------------------
>
>                 Key: YARN-8416
>                 URL: https://issues.apache.org/jira/browse/YARN-8416
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: Danil Serdyuchenko
>            Priority: Major
>
> We are running YARN in HA mode with two RMs (rm1 and rm2). We hit an issue when recreating one of the RMs:
>  # Recreated the standby RM (rm2), which gave it a new IP.
>  # Stopped the active RM (rm1).
>  # NMs tried to fail over to rm2, but timed out because they were still connecting to the old IP.
>  # NMs reached the configured 30 failover retries and shut down.
> We get the following logs:
> {noformat}
> 18/06/06 15:36:32 WARN ipc.Client: Address change detected. Old: yarnrm2/x.x.x.x:8031 New: yarnrm2/y.y.y.y:8031
> 18/06/06 15:36:32 INFO retry.RetryInvocationHandler: Exception while invoking nodeHeartbeat of class ResourceTrackerPBClientImpl over rm2 after 25 fail over attempts. Trying to fail over after sleeping for 37191ms.
> org.apache.hadoop.net.ConnectTimeoutException: Call From ip-a-a-a-a/a.a.a.a to yarnrm2:8031 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=yarnrm2/x.x.x.x:8031]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1480)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
>         at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
>         at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:606)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
>         at com.sun.proxy.$Proxy29.nodeHeartbeat(Unknown Source)
>         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:596)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=yarnrm2/x.x.x.x:8031]
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1446)
>         ... 12 more{noformat}
> We see this exception, followed by a failover back to rm1, repeatedly until the failover limit of 30 is reached:
> {noformat}
> 18/06/06 15:39:44 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat over rm1. Not retrying because failovers (30) exceeded maximum allowed (30){noformat}
> From the logs, it appears that the timeouts happen because the client is still trying to connect to the old IP (x.x.x.x in the logs). Looking at the code of the Client class, after the updateAddress method call we would expect a retry against the new server IP (logged as "Retrying connect to server ...") up to ipc.client.connect.max.retries.on.timeouts times. However, we never see the retry logs; the call just fails with the exception. That setting is left at its default of 45 on all of our NMs.
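
The behavior the reporter expected can be illustrated with a minimal, self-contained Java sketch. It is not Hadoop's Client code: the class AddressRefreshSketch and the methods resolved and connectWithRefresh are hypothetical names, the resolver and connect attempt are stubbed with lambdas, and addresses from the 192.0.2.0/24 documentation range stand in for x.x.x.x and y.y.y.y. It only shows that a resolved InetSocketAddress keeps the IP it was built with, so a client must re-resolve and switch targets before retrying.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class AddressRefreshSketch {

    /** Builds a resolved socket address from a literal IPv4 address, without a DNS lookup. */
    static InetSocketAddress resolved(String host, int a, int b, int c, int d, int port) {
        byte[] raw = {(byte) a, (byte) b, (byte) c, (byte) d};
        try {
            return new InetSocketAddress(InetAddress.getByAddress(host, raw), port);
        } catch (UnknownHostException e) {
            // Only thrown for an illegal address length, which cannot happen with 4 bytes.
            throw new IllegalStateException(e);
        }
    }

    /**
     * Tries to connect, re-resolving the host after each failure and switching
     * to the fresh address when it differs from the one in use.
     * Returns the address that connected, or null once the retries are exhausted.
     */
    static InetSocketAddress connectWithRefresh(
            InetSocketAddress cached,
            Supplier<InetSocketAddress> resolver,
            Predicate<InetSocketAddress> tryConnect,
            int maxRetries) {
        InetSocketAddress target = cached;
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            if (tryConnect.test(target)) {
                return target;
            }
            InetSocketAddress fresh = resolver.get();
            if (!fresh.equals(target)) {
                // Same host name, different IP: the next attempt uses the new address.
                System.out.println("Address change detected. Old: " + target + " New: " + fresh);
                target = fresh;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Stand-ins for the old and new addresses of yarnrm2 (assumed values).
        InetSocketAddress oldAddr = resolved("yarnrm2", 192, 0, 2, 1, 8031);
        InetSocketAddress newAddr = resolved("yarnrm2", 192, 0, 2, 2, 8031);

        // Stubbed connect: only the new address is reachable.
        InetSocketAddress used = connectWithRefresh(oldAddr, () -> newAddr, newAddr::equals, 45);
        System.out.println("Connected to: " + used);
    }
}
```

In this sketch the first attempt against the old address fails, the address change is detected, and the second attempt goes to the new address. The logs above show the change being detected but the timeout still reporting the old address, which is the discrepancy this issue describes.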



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
