Posted to hdfs-dev@hadoop.apache.org by "Danil Serdyuchenko (JIRA)" <ji...@apache.org> on 2018/06/11 12:47:00 UTC

[jira] [Created] (HDFS-13669) YARN in HA not failing over to a new resource manager.

Danil Serdyuchenko created HDFS-13669:
-----------------------------------------

             Summary: YARN in HA not failing over to a new resource manager.
                 Key: HDFS-13669
                 URL: https://issues.apache.org/jira/browse/HDFS-13669
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: Danil Serdyuchenko


We are running YARN in HA mode (rm1 and rm2). We hit an issue when recreating one of the RMs.
 # Recreated the standby RM (rm2), which gave it a new IP.
 # Stopped the active RM (rm1).
 # NMs tried to fail over to rm2, but kept timing out because they were still using the old IP.
 # NMs reached the configured 30 failover attempts and shut down (a sketch of the relevant settings follows the steps).
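For context, here is a hedged sketch of the NM-side settings we assume are in play. The rm1 hostname and the exact values are illustrative; the property keys are the standard YARN/Hadoop ones.
{noformat}
// Sketch only: illustrative values, standard property keys.
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class HaSettingsSketch {
    public static YarnConfiguration nodeManagerConf() {
        YarnConfiguration conf = new YarnConfiguration();
        conf.setBoolean("yarn.resourcemanager.ha.enabled", true);
        conf.set("yarn.resourcemanager.ha.rm-ids", "rm1,rm2");
        conf.set("yarn.resourcemanager.hostname.rm1", "yarnrm1");  // assumed hostname for rm1
        conf.set("yarn.resourcemanager.hostname.rm2", "yarnrm2");  // hostname seen in the logs
        // Where we assume the cap of 30 failovers comes from.
        conf.setInt("yarn.client.failover-max-attempts", 30);
        // Per-connection socket-timeout retries we expected from ipc.Client (default 45).
        conf.setInt("ipc.client.connect.max.retries.on.timeouts", 45);
        return conf;
    }
}
{noformat}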

We get the following logs:
{noformat}
18/06/06 15:36:32 WARN ipc.Client: Address change detected. Old: yarnrm2/x.x.x.x:8031 New: yarnrm2/y.y.y.y:8031
18/06/06 15:36:32 INFO retry.RetryInvocationHandler: Exception while invoking nodeHeartbeat of class ResourceTrackerPBClientImpl over rm2 after 25 fail over attempts. Trying to fail over after sleeping for 37191ms.
org.apache.hadoop.net.ConnectTimeoutException: Call From ip-a-a-a-a/a.a.a.a to yarnrm2:8031 failed on socket timeout exception: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=yarnrm2/x.x.x.x:8031]; For more details see:  http://wiki.apache.org/hadoop/SocketTimeout
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
        at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
        at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:751)
        at org.apache.hadoop.ipc.Client.call(Client.java:1480)
        at org.apache.hadoop.ipc.Client.call(Client.java:1407)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
        at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
        at sun.reflect.GeneratedMethodAccessor2.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
        at com.sun.proxy.$Proxy29.nodeHeartbeat(Unknown Source)
        at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:596)
        at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while waiting for channel to be ready for connect. ch : java.nio.channels.SocketChannel[connection-pending remote=yarnrm2.grappler.eu-west-1.prod.aws.skyscanner.local/10.51.104.136:8031]
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:534)
        at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
        at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
        at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
        at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
        at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
        at org.apache.hadoop.ipc.Client.call(Client.java:1446)
        ... 12 more{noformat}
We get this, and the NMs fail over back to rm1, 30 times in total, until:
{noformat}
18/06/06 15:39:44 WARN retry.RetryInvocationHandler: Exception while invoking class org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat over rm1. Not retrying because failovers (30) exceeded maximum allowed (30){noformat}
From the logs it appears that the timeouts happen because the client keeps connecting to the old IP (x.x.x.x in the logs). Looking at the code of the Client class, after the updateAddress call we would expect a retry against the new server IP (a "Retrying connect to server ..." log) up to ipc.client.connect.max.retries.on.timeouts times. However, we never see those retry logs; the call just fails with the exception above. That setting is left at its default of 45 on all of our NMs.
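
To make the expected behaviour concrete, below is a minimal, self-contained sketch of the connect/retry loop we expected. This is not the actual org.apache.hadoop.ipc.Client source; the class, fields, and log strings are simplified for illustration.
{noformat}
// Sketch only: NOT the Hadoop source, just the behaviour we expected from it.
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ExpectedConnectRetrySketch {
    private InetSocketAddress server;                     // e.g. yarnrm2:8031
    private final int maxRetriesOnSocketTimeouts;         // ipc.client.connect.max.retries.on.timeouts (45 for us)
    private static final int CONNECT_TIMEOUT_MS = 20000;  // the "20000 millis timeout" from the logs

    public ExpectedConnectRetrySketch(String host, int port, int maxRetriesOnSocketTimeouts) {
        this.server = new InetSocketAddress(host, port);
        this.maxRetriesOnSocketTimeouts = maxRetriesOnSocketTimeouts;
    }

    public Socket setupConnection() throws IOException {
        int timeoutFailures = 0;
        while (true) {
            Socket socket = new Socket();
            try {
                socket.connect(server, CONNECT_TIMEOUT_MS);
                return socket;
            } catch (SocketTimeoutException toe) {
                socket.close();
                // Fresh lookup of the old host name; this is where the switch from
                // x.x.x.x to y.y.y.y should happen once the RM is recreated.
                if (updateAddress()) {
                    timeoutFailures = 0;                  // address changed: reset the counter
                }
                if (timeoutFailures >= maxRetriesOnSocketTimeouts) {
                    throw toe;                            // only now should the failover layer see the failure
                }
                System.out.println("Retrying connect to server: " + server
                        + ". Already tried " + timeoutFailures
                        + " time(s); maxRetries=" + maxRetriesOnSocketTimeouts);
                timeoutFailures++;
            }
        }
    }

    // Re-resolve the hostname and swap in the new address if it changed,
    // matching the "Address change detected" warning that we do see.
    private boolean updateAddress() {
        InetSocketAddress current = new InetSocketAddress(server.getHostName(), server.getPort());
        if (!current.equals(server)) {
            System.out.println("Address change detected. Old: " + server + " New: " + current);
            server = current;
            return true;
        }
        return false;
    }
}
{noformat}
If the timeout counter were reset on an address change like this, each NM should have reconnected to the new IP (y.y.y.y) instead of exhausting its 30 failover attempts and shutting down.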



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org