Posted to common-issues@hadoop.apache.org by "zhenzhao wang (Jira)" <ji...@apache.org> on 2020/08/08 02:15:00 UTC

[jira] [Commented] (HADOOP-17068) client fails forever when namenode ipaddr changed

    [ https://issues.apache.org/jira/browse/HADOOP-17068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17173544#comment-17173544 ] 

zhenzhao wang commented on HADOOP-17068:
----------------------------------------

We have seen this problem multiple times too. One workaround we have been using for years is to increase dfs.client.failover.connection.retries.on.timeouts to 3. It helps on older HDFS client versions.
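As a rough illustration only (my own sketch, not part of the issue; the cluster URI is hypothetical), the setting can also be raised programmatically on the client-side Configuration instead of in hdfs-site.xml:

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class RetryWorkaround {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Allow a few extra failover retries on connect timeouts so the client
    // gets another chance after the namenode address has changed.
    conf.setInt("dfs.client.failover.connection.retries.on.timeouts", 3);
    // Hypothetical HA cluster URI, only for illustration.
    FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf);
    System.out.println(fs.getUri());
  }
}
{code}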

> client fails forever when namenode ipaddr changed
> -------------------------------------------------
>
>                 Key: HADOOP-17068
>                 URL: https://issues.apache.org/jira/browse/HADOOP-17068
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: hdfs-client
>            Reporter: Sean Chow
>            Assignee: Sean Chow
>            Priority: Major
>             Fix For: 3.4.0
>
>         Attachments: HADOOP-17068.001.patch, HDFS-15390.01.patch
>
>
> For a machine replacement, I replaced my standby namenode with a new IP address while keeping the same hostname, and updated the clients' hosts file so the hostname resolves correctly.
> When I run a failover to transition to the new namenode (let's say nn2), the client fails to read or write forever until it is restarted.
> That leaves the YARN NodeManagers in an unhealthy state, and even new tasks hit this exception, until all NodeManagers are restarted.
>  
> {code:java}
> 20/06/02 15:12:25 WARN ipc.Client: Address change detected. Old: nn2-192-168-1-100/192.168.1.100:9000 New: nn2-192-168-1-100/192.168.1.200:9000
> 20/06/02 15:12:25 DEBUG ipc.Client: closing ipc connection to nn2-192-168-1-100/192.168.1.200:9000: Connection refused
> java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:368)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1517)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1440)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1401)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232)
>         at com.sun.proxy.$Proxy9.addBlock(Unknown Source)
>         at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:399)
>         at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
>         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>         at java.lang.reflect.Method.invoke(Method.java:498)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:193)
>         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> {code}
>  
> We can see the client logs {{Address change detected}}, but it still fails. I found out that is because when {{updateAddress()}} returns true, {{handleConnectionFailure()}} throws an exception that breaks the next retry with the right IP address (one possible fix is sketched after the snippet below).
> Client.java: setupConnection()
> {code:java}
>         } catch (ConnectTimeoutException toe) {
>           /* Check for an address change and update the local reference.
>            * Reset the failure counter if the address was changed
>            */
>           if (updateAddress()) {
>             timeoutFailures = ioFailures = 0;
>           }
>           handleConnectionTimeout(timeoutFailures++,
>               maxRetriesOnSocketTimeouts, toe);
>         } catch (IOException ie) {
>           if (updateAddress()) {
>             timeoutFailures = ioFailures = 0;
>           }
> // Because the namenode IP changed in updateAddress(), the old namenode address cannot be reached anymore.
> // handleConnectionFailure() will throw an exception, so the next retry never gets a chance to use the right address updated in updateAddress().
>           handleConnectionFailure(ioFailures++, ie);
>         }
> {code}
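> A minimal sketch of one possible reshaping of the IOException catch block (an illustration only, not necessarily what the attached HADOOP-17068.001.patch does; it assumes the enclosing setupConnection() while-loop keeps retrying): when {{updateAddress()}} reports a change, retry immediately so the next attempt uses the new address instead of escalating the failure for the stale one.
> {code:java}
>         } catch (IOException ie) {
>           if (updateAddress()) {
>             // The resolved namenode address changed. Reset the failure
>             // counters and go back to the top of the retry loop so the
>             // next connect attempt uses the updated address instead of
>             // failing hard on the stale one.
>             timeoutFailures = ioFailures = 0;
>             continue;
>           }
>           handleConnectionFailure(ioFailures++, ie);
>         }
> {code}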
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: common-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: common-issues-help@hadoop.apache.org