Posted to common-issues@hadoop.apache.org by "Kihwal Lee (JIRA)" <ji...@apache.org> on 2011/08/02 00:00:50 UTC
[jira] [Commented] (HADOOP-7472) RPC client should deal with the IP address changes
[ https://issues.apache.org/jira/browse/HADOOP-7472?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073815#comment-13073815 ]
Kihwal Lee commented on HADOOP-7472:
------------------------------------
The new patch is very close to what Suresh initially suggested. I decided not to do the check at response timeout because it changes the service semantics.
The following is the result of the testing described above.
{quote}
$ ./hadoop fs -ls
11/08/01 16:53:20 INFO ipc.Client: Retrying connect to server: testhost/127.0.0.1:9000. Already tried 0 time(s).
11/08/01 16:53:21 INFO ipc.Client: Retrying connect to server: testhost/127.0.0.1:9000. Already tried 1 time(s).
11/08/01 16:53:22 INFO ipc.Client: Retrying connect to server: testhost/127.0.0.1:9000. Already tried 2 time(s).
11/08/01 16:53:23 INFO ipc.Client: Retrying connect to server: testhost/127.0.0.1:9000. Already tried 3 time(s).
11/08/01 16:53:23 WARN ipc.Client: Address change detected. Host: testhost OldAddr: 127.0.0.1 NewAddr: testhost/10.xx.xx.xx
11/08/01 16:53:24 INFO ipc.Client: Retrying connect to server: testhost/10.xx.xx.xx:9000. Already tried 0 time(s).
11/08/01 16:53:25 INFO ipc.Client: Retrying connect to server: testhost/10.xx.xx.xx:9000. Already tried 1 time(s).
11/08/01 16:53:26 INFO ipc.Client: Retrying connect to server: testhost/10.xx.xx.xx:9000. Already tried 2 time(s).
11/08/01 16:53:27 INFO ipc.Client: Retrying connect to server: testhost/10.xx.xx.xx:9000. Already tried 3 time(s).
Found 1 items
-rw-r--r-- 1 kihwal supergroup 327499776 2011-07-22 11:30 /user/kihwal/ddd
{quote}
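The retry behavior in the log above can be sketched as follows. This is a minimal illustration of the re-resolve-on-retry idea, not the actual patch; the class and method names (e.g. updateAddress()) are hypothetical.

```java
import java.net.InetSocketAddress;

// Hypothetical sketch: on a failed connect, re-resolve the hostname and
// compare against the cached address; if it changed, retry against the
// new address. Not the actual HADOOP-7472 patch.
public class AddressChangeSketch {
    static InetSocketAddress server = new InetSocketAddress("localhost", 9000);

    // Re-resolve the hostname; returns true if the cached IP was stale.
    static boolean updateAddress() {
        InetSocketAddress current =
            new InetSocketAddress(server.getHostName(), server.getPort());
        if (!current.equals(server)) {
            System.out.println("Address change detected. Host: "
                + server.getHostName()
                + " OldAddr: " + server.getAddress()
                + " NewAddr: " + current);
            server = current;  // the connect/retry loop then restarts here
            return true;
        }
        return false;  // mapping unchanged; keep the normal retry schedule
    }

    public static void main(String[] args) {
        // With a stable DNS mapping, nothing changes and the old
        // retry behavior applies unmodified.
        System.out.println(updateAddress() ? "changed" : "unchanged");
    }
}
```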
The result of test-patch:
[exec] -1 overall.
[exec]
[exec] +1 @author. The patch does not contain any @author tags.
[exec]
[exec] -1 tests included. The patch doesn't appear to include any new or modified tests.
[exec] Please justify why no tests are needed for this patch.
[exec]
[exec] +1 javadoc. The javadoc tool did not generate any warning messages.
[exec]
[exec] +1 javac. The applied patch does not increase the total number of javac compiler warnings.
[exec]
[exec] +1 findbugs. The patch does not introduce any new Findbugs (version 1.3.9) warnings.
When the patch for MAPREDUCE-1824 becomes available, there might be an overlap where connect() is called.
> RPC client should deal with the IP address changes
> --------------------------------------------------
>
> Key: HADOOP-7472
> URL: https://issues.apache.org/jira/browse/HADOOP-7472
> Project: Hadoop Common
> Issue Type: Improvement
> Components: ipc
> Affects Versions: 0.20.205.0
> Reporter: Kihwal Lee
> Assignee: Kihwal Lee
> Priority: Minor
> Fix For: 0.20.205.0
>
> Attachments: addr_change_dfs-1.patch.txt, addr_change_dfs-2.patch.txt, addr_change_dfs.patch.txt
>
>
> The current RPC client implementation and the client-side callers assume that the hostname-address mappings of servers never change. The resolved address is stored in an immutable InetSocketAddress object above/outside RPC, and the reconnect logic in the RPC Connection implementation also trusts the resolved address that was passed down.
> If the NN suffers a failure that requires migration, it may be started on a different node with a different IP address. In this case, even if the name-address mapping is updated in DNS, the cluster is stuck trying the old address until the whole cluster is restarted.
> The RPC client-side should detect this situation and exit or try to recover.
> Updating the ConnectionId within the Client implementation may get the system working for the moment, but there is always a risk of the cached address:port unintentionally becoming connectable again. The real solution is to notify the upper layers of the address change so that they can re-resolve and retry, or to re-architect the system as discussed in HDFS-34.
> For the 0.20 line, some type of compromise may be acceptable. For example, raise a custom exception so that certain well-defined, high-impact upper layers can re-resolve and retry, while others will have to restart. For TRUNK, the HA work will most likely determine what needs to be done, so this Jira won't cover solutions for TRUNK.
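The caching behavior described in the report can be demonstrated in isolation: java.net.InetSocketAddress resolves its hostname exactly once, at construction, and the result is immutable, so a stale mapping persists until a new object is created. A minimal standalone demonstration (not Hadoop code; the helper name mappingUnchanged() is illustrative):

```java
import java.net.InetSocketAddress;

public class ResolveOnce {
    // Re-resolve the same host:port and compare with the cached object.
    // Returns true only if the hostname still maps to the same address.
    static boolean mappingUnchanged(InetSocketAddress addr) {
        InetSocketAddress fresh =
            new InetSocketAddress(addr.getHostName(), addr.getPort());
        return fresh.equals(addr);
    }

    public static void main(String[] args) {
        // DNS resolution happens here, once; 'addr' never refreshes it.
        InetSocketAddress addr = new InetSocketAddress("localhost", 9000);
        System.out.println(addr.getAddress());  // the cached IP

        // Even if DNS changes later, 'addr' keeps the old IP; the only
        // way to pick up a new mapping is to construct a new object.
        System.out.println(mappingUnchanged(addr));
    }
}
```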
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira