Posted to common-issues@hadoop.apache.org by "Chris Nauroth (Jira)" <ji...@apache.org> on 2021/08/28 04:11:00 UTC

[jira] [Comment Edited] (HADOOP-15129) Datanode caches namenode DNS lookup failure and cannot startup

    [ https://issues.apache.org/jira/browse/HADOOP-15129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406116#comment-17406116 ] 

Chris Nauroth edited comment on HADOOP-15129 at 8/28/21, 4:10 AM:
------------------------------------------------------------------

This remains a problem for cloud infrastructure deployments, so I'd like to pick it up and see if we can get it completed.  I've sent a pull request with the following changes compared to the prior revision:

* Remove the older code that throws {{UnknownHostException}} outside the retry loop.  The earlier patch did the right thing by placing the throw inside the retry loop, but this leftover code perpetuated the problem of a host that stays unresolved forever.
* Make minor formatting changes in the test to resolve Checkstyle issues flagged in the last Yetus run.
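
The retry pattern described above can be sketched as follows.  This is an illustrative sketch, not the actual patch: {{resolveWithRetries}}, the retry count, and the one-second backoff are all hypothetical.

{code:java}
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

public class RetryResolve {
    /**
     * Attempt to resolve a host up to maxRetries times, constructing a new
     * InetSocketAddress on each attempt so that a fresh DNS lookup is
     * performed.  Throwing UnknownHostException *inside* the loop (rather
     * than once, before it) is what lets a temporary DNS failure recover.
     */
    public static InetSocketAddress resolveWithRetries(String host, int port, int maxRetries)
            throws UnknownHostException, InterruptedException {
        for (int attempt = 1; attempt <= maxRetries; attempt++) {
            // A new InetSocketAddress performs a fresh DNS lookup in its constructor.
            InetSocketAddress addr = new InetSocketAddress(host, port);
            if (!addr.isUnresolved()) {
                return addr;
            }
            if (attempt == maxRetries) {
                throw new UnknownHostException(
                        "Could not resolve " + host + " after " + maxRetries + " attempts");
            }
            Thread.sleep(1000L); // back off before the next lookup
        }
        throw new UnknownHostException(host); // unreachable
    }
}
{code}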

Additionally, I've confirmed testing of the patch in moderate-sized (200-node) Dataproc cluster deployments.

[~stevel@apache.org], [~arp], [~raviprak], [~ajayydv], and [~shahrs87], can we please work on getting this reviewed and committed?  I'm interested in merging this down to branch-3.3, branch-3.2, branch-2.10 and branch-2.9.  The patch as-is won't apply cleanly to 2.x.  If you approve, then I'll prepare separate pull requests for those branches.

Also, BTW, hello everyone.  :-)



> Datanode caches namenode DNS lookup failure and cannot startup
> --------------------------------------------------------------
>
>                 Key: HADOOP-15129
>                 URL: https://issues.apache.org/jira/browse/HADOOP-15129
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ipc
>    Affects Versions: 2.8.2
>         Environment: Google Compute Engine.
> I'm using Java 8, Debian 8, Hadoop 2.8.2.
>            Reporter: Karthik Palaniappan
>            Assignee: Chris Nauroth
>            Priority: Minor
>              Labels: pull-request-available
>         Attachments: HADOOP-15129.001.patch, HADOOP-15129.002.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> On startup, the Datanode creates an InetSocketAddress to register with each namenode. Though there are retries on connection failure throughout the stack, the same InetSocketAddress is reused.
> InetSocketAddress is an interesting class, because it resolves DNS names to IP addresses on construction, and the result is never refreshed. Hadoop re-creates an InetSocketAddress in some code paths, in case the remote IP has changed for a particular DNS name: https://issues.apache.org/jira/browse/HADOOP-7472.
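> This construction-time caching is easy to demonstrate. The sketch below is illustrative, and the host names are placeholders ({{.invalid}} is a reserved TLD that never resolves):
> {code:java}
> import java.net.InetSocketAddress;
>
> public class ResolveDemo {
>     public static void main(String[] args) {
>         // DNS resolution happens once, in the constructor, and is never refreshed.
>         InetSocketAddress bad = new InetSocketAddress("no-such-host.invalid", 8020);
>         System.out.println(bad.isUnresolved()); // true: the lookup failed at construction
>
>         // The only way to retry the lookup is to construct a new instance.
>         InetSocketAddress good = new InetSocketAddress("localhost", 8020);
>         System.out.println(good.isUnresolved()); // false: resolved at construction
>     }
> }
> {code}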
> Anyway, on startup, you can see the Datanode log: "Namenode...remains unresolved", referring to the fact that the DNS lookup failed.
> {code:java}
> 2017-11-02 16:01:55,115 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Refresh request received for nameservices: null
> 2017-11-02 16:01:55,153 WARN org.apache.hadoop.hdfs.DFSUtilClient: Namenode for null remains unresolved for ID null. Check your hdfs-site.xml file to ensure namenodes are configured properly.
> 2017-11-02 16:01:55,156 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Starting BPOfferServices for nameservices: <default>
> 2017-11-02 16:01:55,169 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Block pool <registering> (Datanode Uuid unassigned) service to cluster-32f5-m:8020 starting to offer service
> {code}
> The Datanode then proceeds to use this unresolved address, as it may work if the DN is configured to use a proxy. Since I'm not using a proxy, it forever prints out this message:
> {code:java}
> 2017-12-15 00:13:40,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:45,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:50,712 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:13:55,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
> 2017-12-15 00:14:00,713 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Problem connecting to server: cluster-32f5-m:8020
> {code}
> Unfortunately, the log doesn't contain the exception that triggered it, but the culprit is actually in IPC Client: https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/ipc/Client.java#L444.
> This line was introduced in https://issues.apache.org/jira/browse/HADOOP-487 to give a clear error message when somebody misspells an address.
> However, the fix in HADOOP-7472 doesn't apply here, because that code happens in Client#getConnection after the Connection is constructed.
> My proposed fix (will attach a patch) is to move this exception out of the constructor and into a place that will trigger HADOOP-7472's logic to re-resolve addresses. If the DNS failure was temporary, this allows the connection to succeed; if not, the connection fails once the IPC client exhausts its retries (by default, about 10 seconds' worth).
> I want to fix this in ipc client rather than just in Datanode startup, as this fixes temporary DNS issues for all of Hadoop.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
