Posted to hdfs-issues@hadoop.apache.org by "Jan Van Besien (Jira)" <ji...@apache.org> on 2022/07/25 11:25:00 UTC
[jira] [Comment Edited] (HDFS-4957) NameNode failover should not fail because a DNS entry for a quorum node cannot be resolved

    [ https://issues.apache.org/jira/browse/HDFS-4957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17570770#comment-17570770 ] 

Jan Van Besien edited comment on HDFS-4957 at 7/25/22 11:24 AM:
----------------------------------------------------------------

I am also facing this problem (in a Hadoop deployment on Kubernetes). It is not limited to namenode failover either: simply restarting a namenode fails in the same way (cf. the problem described in HDFS-10719).

In contrast to what [~jzhuge] wrote earlier, can't the solution simply be:
 * when formatting a namenode, all journal nodes need to be available (cf. HDFS-4210)
 * in all other operations, including namenode failover, only a majority of journal nodes needs to be available

That sounds reasonably straightforward to implement?
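
To make the proposed rule concrete, here is a rough sketch in Java. It is not actual Hadoop code: the class, the Operation enum and the method names are all made up for illustration, and "available" is approximated by "the host name resolves", which is only the part of availability this issue is about.

```java
import java.net.InetSocketAddress;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not part of Hadoop: decides whether an operation may
// proceed given how many journal node addresses could be resolved.
class QuorumCheckSketch {
    // Hypothetical operation kinds: formatting versus everything else.
    enum Operation { FORMAT, OTHER }

    // Smallest number of nodes that forms a majority of n nodes.
    static int majority(int n) {
        return n / 2 + 1;
    }

    // Counts the addresses whose host name was resolved.
    static int countResolved(List<InetSocketAddress> addrs) {
        int resolved = 0;
        for (InetSocketAddress addr : addrs) {
            if (!addr.isUnresolved()) {
                resolved++;
            }
        }
        return resolved;
    }

    // FORMAT needs every journal node; any other operation needs a majority.
    static boolean mayProceed(Operation op, int total, int resolved) {
        int required = (op == Operation.FORMAT) ? total : majority(total);
        return resolved >= required;
    }

    public static void main(String[] args) {
        // One node without a DNS entry (simulated with createUnresolved, so no
        // lookup happens) and two nodes given as literal IP addresses.
        List<InetSocketAddress> journalNodes = Arrays.asList(
            InetSocketAddress.createUnresolved("hadoop-mm", 8485),
            new InetSocketAddress("127.0.0.1", 8485),
            new InetSocketAddress("127.0.0.2", 8485));

        int total = journalNodes.size();
        int resolved = countResolved(journalNodes);
        System.out.println("format may proceed: "
            + mayProceed(Operation.FORMAT, total, resolved));
        System.out.println("failover may proceed: "
            + mayProceed(Operation.OTHER, total, resolved));
    }
}
```

With three journal nodes of which one cannot be resolved, this check would refuse a format but allow a failover or restart, which is the behaviour I am asking for.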

I understand there is also a problem with journal nodes not immediately rejoining the quorum after a journal node restart (cf. HDFS-3867), but that seems to be a separate problem that we should not take into account here?


was (Author: janvanbesien):
I am also faced by this problem (in a Hadoop deployment on Kubernetes).

In contrast to what [~jzhuge] writes earlier, can't the solution simply be:
 * when formatting a namenode, all journal nodes need to be available (cfr HDFS-4210)
 * in all other operations, including namenode failover, only a majority of journal nodes need to be available

That sounds reasonably straightforward to implement?

I understand there is also a problem with journal nodes not immediately rejoining the quorum after a journal node restart (cfr HDFS-3867), but that seems to be a separate problem that we should not take into account here?

> NameNode failover should not fail because a DNS entry for a quorum node cannot be resolved
> ------------------------------------------------------------------------------------------
>
>                 Key: HDFS-4957
>                 URL: https://issues.apache.org/jira/browse/HDFS-4957
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: qjm
>    Affects Versions: 2.3.0, 2.6.0
>            Reporter: Colin McCabe
>            Assignee: John Zhuge
>            Priority: Major
>
> When a StandbyNameNode is becoming active, we should not bail out because a DNS entry for a quorum node cannot be resolved.  Currently it does fail in this scenario, with a message like this:
> {code}
> 2013-07-03 21:28:40,576 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Starting services required for active state
> 2013-07-03 21:28:40,579 FATAL org.apache.hadoop.hdfs.server.namenode.NameNode: Error encountered requiring NN shutdown. Shutting down immediately.
> java.lang.IllegalArgumentException: Unable to construct journal, qjournal://hadoop-mm:8485;hadoop-nn-0:8485;hadoop-nn-1:8485/hadoop
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.createJournal(FSEditLog.java:1254)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournals(FSEditLog.java:226)
> at org.apache.hadoop.hdfs.server.namenode.FSEditLog.initJournalsForWrite(FSEditLog.java:193)
> at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startActiveServices(FSNamesystem.java:722)
> <etc>
> {code}
> reported by Matt Bookman


