Posted to hdfs-dev@hadoop.apache.org by "Andrew Timonin (Jira)" <ji...@apache.org> on 2019/12/13 11:29:00 UTC

[jira] [Created] (HDFS-15060) namenode doesn't retry JN when other JN goes down

Andrew Timonin created HDFS-15060:
-------------------------------------

             Summary: namenode doesn't retry JN when other JN goes down
                 Key: HDFS-15060
                 URL: https://issues.apache.org/jira/browse/HDFS-15060
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 3.1.1
            Reporter: Andrew Timonin


When I upgrade Hadoop to a new version (following, for example, [https://hadoop.apache.org/docs/r3.1.3/hadoop-project-dist/hadoop-hdfs/HdfsRollingUpgrade.html#namenode_-rollingUpgrade] as the instructions), I run into the following situation:

I upgrade the JNs one by one:
 # Upgrade and restart JN1
 # The NN sees the JN go offline: WARN client.QuorumJournalManager: Remote journal 10.73.67.132:8485 failed to write txns 1205396-1205399. Will try to write to this JN again after the next log roll.
 # No log roll happens for some time (at least 1 min)
 # Upgrade and restart JN2
 # The NN sees the same thing again: WARN client.QuorumJournalManager: Remote journal 10.73.67.68:8485 failed to write txns 1205799-1205800. Will try to write to this JN again after the next log roll.
 # BUT! At this point the NN has no JN quorum: FATAL namenode.FSEditLog: Error: flush failed for required journal (JournalAndStream(mgr=QJM to [10.73.67.212:8485, 10.73.67.132:8485, 10.73.67.68:8485], stream=QuorumOutputStream starting at txid 1205246)) although JN1 is already back online

It looks like the NN should retry JNs that are marked as offline before giving up.
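
To make the sequence above easier to reason about, here is a minimal toy model in Python. It is not Hadoop code: the names ToyJournal, ToyQuorumWriter and rolling_restart, as well as the jnN:8485 addresses, are hypothetical and exist only for illustration. The one behavior it assumes is what the WARN message states: after a failed write, the NN does not write to that JN again until the next log roll. It deliberately ignores how a real JN catches up on the transactions it missed.

```python
from dataclasses import dataclass


@dataclass
class ToyJournal:
    """Hypothetical stand-in for one JN as seen by the NN (not a Hadoop class)."""
    addr: str
    online: bool = True
    out_of_sync: bool = False  # set after a failed write, cleared on log roll


@dataclass
class ToyQuorumWriter:
    """Hypothetical stand-in for the NN side of quorum writes."""
    journals: list

    def majority(self) -> int:
        return len(self.journals) // 2 + 1

    def write(self) -> bool:
        """Return True if a majority of journals acknowledged the write."""
        acks = 0
        for jn in self.journals:
            if jn.out_of_sync:
                continue  # skipped until the next log roll, even if online again
            if jn.online:
                acks += 1
            else:
                jn.out_of_sync = True  # "Will try ... again after the next log roll"
        return acks >= self.majority()

    def roll_log(self) -> None:
        """Assumed effect of a log roll: every JN becomes eligible again."""
        for jn in self.journals:
            jn.out_of_sync = False


def rolling_restart(roll_between: bool) -> bool:
    """Replay the steps from the report; return the result of the last write."""
    jn1, jn2, jn3 = (ToyJournal(a) for a in ("jn1:8485", "jn2:8485", "jn3:8485"))
    nn = ToyQuorumWriter([jn1, jn2, jn3])

    jn1.online = False       # restart JN1
    assert nn.write()        # 2 of 3 acknowledge; JN1 is marked out of sync
    jn1.online = True        # JN1 is back, but the NN has not rolled the log

    if roll_between:
        nn.roll_log()

    jn2.online = False       # restart JN2
    return nn.write()        # only JN3 (plus JN1, if eligible) can acknowledge


print(rolling_restart(roll_between=False))
print(rolling_restart(roll_between=True))
```

In this sketch the final write fails when no log roll happens between the two restarts, because only one of three journals is both online and eligible, and it succeeds when a roll happens in between. That matches the report: the quorum is lost even though two JNs are actually reachable.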



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
