You are viewing a plain text version of this content. The canonical link for it is here.
Posted to hdfs-issues@hadoop.apache.org by "yuyanlei (Jira)" <ji...@apache.org> on 2022/09/15 03:58:00 UTC

[jira] [Commented] (HDFS-13596) NN restart fails after RollingUpgrade from 2.x to 3.x

    [ https://issues.apache.org/jira/browse/HDFS-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17605091#comment-17605091 ] 

yuyanlei commented on HDFS-13596:
---------------------------------

hi [Hui Fei|https://issues.apache.org/jira/secure/ViewProfile.jspa?name=ferhui] ,I had the same problem recently,

I test Hadoop2.7.2 rolling upgrade Hadoop3.3.4 after the downgrade test,

Error reported when degrading Namenode startup:ArrayIndexOutOfBoundsException:536870913,

So I merged COMMIT in Hadoop2.7.2:Fix potential FSImage corruption(8a41edb089fbdedc5e7d9a2aeec63d126afea49f),

However, the startup still failed with an error:NullPointerException,

The Owner and Group of the HDFS directory are null(The number of test data blocks is 68905183),

Later I found Hadoop2.7.2:

enum PermissionStatusFormat implements LongBitFormat.Enum {
    MODE(null, 16),
    GROUP(MODE.BITS, 24),
    USER(GROUP.BITS, 24);

And Hadoop3.3.4 and Fix potential FSImage corruption :

enum PermissionStatusFormat implements LongBitFormat.Enum {
    MODE(null, 16),
    GROUP(MODE.BITS, 25),
    USER(GROUP.BITS, 23);

After I changed the GROUP and USER in Hadoop2.7.2 to 24, 24, the downgrade was successful。

This commit:https://github.com/lucasaytt/hadoop/commit/8a41edb089fbdedc5e7d9a2aeec63d126afea49f(Fix potential FSImage corruption),Can it be merged into Hadoop2.7.2?   Is there any hidden danger?

> NN restart fails after RollingUpgrade from 2.x to 3.x
> -----------------------------------------------------
>
>                 Key: HDFS-13596
>                 URL: https://issues.apache.org/jira/browse/HDFS-13596
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs
>            Reporter: Hanisha Koneru
>            Assignee: Hui Fei
>            Priority: Blocker
>             Fix For: 3.3.0, 3.2.1, 3.1.3
>
>         Attachments: HDFS-13596.001.patch, HDFS-13596.002.patch, HDFS-13596.003.patch, HDFS-13596.004.patch, HDFS-13596.005.patch, HDFS-13596.006.patch, HDFS-13596.007.patch, HDFS-13596.008.patch, HDFS-13596.009.patch, HDFS-13596.010.patch
>
>
> After rollingUpgrade NN from 2.x and 3.x, if the NN is restarted, it fails while replaying edit logs.
>  * After NN is started with rollingUpgrade, the layoutVersion written to editLogs (before finalizing the upgrade) is the pre-upgrade layout version (so as to support downgrade).
>  * When writing transactions to log, NN writes as per the current layout version. In 3.x, erasureCoding bits are added to the editLog transactions.
>  * So any edit log written after the upgrade and before finalizing the upgrade will have the old layout version but the new format of transactions.
>  * When NN is restarted and the edit logs are replayed, the NN reads the old layout version from the editLog file. When parsing the transactions, it assumes that the transactions are also from the previous layout and hence skips parsing the erasureCoding bits.
>  * This cascades into reading the wrong set of bits for other fields and leads to NN shutting down.
> Sample error output:
> {code:java}
> java.lang.IllegalArgumentException: Invalid clientId - length is 0 expected length 16
>  at com.google.common.base.Preconditions.checkArgument(Preconditions.java:88)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:74)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntry.<init>(RetryCache.java:86)
>  at org.apache.hadoop.ipc.RetryCache$CacheEntryWithPayload.<init>(RetryCache.java:163)
>  at org.apache.hadoop.ipc.RetryCache.addCacheEntryWithPayload(RetryCache.java:322)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.addCacheEntryWithPayload(FSNamesystem.java:960)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:397)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:249)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> 2018-05-17 19:10:06,522 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Encountered exception loading fsimage
> java.io.IOException: java.lang.IllegalStateException: Cannot skip to less than the current value (=16389), where newValue=16388
>  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1945)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:298)
>  at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:158)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:888)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:745)
>  at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:323)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1086)
>  at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:714)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:632)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:694)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:937)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:910)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1643)
>  at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1710)
> Caused by: java.lang.IllegalStateException: Cannot skip to less than the current value (=16389), where newValue=16388
>  at org.apache.hadoop.util.SequentialNumber.skipTo(SequentialNumber.java:58)
>  at org.apache.hadoop.hdfs.server.namenode.FSDirectory.resetLastInodeId(FSDirectory.java:1943)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org