Posted to hdfs-dev@hadoop.apache.org by "Stanislav Antic (JIRA)" <ji...@apache.org> on 2015/11/10 12:08:11 UTC

[jira] [Created] (HDFS-9406) FSImage corruption after taking snapshot

Stanislav Antic created HDFS-9406:
-------------------------------------

             Summary: FSImage corruption after taking snapshot
                 Key: HDFS-9406
                 URL: https://issues.apache.org/jira/browse/HDFS-9406
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.6.0
         Environment: CentOS 6 amd64, CDH 5.4.4-1
2xCPU: Intel(R) Xeon(R) CPU E5-2640 v3
Memory: 32GB
Namenode blocks: ~700,000 blocks, no HA setup
            Reporter: Stanislav Antic


FSImage corruption occurred after HDFS snapshots were taken. The cluster was not in use at that time.

When the namenode restarted, it reported a NullPointerException:
{code}
15/11/07 10:01:15 INFO namenode.FileJournalManager: Recovering unfinalized segments in /tmp/fsimage_checker_5857/fsimage/current
15/11/07 10:01:15 INFO namenode.FSImage: No edit log streams selected.
15/11/07 10:01:18 INFO namenode.FSImageFormatPBINode: Loading 1370277 INodes.
15/11/07 10:01:27 ERROR namenode.NameNode: Failed to start namenode.
java.lang.NullPointerException
        at org.apache.hadoop.hdfs.server.namenode.INodeDirectory.addChild(INodeDirectory.java:531)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.addToParent(FSImageFormatPBINode.java:252)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatPBINode$Loader.loadINodeDirectorySection(FSImageFormatPBINode.java:202)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.loadInternal(FSImageFormatProtobuf.java:261)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormatProtobuf$Loader.load(FSImageFormatProtobuf.java:180)
        at org.apache.hadoop.hdfs.server.namenode.FSImageFormat$LoaderDelegator.load(FSImageFormat.java:226)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:929)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:913)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImageFile(FSImage.java:732)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:668)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:281)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1061)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:765)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:584)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:643)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:810)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:794)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1487)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1553)
15/11/07 10:01:27 INFO util.ExitUtil: Exiting with status 1
{code}
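
For illustration only, here is a minimal, hypothetical Java sketch of how this kind of failure can surface. It is NOT the actual Hadoop source; the class names mirror those in the stack trace but the types and logic are simplified stand-ins. The idea: the directory section of the fsimage still references a child inode id whose record is missing from the INode section, so the lookup returns null and the subsequent dereference inside addChild throws the NullPointerException at the same point the loader dies above.
{code}
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical, simplified sketch of a dangling child reference in the
 * fsimage directory section surfacing as a NullPointerException while the
 * image is loaded. Not the real Hadoop implementation.
 */
public class FsImageLoadSketch {

    /** Stand-in for an inode record loaded from the fsimage INode section. */
    static class INode {
        final long id;
        final String name;
        INode(long id, String name) { this.id = id; this.name = name; }
    }

    /** Stand-in for a directory inode; addChild dereferences the child. */
    static class INodeDirectory extends INode {
        final Map<String, INode> children = new HashMap<>();
        INodeDirectory(long id, String name) { super(id, name); }

        void addChild(INode child) {
            // If 'child' is null (its inode record is missing from the image),
            // this dereference fails, analogous to the NPE reported at
            // INodeDirectory.addChild during loadINodeDirectorySection.
            children.put(child.name, child);
        }
    }

    public static void main(String[] args) {
        // Inodes that were present in the INode section of the image.
        Map<Long, INode> inodeMap = new HashMap<>();
        INodeDirectory root = new INodeDirectory(1L, "/");
        inodeMap.put(1L, root);
        inodeMap.put(2L, new INode(2L, "a.txt"));
        // Note: inode 3 is referenced below but was never written to the image.

        // Directory section: parent id 1 lists children with ids 2 and 3.
        long[] childIds = {2L, 3L};
        for (long childId : childIds) {
            INode child = inodeMap.get(childId);   // returns null for id 3
            root.addChild(child);                  // throws NullPointerException
        }
    }
}
{code}
Under that assumption, the corruption would be in the image itself (an inconsistency between the directory section and the INode section, possibly introduced while the snapshot was taken), and the NPE on restart would only be the symptom.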

Corruption happened after "07.11.2015 00:15", and after that time ~9300 blocks were invalidated that should not have been.
After recovering the FSImage I discovered that around 9300 blocks were missing.

I have also attached the namenode logs from before and after the corruption occurred.


