Posted to hdfs-dev@hadoop.apache.org by "Tsz Wo (Nicholas), SZE (JIRA)" <ji...@apache.org> on 2013/01/30 03:57:13 UTC
[jira] [Resolved] (HDFS-4423) Checkpoint exception causes fatal
damage to fsimage.
[ https://issues.apache.org/jira/browse/HDFS-4423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tsz Wo (Nicholas), SZE resolved HDFS-4423.
------------------------------------------
Resolution: Fixed
Fix Version/s: 1.1.2
I have committed this. Thanks, Chris!
> Checkpoint exception causes fatal damage to fsimage.
> ----------------------------------------------------
>
> Key: HDFS-4423
> URL: https://issues.apache.org/jira/browse/HDFS-4423
> Project: Hadoop HDFS
> Issue Type: Bug
> Components: namenode
> Affects Versions: 1.0.4, 1.1.1
> Environment: CentOS 6.2
> Reporter: ChenFolin
> Assignee: Chris Nauroth
> Priority: Blocker
> Fix For: 1.1.2
>
> Attachments: HDFS-4423-branch-1.1.patch
>
> Original Estimate: 72h
> Remaining Estimate: 72h
>
> The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage (FSImage.java):
> {code}
> boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
>   ...
>   latestNameSD.read();
>   needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
>   LOG.info("Image file of size " + imageSize + " loaded in "
>       + (FSNamesystem.now() - startTime)/1000 + " seconds.");
>
>   // Load latest edits
>   if (latestNameCheckpointTime > latestEditsCheckpointTime)
>     // the image is already current, discard edits
>     needToSave |= true;
>   else // latestNameCheckpointTime == latestEditsCheckpointTime
>     needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);
>
>   return needToSave;
> }
> {code}
> In the normal checkpoint flow, latestNameCheckpointTime equals latestEditsCheckpointTime, so the "else" branch executes.
> The problem occurs when latestNameCheckpointTime > latestEditsCheckpointTime, which can happen as follows:
> The SecondaryNameNode starts a checkpoint,
> ...
> NameNode: during rollFSImage, the NameNode shuts down after writing latestNameCheckpointTime but before writing latestEditsCheckpointTime.
> NameNode restart: because latestNameCheckpointTime > latestEditsCheckpointTime, needToSave is set to true, but rootDir's nsCount (the number of files in the cluster) is not updated, because that update runs only in loadFSEdits, via "FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota()". saveNamespace then writes the file count to fsimage with the default value "1".
> The next loadFSImage will then fail.
> Maybe this will work:
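The crash window described above can be sketched with a toy model. This is only an illustration under stated assumptions, not the actual Hadoop code: StorageDir, rollCheckpointTime, and branchOnRestart are hypothetical names, and the real rollFSImage does much more than write two timestamps. The sketch only shows how a shutdown between the two writes leaves the image checkpoint time ahead of the edits checkpoint time, which selects the "discard edits" branch on restart.

```java
public class CheckpointCrashWindow {
    // Hypothetical stand-in for the checkpoint time persisted in one storage directory.
    static final class StorageDir {
        long checkpointTime;

        StorageDir(long checkpointTime) {
            this.checkpointTime = checkpointTime;
        }
    }

    // Writes the new checkpoint time to the image directory first, then to the
    // edits directory. When crashBetween is true, the method returns after the
    // first write to simulate a shutdown between the two writes.
    static void rollCheckpointTime(StorageDir nameSD, StorageDir editsSD,
                                   long newTime, boolean crashBetween) {
        nameSD.checkpointTime = newTime;
        if (crashBetween) {
            return; // simulated NameNode shutdown
        }
        editsSD.checkpointTime = newTime;
    }

    // Mirrors the comparison in loadFSImage: a newer image time means the
    // edits are discarded, otherwise they are loaded.
    static String branchOnRestart(StorageDir nameSD, StorageDir editsSD) {
        return nameSD.checkpointTime > editsSD.checkpointTime
                ? "discard-edits"
                : "load-edits";
    }

    public static void main(String[] args) {
        StorageDir nameSD = new StorageDir(100L);
        StorageDir editsSD = new StorageDir(100L);

        rollCheckpointTime(nameSD, editsSD, 200L, true);
        System.out.println("after interrupted roll: " + branchOnRestart(nameSD, editsSD));

        rollCheckpointTime(nameSD, editsSD, 300L, false);
        System.out.println("after complete roll: " + branchOnRestart(nameSD, editsSD));
    }
}
```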
> {code}
> boolean loadFSImage(MetaRecoveryContext recovery) throws IOException {
>   ...
>   latestNameSD.read();
>   needToSave |= loadFSImage(getImageFile(latestNameSD, NameNodeFile.IMAGE));
>   LOG.info("Image file of size " + imageSize + " loaded in "
>       + (FSNamesystem.now() - startTime)/1000 + " seconds.");
>
>   // Load latest edits
>   if (latestNameCheckpointTime > latestEditsCheckpointTime) {
>     // the image is already current, discard edits
>     needToSave |= true;
>     FSNamesystem.getFSNamesystem().dir.updateCountForINodeWithQuota();
>   }
>   else // latestNameCheckpointTime == latestEditsCheckpointTime
>     needToSave |= (loadFSEdits(latestEditsSD, recovery) > 0);
>
>   return needToSave;
> }
> {code}
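The effect of the missing recount, and of the change proposed in the description, can be illustrated with a small self-contained model. This is a sketch under assumptions, not the committed patch: DirNode, updateCount, and load are hypothetical stand-ins for the namespace tree, updateCountForINodeWithQuota(), and loadFSImage respectively. It only demonstrates that when the "discard edits" branch skips the recount, the root count stays at its default of 1, and that recounting in that branch restores the correct value.

```java
import java.util.ArrayList;
import java.util.List;

public class QuotaCountSketch {
    // Hypothetical namespace node; nsCount starts at the default value 1.
    static final class DirNode {
        final List<DirNode> children = new ArrayList<>();
        long nsCount = 1;

        DirNode add(DirNode child) {
            children.add(child);
            return this;
        }
    }

    // Stand-in for the recount: sets each node's nsCount to the size of its
    // subtree (the node itself plus all descendants) and returns that size.
    static long updateCount(DirNode node) {
        long total = 1;
        for (DirNode child : node.children) {
            total += updateCount(child);
        }
        node.nsCount = total;
        return total;
    }

    // Models the branch in loadFSImage. Without the fix, the "image is newer"
    // branch never recounts; with the fix it does. The else branch always
    // recounts, standing in for the update performed during loadFSEdits.
    static long load(DirNode root, long imageTime, long editsTime, boolean applyFix) {
        if (imageTime > editsTime) {
            if (applyFix) {
                updateCount(root);
            }
        } else {
            updateCount(root);
        }
        return root.nsCount; // the value that a later save would persist
    }

    // Builds a four-node tree: root, two children, one grandchild.
    static DirNode sampleTree() {
        DirNode root = new DirNode();
        root.add(new DirNode().add(new DirNode()));
        root.add(new DirNode());
        return root;
    }

    public static void main(String[] args) {
        System.out.println("normal flow:      " + load(sampleTree(), 100L, 100L, false));
        System.out.println("crash, no fix:    " + load(sampleTree(), 200L, 100L, false));
        System.out.println("crash, with fix:  " + load(sampleTree(), 200L, 100L, true));
    }
}
```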
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira