Posted to hdfs-dev@hadoop.apache.org by "He Xiaoqiao (JIRA)" <ji...@apache.org> on 2015/09/13 08:07:45 UTC

[jira] [Created] (HDFS-9068) SBN checkpoint could not work after the only name directory recovery from failure

He Xiaoqiao created HDFS-9068:
---------------------------------

             Summary: SBN checkpoint could not work after the only name directory recovery from failure
                 Key: HDFS-9068
                 URL: https://issues.apache.org/jira/browse/HDFS-9068
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
    Affects Versions: 2.4.1
            Reporter: He Xiaoqiao


The SBN (Standby NameNode) periodically checkpoints to {{dfs.namenode.name.dir}}, but the checkpointer stops working when {{dfs.namenode.name.dir}} is configured with only one directory and the disk on which that directory is located recovers from a failure.
The affected class is org.apache.hadoop.hdfs.server.namenode.FSImage (FSImage.java):
{code:title=org.apache.hadoop.hdfs.server.namenode.FSImage.java|borderStyle=solid}
@Override
public void run() {
  try {
    saveFSImage(context, sd, nnf);
  } catch (SaveNamespaceCancelledException snce) {
    LOG.info("Cancelled image saving for " + sd.getRoot() +
        ": " + snce.getMessage());
    // don't report an error on the storage dir!
  } catch (Throwable t) {
    LOG.error("Unable to save image for " + sd.getRoot(), t);
    context.reportErrorOnStorageDirectory(sd);
  }
}
{code}
When {{saveFSImage(context, sd, nnf)}} fails because the storage is unavailable or has failed, sd is added to errorSDs by {{context.reportErrorOnStorageDirectory(sd)}} and is never used again, even after the disk recovers from the failure. As a result, the JournalNode accumulates a large number of editlog files because checkpointing keeps failing, and a NameNode restart takes a long time.
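
The behaviour described above can be illustrated with a small standalone model. This is only a sketch, not Hadoop code: the class CheckpointDirSketch, its Dir type, the retryFailedDirs switch and the directory path are all hypothetical names introduced for illustration. It models a single name directory that is moved to an error set when a save fails, and shows that without some retry step the directory stays excluded even after the disk is healthy again.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for storage directory bookkeeping; NOT Hadoop code.
class CheckpointDirSketch {
    /** A name directory whose backing disk is either healthy or failed. */
    static final class Dir {
        final String root;
        boolean diskHealthy = true;
        Dir(String root) { this.root = root; }
    }

    private final List<Dir> activeDirs = new ArrayList<>();
    // Plays the role of errorSDs from the report: directories that failed a save.
    private final Set<Dir> errorDirs = new LinkedHashSet<>();
    // Assumption for illustration only: whether failed directories are re-checked.
    private final boolean retryFailedDirs;

    CheckpointDirSketch(List<Dir> dirs, boolean retryFailedDirs) {
        this.activeDirs.addAll(dirs);
        this.retryFailedDirs = retryFailedDirs;
    }

    /** Runs one checkpoint and returns how many directories received the image. */
    int checkpoint() {
        if (retryFailedDirs) {
            // Move directories whose disk is healthy again back to the active list.
            for (Iterator<Dir> it = errorDirs.iterator(); it.hasNext();) {
                Dir d = it.next();
                if (d.diskHealthy) {
                    it.remove();
                    activeDirs.add(d);
                }
            }
        }
        int saved = 0;
        for (Iterator<Dir> it = activeDirs.iterator(); it.hasNext();) {
            Dir d = it.next();
            if (d.diskHealthy) {
                saved++;              // the image save succeeds
            } else {
                it.remove();          // the save fails: stop using this directory
                errorDirs.add(d);
            }
        }
        return saved;
    }

    public static void main(String[] args) {
        for (boolean retry : new boolean[] {false, true}) {
            Dir only = new Dir("/hypothetical/name/dir");
            CheckpointDirSketch sbn =
                new CheckpointDirSketch(Arrays.asList(only), retry);
            only.diskHealthy = false;             // the disk fails
            int duringFailure = sbn.checkpoint();
            only.diskHealthy = true;              // the disk recovers
            int afterRecovery = sbn.checkpoint();
            System.out.println("retry=" + retry
                + " duringFailure=" + duringFailure
                + " afterRecovery=" + afterRecovery);
        }
    }
}
```

With retryFailedDirs set to false, the second checkpoint still writes to no directory, which matches the symptom in this report. The retry branch only shows one possible way the model could recover; it is not a statement about how the issue is or should be fixed in FSImage.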



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)