You are viewing a plain text version of this content. The canonical link for it is here.
Posted to hdfs-dev@hadoop.apache.org by "Xiao Chen (JIRA)" <ji...@apache.org> on 2017/08/28 22:57:00 UTC

[jira] [Created] (HDFS-12369) Edit log corruption due to hard lease recovery of not-closed file

Xiao Chen created HDFS-12369:
--------------------------------

             Summary: Edit log corruption due to hard lease recovery of not-closed file
                 Key: HDFS-12369
                 URL: https://issues.apache.org/jira/browse/HDFS-12369
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: namenode
            Reporter: Xiao Chen
            Assignee: Xiao Chen


HDFS-6257 and HDFS-7707 worked hard to prevent corruption from combinations of client operations.

Recently, we have observed NN not able to start with the following exception:
{noformat}
2017-08-17 14:32:18,418 ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: Failed to start namenode.
java.io.FileNotFoundException: File does not exist: /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:66)
        at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:56)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.applyEditLogOp(FSEditLogLoader.java:429)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:232)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:897)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:750)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:318)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1125)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:789)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)
{noformat}

Quoting a nicely analysed edits:
{quote}
In the edits logged about 1 hour later, we see this failing OP_CLOSE. The sequence in the edits shows the file going through:

  OPEN
  ADD_BLOCK
  CLOSE
  ADD_BLOCK # perhaps this was an append
  DELETE
  (about 1 hour later) CLOSE

It is interesting that there was no CLOSE logged before the delete.
{quote}

Grepping that file name, it turns out the close was triggered by lease reaching hard limit.
{noformat}
2017-08-16 15:05:45,927 INFO org.apache.hadoop.hdfs.server.namenode.FSNamesystem: 
  Recovering [Lease.  Holder: DFSClient_NONMAPREDUCE_-1997177597_28, pending creates: 75], 
  src=/home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M
2017-08-16 15:05:45,927 WARN org.apache.hadoop.hdfs.StateChange: BLOCK* 
  internalReleaseLease: All existing blocks are COMPLETE, lease removed, file 
  /home/Events/CancellationSurvey_MySQL/2015/12/31/.part-00000.9nlJ3M closed.
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-dev-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-dev-help@hadoop.apache.org