
Posted to hdfs-dev@hadoop.apache.org by "Thanh Do (JIRA)" <ji...@apache.org> on 2010/09/08 00:28:32 UTC

[jira] Created: (HDFS-1382) A transient failure with edits log and a corrupted fstime together could lead to a data loss

A transient failure with edits log and a corrupted fstime together could lead to a data loss
--------------------------------------------------------------------------------------------

                 Key: HDFS-1382
                 URL: https://issues.apache.org/jira/browse/HDFS-1382
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: name-node
            Reporter: Thanh Do


We experienced a data loss caused by two failures occurring together.
One is a transient disk failure affecting the edits log; the other is a corrupted fstime file.
 
Here are the details:
 
1. The NameNode has two edits directories (say edit0 and edit1).
 
2. During an update to edit0, a transient disk failure occurs.
The NameNode bumps the fstime, marks edit0 as stale,
and continues working with edit1 only.
 
3. The NameNode is shut down. Unluckily, the fstime in edit0
is now corrupted. Hence, during NameNode startup, the stale log in edit0
is replayed instead of the up-to-date log in edit1, and the updates
recorded only in edit1 are lost.
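
The three steps above can be sketched as a small simulation. This is only a
simplified model, not NameNode code: the EditsDir class, the replay function,
the timestamp values, and the rule "replay the directory with the largest
fstime" are all assumptions made for illustration. In particular, it assumes
the corrupted fstime in edit0 happens to read back as a newer value than the
one in edit1.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class EditsDir:
    """Hypothetical stand-in for one edits directory."""
    name: str
    fstime: int        # timestamp used to decide which directory is newest
    edits: List[str]   # operations recorded in this directory's edits log


def replay(dirs: List[EditsDir]) -> Tuple[str, List[str]]:
    """Assumed startup rule: trust the directory with the largest fstime."""
    chosen = max(dirs, key=lambda d: d.fstime)
    return chosen.name, list(chosen.edits)


def simulate():
    # Step 1: two edits directories, initially in sync.
    edit0 = EditsDir("edit0", fstime=100, edits=["mkdir /a"])
    edit1 = EditsDir("edit1", fstime=100, edits=["mkdir /a"])

    # Step 2: a transient failure hits edit0. The fstime is bumped only in
    # the healthy directory, so edit0 is now stale; new edits go to edit1.
    edit1.fstime = 101
    edit1.edits.append("create /a/f")
    healthy = replay([edit0, edit1])

    # Step 3: the fstime in edit0 is corrupted and (by assumption) reads
    # back as a larger value, so the stale log wins at startup.
    edit0.fstime = 9999
    corrupted = replay([edit0, edit1])
    return healthy, corrupted


if __name__ == "__main__":
    healthy, corrupted = simulate()
    print("without corruption:", healthy)
    print("with corrupted fstime:", corrupted)
```

In this model the second replay picks edit0, so the "create /a/f" operation,
which exists only in edit1, never makes it back into the namespace.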

This bug was found by our Failure Testing Service framework:
http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and 
Haryadi Gunawi (haryadi@eecs.berkeley.edu)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.