Posted to hdfs-dev@hadoop.apache.org by "Allen Wittenauer (JIRA)" <ji...@apache.org> on 2014/07/30 22:24:40 UTC

[jira] [Resolved] (HDFS-1225) Block lost when primary crashes in recoverBlock

     [ https://issues.apache.org/jira/browse/HDFS-1225?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer resolved HDFS-1225.
------------------------------------

    Resolution: Incomplete

Append was overhauled in 2.x. Closing.

> Block lost when primary crashes in recoverBlock
> -----------------------------------------------
>
>                 Key: HDFS-1225
>                 URL: https://issues.apache.org/jira/browse/HDFS-1225
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>    Affects Versions: 0.20-append
>            Reporter: Thanh Do
>
> - Summary: Block is lost if the primary datanode crashes in the middle of tryUpdateBlock.
>  
> - Setup:
> # available datanode = 2
> # replica = 2
> # disks / datanode = 1
> # failures = 1
> # failure type = crash
> # when/where failure happens = (see below)
>  
> - Details:
> Suppose we have 2 datanodes, dn1 and dn2, and dn1 is the primary.
> The client appends to blk_X_1001 and a crash happens during dn1.recoverBlock,
> at the point after blk_X_1001.meta is renamed to blk_X_1001.meta_tmp1002.
> **Interesting**: in this case, block X is eventually lost. Why?
> After dn1.recoverBlock crashes at the rename, what is left in dn1's current directory is:
> 1) blk_X
> 2) blk_X_1001.meta_tmp1002
> ==> This is an invalid block, because it has no meta file associated with it.
> dn2 (after dn1 crashes) now contains:
> 1) blk_X
> 2) blk_X_1002.meta
> (Note that the rename at dn2 has completed, because dn1 called dn2.updateBlock() before
> calling its own updateBlock().)
> But namenode.commitBlockSynchronization is never sent to the namenode,
> because dn1 has crashed. Therefore, from the namenode's point of view, block X still has GS 1001.
> Hence, the block is lost: dn1's replica has no valid meta file, and dn2's replica
> carries GS 1002, which does not match the GS 1001 that the namenode expects.
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and 
> Haryadi Gunawi (haryadi@eecs.berkeley.edu)
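
The crash sequence in the quoted description can be traced with a small toy model. This is only a sketch under stated assumptions, not Hadoop code: the function names (meta_name, update_block, is_valid_replica) are hypothetical, each datanode's current directory is modeled as a plain set of file names, and a replica is assumed usable only if both the block file and a meta file with the namenode's generation stamp are present.

```python
# Toy model of the scenario in the report; this is NOT Hadoop code.
# All function names below are hypothetical and exist only for illustration.

def meta_name(block, gs):
    """Name of the meta file for a block at a given generation stamp (GS)."""
    return f"{block}_{gs}.meta"

def update_block(files, block, old_gs, new_gs, crash_after_tmp_rename=False):
    """Two-step meta rename as described in the report:
    <block>_<old>.meta -> <block>_<old>.meta_tmp<new> -> <block>_<new>.meta.
    Returns False if the simulated crash happens between the two renames."""
    tmp = f"{meta_name(block, old_gs)}_tmp{new_gs}"
    files.remove(meta_name(block, old_gs))
    files.add(tmp)
    if crash_after_tmp_rename:
        return False  # crash: the temporary name is left behind
    files.remove(tmp)
    files.add(meta_name(block, new_gs))
    return True

def is_valid_replica(files, block, gs):
    """Assumption: a replica counts only if the block file and a meta file
    with the expected GS both exist."""
    return block in files and meta_name(block, gs) in files

# Initial state: both datanodes hold blk_X at GS 1001.
dn1 = {"blk_X", "blk_X_1001.meta"}
dn2 = {"blk_X", "blk_X_1001.meta"}
namenode_gs = 1001

# The primary (dn1) updates dn2 first, then crashes during its own update.
update_block(dn2, "blk_X", 1001, 1002)
primary_done = update_block(dn1, "blk_X", 1001, 1002, crash_after_tmp_rename=True)
if primary_done:
    namenode_gs = 1002  # stands in for commitBlockSynchronization

usable = [name for name, files in (("dn1", dn1), ("dn2", dn2))
          if is_valid_replica(files, "blk_X", namenode_gs)]

print(sorted(dn1))  # ['blk_X', 'blk_X_1001.meta_tmp1002']
print(sorted(dn2))  # ['blk_X', 'blk_X_1002.meta']
print(usable)       # []
```

In this model, no replica matches the generation stamp the namenode still holds, which mirrors the conclusion of the report.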


