Posted to hdfs-dev@hadoop.apache.org by "Allen Wittenauer (JIRA)" <ji...@apache.org> on 2014/07/30 22:28:40 UTC

[jira] [Resolved] (HDFS-1238) A block is stuck in ongoingRecovery due to exception not propagated

     [ https://issues.apache.org/jira/browse/HDFS-1238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Allen Wittenauer resolved HDFS-1238.
------------------------------------

    Resolution: Incomplete

Stale.

> A block is stuck in ongoingRecovery due to exception not propagated 
> --------------------------------------------------------------------
>
>                 Key: HDFS-1238
>                 URL: https://issues.apache.org/jira/browse/HDFS-1238
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs-client
>    Affects Versions: 0.20.1
>            Reporter: Thanh Do
>
> - Setup:
> + # datanodes = 2
> + replication factor = 2
> + failure type = transient (i.e., a Java I/O call that throws an IOException or returns false)
> + # failures = 2
> + When/where the failures happen (this is a subtle bug): the first failure is a transient failure at a datanode during the second phase. Because of this first failure, the DFSClient will call recoverBlock. The second failure is injected during this recoverBlock process (i.e., another failure during the recovery itself).
>  
> - Details:
>  
> The expectation here is that, since the DFSClient performs many retries,
> two transient failures should be masked by those retries.
> We found one case where the failures are not transparent to the user.
>  
> Here are the stack traces of when/where the two failures happen (please ignore the line numbers).
>  
> 1. The first failure:
> Exception is thrown at
>     call(void java.io.DataOutputStream.flush())
>     SourceLoc: org/apache/hadoop/hdfs/server/datanode/BlockReceiver.java(252)
> Stack Trace:
>       [0] datanode.BlockReceiver (flush:252)
>       [1] datanode.BlockReceiver (receivePacket:660)
>       [2] datanode.BlockReceiver (receiveBlock:743)
>       [3] datanode.DataXceiver (writeBlock:468)
>       [4] datanode.DataXceiver (run:119)
>  
> 2. The second failure:
> False is returned at
>    call(boolean java.io.File.renameTo(File))
>    SourceLoc: org/apache/hadoop/hdfs/server/datanode/FSDataset.java(105)
> Stack Trace:
>       [0] datanode.FSDataset (tryUpdateBlock:1008)
>       [1] datanode.FSDataset (updateBlock:859)
>       [2] datanode.DataNode (updateBlock:1780)
>       [3] datanode.DataNode (syncBlock:2032)
>       [4] datanode.DataNode (recoverBlock:1962)
>       [5] datanode.DataNode (recoverBlock:2101)
>  
> This is what we found out:
> The first failure causes the DFSClient to (somehow) call recoverBlock,
> which forces us to see the 2nd failure. The 2nd failure makes
> renameTo return false, which in turn causes an IOException to be thrown
> from the function that calls renameTo.
> But this IOException is not propagated properly!
> It is dropped inside DN.syncBlock(). Specifically, DN.syncBlock()
> calls DN.updateBlock(), which hits the exception, but syncBlock
> only catches it and prints a warning without propagating the exception.
> Thus syncBlock returns without any exception,
> and thus recoverBlock returns without executing the finally{} block
> (see below).
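>
> To make the swallow concrete, here is a minimal sketch of the pattern
> (our own illustration, not the actual 0.20.1 source; the class, method
> signatures, and logger are simplified placeholders):
>
>     import java.io.IOException;
>     import java.util.List;
>     import java.util.logging.Logger;
>
>     class SyncBlockSketch {
>       private static final Logger LOG = Logger.getLogger("DataNode");
>
>       // Stand-in for DN.updateBlock(); in the reported run it fails because
>       // File.renameTo() returned false inside FSDataset.tryUpdateBlock(),
>       // which is then surfaced as an IOException.
>       static void updateBlock(String datanode, long blockId) throws IOException {
>         throw new IOException("cannot rename meta file for block " + blockId);
>       }
>
>       // Sketch of the swallow: the IOException is caught and logged,
>       // but never rethrown, so the caller believes syncBlock() succeeded.
>       static void syncBlock(long blockId, List<String> datanodes) {
>         for (String dn : datanodes) {
>           try {
>             updateBlock(dn, blockId);
>           } catch (IOException e) {
>             LOG.warning("Failed to update block " + blockId + " on " + dn + ": " + e);
>           }
>         }
>       }
>     }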
>  
> Now, the client will retry recoverBlock 3-5 more times,
> but these retries always see exceptions! The reason is that the first
> time we call recoverBlock(blk), the block is put into
> the ongoingRecovery list inside DN.recoverBlock().
> Normally, blk is only removed (ongoingRecovery.remove(blk)) inside the finally{} block.
> But since the exception is not propagated properly, that finally{}
> block is never called, so the block is stuck
> forever in the ongoingRecovery list. Hence, the next time
> the client performs a retry, it gets the error message
> "Block ... is already being recovered" and recoverBlock() throws
> an IOException. As a result, the client, which drives this whole
> process from processDatanodeError, returns
> from that function with closed = true, and hence it never
> retries the whole thing again from the beginning; instead it
> just returns an error.
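>
> For reference, the retry failure can be illustrated with a similarly
> simplified sketch (again our own illustration, not the actual source; in
> the real DataNode, ongoingRecovery is a map and the cleanup lives in the
> finally{} block described above):
>
>     import java.io.IOException;
>     import java.util.HashSet;
>     import java.util.Set;
>
>     class RecoverBlockSketch {
>       // Blocks that currently have a recovery in progress on this datanode.
>       private final Set<Long> ongoingRecovery = new HashSet<Long>();
>
>       void recoverBlock(long blockId) throws IOException {
>         synchronized (ongoingRecovery) {
>           if (!ongoingRecovery.add(blockId)) {
>             // Every client retry hits this once the entry from the first
>             // attempt is left behind in ongoingRecovery.
>             throw new IOException("Block " + blockId + " is already being recovered");
>           }
>         }
>         // ... recovery work (syncBlock, etc.) ...
>         // In the real code the entry is removed again when recovery finishes;
>         // per the report that cleanup is never reached after the swallowed
>         // exception, so the entry leaks and all later retries fail above.
>       }
>     }
>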
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and
> Haryadi Gunawi (haryadi@eecs.berkeley.edu)



--
This message was sent by Atlassian JIRA
(v6.2#6252)