Posted to hdfs-dev@hadoop.apache.org by "Thanh Do (JIRA)" <ji...@apache.org> on 2010/06/17 07:44:25 UTC

[jira] Created: (HDFS-1236) Client uselessly retries recoverBlock 5 times

Client uselessly retries recoverBlock 5 times
---------------------------------------------

                 Key: HDFS-1236
                 URL: https://issues.apache.org/jira/browse/HDFS-1236
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 0.20.1
            Reporter: Thanh Do


Summary:
Client uselessly retries recoverBlock 5 times
The same behavior is also seen in the append protocol (HDFS-1229)

The setup:
+ # available datanodes = 4
+ Replication factor = 2 (hence there are 2 datanodes in the pipeline)
+ Failure type = bad disk at a datanode (not a crash)
+ # failures = 2
+ # disks / datanode = 1
+ Where/when the failures happen: the disk at each of the two datanodes in the pipeline goes bad at the same time, during the 2nd phase of the pipeline (the data transfer phase). A sketch of such a setup is given below.
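For illustration only, here is a minimal sketch (not part of the original experiment) of how a comparable setup could be created with MiniDFSCluster: 4 datanodes, replication factor 2, and a file written through a 2-datanode pipeline. The file path and sizes are placeholders, and the bad-disk failure injection itself is not shown.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class PipelineSetupSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setInt("dfs.replication", 2);               // replication factor = 2

    // 4 available datanodes, freshly formatted, default rack assignment
    MiniDFSCluster cluster = new MiniDFSCluster(conf, 4, true, null);
    try {
      cluster.waitActive();
      FileSystem fs = cluster.getFileSystem();

      // Write through a 2-datanode pipeline; in the experiment the disks
      // at both pipeline datanodes go bad during this data transfer.
      FSDataOutputStream out = fs.create(new Path("/pipeline-test"),
          true, 4096, (short) 2, 64 * 1024 * 1024L);
      out.write(new byte[1024 * 1024]);
      out.close();
    } finally {
      cluster.shutdown();
    }
  }
}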
 
Details:
 
In this case, the client calls processDatanodeError,
which calls datanode.recoverBlock() on those two datanodes.
But since both datanodes have bad disks (although they are still alive),
recoverBlock() fails.
Here, the client's retry logic ends when the streamer is closed (closed == true).
But before this happens, the client retries 5 times
(maxRecoveryErrorCount) and fails every time, until
it gives up. What is interesting is that
during each retry there is a 1-second wait in
DataStreamer.run (i.e. dataQueue.wait(1000)).
So there is roughly a 5-second total wait before the client declares failure.
A simplified sketch of this retry pattern is given below.
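To make the timing concrete, here is a simplified, self-contained sketch of the retry pattern described above (this is not the actual DFSClient/DataStreamer code; recoverBlock() below is a stand-in that always fails, as it does when both pipeline datanodes have bad disks):

public class RecoverBlockRetrySketch {
  // corresponds to maxRecoveryErrorCount in DFSClient
  private static final int MAX_RECOVERY_ERROR_COUNT = 5;
  private final Object dataQueue = new Object();

  private boolean recoverBlock() {
    return false;  // both pipeline datanodes have bad disks, so recovery always fails
  }

  public void run() throws InterruptedException {
    int recoveryErrorCount = 0;
    long start = System.currentTimeMillis();
    while (true) {
      if (recoverBlock()) {
        break;  // recovery succeeded, the pipeline can continue
      }
      if (++recoveryErrorCount > MAX_RECOVERY_ERROR_COUNT) {
        long waited = (System.currentTimeMillis() - start) / 1000;
        System.out.println("Gave up after " + MAX_RECOVERY_ERROR_COUNT
            + " retries and ~" + waited + "s of waiting");
        break;  // streamer closes; the namenode is never asked for new datanodes
      }
      synchronized (dataQueue) {
        dataQueue.wait(1000);  // the 1-second wait in DataStreamer.run
      }
    }
  }

  public static void main(String[] args) throws InterruptedException {
    new RecoverBlockRetrySketch().run();
  }
}

Running this prints the give-up message after roughly 5 seconds, which matches the 5-second wait described above.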
 
This is a different bug from HDFS-1235, where the client retries
3 times for 6 seconds (resulting in a 25-second wait time).
In this experiment, the total wait time we observe is only
12 seconds (we are not sure why it is 12). So the DFSClient quits without
contacting the namenode again (say, to ask for a new set of
two datanodes). Interestingly, this is another
bug showing that the client retry logic is complex and non-deterministic,
depending on where and when failures happen.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Resolved: (HDFS-1236) Client uselessly retries recoverBlock 5 times

Posted by "Todd Lipcon (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/HDFS-1236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon resolved HDFS-1236.
-------------------------------

    Resolution: Invalid

I don't consider the retry useless - there may be transient errors preventing recovery (e.g. network errors). The 6-second sleep is addressed by HDFS-1054.

> Client uselessly retries recoverBlock 5 times
> ---------------------------------------------
>
>                 Key: HDFS-1236
>                 URL: https://issues.apache.org/jira/browse/HDFS-1236
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: hdfs client
>    Affects Versions: 0.20.1
>            Reporter: Thanh Do
>
> This bug was found by our Failure Testing Service framework:
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-98.html
> For questions, please email us: Thanh Do (thanhdo@cs.wisc.edu) and
> Haryadi Gunawi (haryadi@eecs.berkeley.edu)
