Posted to hdfs-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/06/05 09:59:00 UTC

[jira] [Work logged] (HDFS-16601) Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try

     [ https://issues.apache.org/jira/browse/HDFS-16601?focusedWorklogId=778442&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778442 ]

ASF GitHub Bot logged work on HDFS-16601:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 05/Jun/22 09:58
            Start Date: 05/Jun/22 09:58
    Worklog Time Spent: 10m 
      Work Description: ZanderXu commented on PR #4369:
URL: https://github.com/apache/hadoop/pull/4369#issuecomment-1146776005

   @Hexiaoqiao @MingXiangLi can you help me review this patch? thanks~




Issue Time Tracking
-------------------

    Worklog Id:     (was: 778442)
    Time Spent: 0.5h  (was: 20m)

> Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HDFS-16601
>                 URL: https://issues.apache.org/jira/browse/HDFS-16601
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In our production environment, we encountered a bug with the following exception stack trace:
> {code:java}
> java.io.IOException: Failed to replace a bad datanode on the existing pipeline due to no more good datanodes being available to try. (Nodes: current=[DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK], DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK]], original=[DatanodeInfoWithStorage[127.0.0.1:59670,DS-0d652bc2-1784-430d-961f-750f80a290f1,DISK], DatanodeInfoWithStorage[127.0.0.1:59687,DS-b803febc-7b22-4144-9b39-7bf521cdaa8d,DISK]]). The current failed datanode replacement policy is DEFAULT, and a client may configure this via 'dfs.client.block.write.replace-datanode-on-failure.policy' in its configuration.
> 	at org.apache.hadoop.hdfs.DataStreamer.findNewDatanode(DataStreamer.java:1418)
> 	at org.apache.hadoop.hdfs.DataStreamer.addDatanode2ExistingPipeline(DataStreamer.java:1478)
> 	at org.apache.hadoop.hdfs.DataStreamer.handleDatanodeReplacement(DataStreamer.java:1704)
> 	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineInternal(DataStreamer.java:1605)
> 	at org.apache.hadoop.hdfs.DataStreamer.setupPipelineForAppendOrRecovery(DataStreamer.java:1587)
> 	at org.apache.hadoop.hdfs.DataStreamer.processDatanodeOrExternalError(DataStreamer.java:1371)
> 	at org.apache.hadoop.hdfs.DataStreamer.run(DataStreamer.java:674)
> {code}
> The root cause is that the DFSClient cannot perceive exceptions that occur during the transferBlock step of pipeline recovery. If transferBlock fails, the DFSClient keeps retrying with every other datanode in the cluster and ultimately fails the write.
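
For context on the configuration knob named in the exception message, here is a minimal sketch (not part of the reported patch) of how a client can tune the replace-datanode-on-failure behavior programmatically. The policy property name comes from the exception text; the best-effort property is the standard companion HDFS client setting, and the namenode URI is a placeholder.

{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ReplaceDatanodePolicyExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Policy referenced in the exception message; DEFAULT is the usual value.
        conf.set("dfs.client.block.write.replace-datanode-on-failure.policy", "DEFAULT");
        // Best-effort lets the write continue with the remaining datanodes when
        // no replacement can be found, instead of failing the pipeline outright.
        conf.setBoolean("dfs.client.block.write.replace-datanode-on-failure.best-effort", true);
        // Placeholder namenode address, for illustration only.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);
        System.out.println("replace-datanode-on-failure policy: "
            + conf.get("dfs.client.block.write.replace-datanode-on-failure.policy"));
        fs.close();
    }
}
{code}

The same properties can equally be set in the client's hdfs-site.xml. Note that best-effort only softens the client-side symptom; the linked PR targets the underlying handling of transferBlock failures during pipeline recovery.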



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org