Posted to hdfs-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2022/06/06 14:23:00 UTC

[jira] [Work logged] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely

     [ https://issues.apache.org/jira/browse/HDFS-16064?focusedWorklogId=778639&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778639 ]

ASF GitHub Bot logged work on HDFS-16064:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 06/Jun/22 14:22
            Start Date: 06/Jun/22 14:22
    Worklog Time Spent: 10m 
      Work Description: KevinWikant opened a new pull request, #4410:
URL: https://github.com/apache/hadoop/pull/4410

   HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas
   
   ### Description of PR
   
   Bug fix for a recurring HDFS bug which can leave datanodes unable to complete decommissioning indefinitely. In short, the bug is a chicken & egg problem where:
   - in order for a datanode to be decommissioned, its blocks must be sufficiently replicated
   - the datanode cannot sufficiently replicate some block(s) because of corrupt block replicas on the target datanodes
   - the corrupt block replicas will not be invalidated because the block(s) are not sufficiently replicated
   
   In this scenario the block(s) actually are sufficiently replicated (counting the replicas on decommissioning nodes), but the logic the Namenode uses to determine whether a block is sufficiently replicated only counts live replicas, which is the flaw.
   
   To understand the bug further we must first establish some background information.
   
   #### Background Information
   
   Givens:
   - an FSDataOutputStream is being used to write the HDFS file; under the hood this uses the DataStreamer class (a client-side sketch of this write pattern follows this list)
   - for the sake of example we will say the HDFS file has a replication factor of 2, though this is not a requirement to reproduce the issue
   - the file is being appended to intermittently over an extended period of time (in general, this issue takes minutes to hours to reproduce)
   - HDFS is configured with typical default configurations
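
   The following is a minimal, hypothetical client-side sketch of that write pattern (the path, record format, and sleep interval are illustrative and not taken from the original report):

   ```java
   import org.apache.hadoop.conf.Configuration;
   import org.apache.hadoop.fs.FSDataOutputStream;
   import org.apache.hadoop.fs.FileSystem;
   import org.apache.hadoop.fs.Path;

   import java.nio.charset.StandardCharsets;

   public class IntermittentAppender {
     public static void main(String[] args) throws Exception {
       Configuration conf = new Configuration();
       conf.setInt("dfs.replication", 2); // replication factor of 2, as in the example
       FileSystem fs = FileSystem.get(conf);
       Path file = new Path("/tmp/append-example.log"); // illustrative path

       // Create the file once, then append to it intermittently over a long period.
       if (!fs.exists(file)) {
         fs.create(file, (short) 2).close();
       }
       for (int i = 0; i < 100; i++) {
         // Each append opens a new DataStreamer pipeline under the hood.
         try (FSDataOutputStream out = fs.append(file)) {
           out.write(("record-" + i + "\n").getBytes(StandardCharsets.UTF_8));
           out.hflush();
         }
         Thread.sleep(60_000L); // intermittent writes, minutes apart
       }
     }
   }
   ```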
   
   Under certain scenarios the DataStreamer client can detect a bad link when trying to append to the block pipeline; in this case the DataStreamer client can shift the block pipeline by replacing the bad link with a new datanode. When this happens, the replica on the datanode that was shifted away from becomes corrupt because it no longer has the latest generation stamp for the block. As a more concrete example:
   - the DataStreamer client creates a block pipeline on datanodes A & B, each of which has a block replica with generation stamp 1
   - the DataStreamer client tries to append to the block by sending a block transfer (with generation stamp 2) to datanode A
   - datanode A succeeds in writing the block transfer & then attempts to forward the transfer to datanode B
   - datanode B fails the transfer for some reason and responds with a PipelineAck failure code
   - datanode A sends a PipelineAck to the DataStreamer indicating that datanode A succeeded in the append & datanode B failed. The DataStreamer marks datanode B as a bad link to be replaced before the next append operation
   - at this point datanode A has a live replica with generation stamp 2 & datanode B has a corrupt replica with generation stamp 1
   - the next time the DataStreamer tries to append to the block, it calls the Namenode "getAdditionalDatanode" API, which returns some other datanode C
   - the DataStreamer sends the data transfer (with generation stamp 3) to the new block pipeline containing datanodes A & C, and the append succeeds on both datanodes
   - end state is that:
     - datanodes A & C have live replicas with the latest generation stamp 3
     - datanode B has a corrupt replica because it is lagging behind with generation stamp 1
   
   The key behavior being highlighted here is that when the DataStreamer client shifts the block pipeline due to append failures on a subset of the datanodes in the pipeline, a corrupt block replica is left behind on the datanode that was shifted away from.
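
   To make the generation stamp reasoning concrete, the following is a hedged, simplified illustration; the class, method, and variable names are hypothetical and this is not the actual Namenode/BlockManager code:

   ```java
   public class GenerationStampExample {
     // Hypothetical illustration only: a replica whose generation stamp is older
     // than the block's latest generation stamp is stale and will be reported as
     // corrupt once the datanode block-reports it to the Namenode.
     static boolean isReplicaStale(long replicaGenStamp, long latestGenStamp) {
       return replicaGenStamp < latestGenStamp;
     }

     public static void main(String[] args) {
       long latestGenStamp = 3;   // datanodes A & C after the final append
       long leftoverGenStamp = 1; // datanode B, which was shifted out of the pipeline
       System.out.println(isReplicaStale(leftoverGenStamp, latestGenStamp)); // true
     }
   }
   ```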
   
   This corrupt block replica makes the datanode ineligible as a replication target for the block because of the following exception:
   
   ```
   2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 dst: /DN3:9866; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
   ```
   
   What typically occurs is that these corrupt block replicas are invalidated by the Namenode, which causes the corrupt replica to be deleted on the datanode, thus allowing the datanode to once again be a replication target for the block. Note that the Namenode will not identify the corrupt block replica until the datanode sends its next block report; this can take up to 6 hours with the default block report interval.
   
   As long as there is 1 live replica of the block, all the corrupt replicas should be invalidated based on this condition: https://github.com/apache/hadoop/blob/7bd7725532fd139d2e0e1662df7700f7ab95067a/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L1928
   
   When there are 0 live replicas the corrupt replicas are not invalidated; I assume the reasoning behind this is that it is better to have some corrupt copies of the block than to have no copies at all.
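
   Paraphrasing that decision as a hedged sketch (the helper names below are hypothetical stand-ins, not the exact BlockManager source):

   ```java
   // Hedged paraphrase of the linked condition: a corrupt replica is invalidated
   // immediately only if enough LIVE replicas exist (dfs.namenode.replication.min,
   // default 1). With 0 live replicas, the corrupt copies are kept instead.
   void maybeInvalidateCorruptReplica(int numLiveReplicas, int minReplication) {
     boolean hasEnoughLiveReplicas = numLiveReplicas >= minReplication;
     if (hasEnoughLiveReplicas) {
       invalidateCorruptReplica();    // stand-in for the Namenode invalidation path
     } else {
       // Better to keep corrupt copies of the block than to lose every copy.
       queueForLaterReconstruction(); // hypothetical helper
     }
   }
   ```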
   
   #### Description of Problem
   
   The problem comes into play when the aforementioned behavior is coupled with decommissioning and/or entering maintenance.
   
   Consider the following scenario:
   - block has replication factor of 2
   - there are 3 datanodes A, B, & C
   - datanode A has decommissioning replica
   - datanodes B & C have corrupt replicas
   
   This scenario is essentially a decommissioning & replication deadlock because:
   - corrupt replicas on B & C will not be invalidated because there are 0 live replicas (as per Namenode logic)
   - datanode A cannot finish decommissioning until the block is replicated to datanodes B & C
   - the block cannot be replicated to datanodes B & C until their corrupt replicas are invalidated
   
   This does not need to be a deadlock: the corrupt replicas could be invalidated & the live replica could then be transferred from A to B & C.
   
   The same problem can exist on a larger scale; the requirements are that:
   - liveReplicas < minReplicationFactor (minReplicationFactor=1 by default)
   - decommissioningAndEnteringMaintenanceReplicas > 0
   - liveReplicas + decommissioningAndEnteringMaintenanceReplicas + corruptReplicas = numberOfDatanodes
   
   In this case the corrupt replicas will not be invalidated by the Namenode, which means the blocks on the decommissioning and entering maintenance datanodes will never be sufficiently replicated, and therefore those datanodes will never finish decommissioning or entering maintenance.
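
   Expressed as a hedged predicate (illustrative names only, not code from the patch), the stuck state described above is:

   ```java
   // Illustrative only: the replication/decommissioning deadlock described above.
   static boolean decommissioningStuck(int liveReplicas,
                                       int decommissioningAndEnteringMaintenanceReplicas,
                                       int corruptReplicas,
                                       int numberOfDatanodes,
                                       int minReplicationFactor) {
     return liveReplicas < minReplicationFactor                // corrupt replicas won't be invalidated
         && decommissioningAndEnteringMaintenanceReplicas > 0  // nodes are waiting on this block
         && liveReplicas + decommissioningAndEnteringMaintenanceReplicas + corruptReplicas
            == numberOfDatanodes;                              // no spare datanodes to replicate to
   }
   ```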
   
   The symptom of this issue in the logs is that right after a node with a corrupt replica sends its block report, rather than the corrupt replica being invalidated it just gets counted as a corrupt replica:
   
   ```
   TODO
   ```
   
   #### Proposed Solution
   
   Rather than checking whether minimum replication is satisfied based on live replicas alone, check based on usable replicas (i.e. live + decommissioning + enteringMaintenance). This allows the corrupt replicas to be invalidated & the live replica(s) on the decommissioning node(s) to be sufficiently replicated.
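
   A minimal sketch of the modified check (parameter names are illustrative; the exact code in the patch may differ):

   ```java
   // Sketch of the proposed condition: count usable replicas, not just live ones.
   static boolean isMinReplicationSatisfied(int liveReplicas,
                                            int decommissioningReplicas,
                                            int enteringMaintenanceReplicas,
                                            int minReplication) {
     int usableReplicas = liveReplicas + decommissioningReplicas + enteringMaintenanceReplicas;
     // If satisfied, the corrupt replicas can be invalidated even though the only
     // up-to-date copies currently sit on decommissioning/entering-maintenance nodes.
     return usableReplicas >= minReplication;
   }
   ```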
   
   The only perceived risk here would be that the corrupt replicas are invalidated at around the same time the decommissioning and entering maintenance nodes are decommissioned. This could in theory bring the overall number of replicas below the minimum replication factor (to 0 in the worst case). This is, however, not an actual risk, because the decommissioning and entering maintenance nodes will not finish decommissioning until their blocks have a sufficient number of live replicas; so there is no possibility that the decommissioning and entering maintenance nodes will be decommissioned prematurely.
   
   ### How was this patch tested?
   
   Added a unit test "testDeleteCorruptReplicaForUnderReplicatedBlock"
   
   - TODO
   
   ### For code changes:
   
   - [X] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [n/a] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [n/a] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [n/a] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
   
   




Issue Time Tracking
-------------------

            Worklog Id:     (was: 778639)
    Remaining Estimate: 0h
            Time Spent: 10m

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
>                 Key: HDFS-16064
>                 URL: https://issues.apache.org/jira/browse/HDFS-16064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.2.1
>            Reporter: Kevin Wikant
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> It seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a non-issue under the assumption that if the namenode & a datanode get into an inconsistent state for a given block pipeline, there should be another datanode available to replicate the block to
> While testing datanode decommissioning using "dfs.exclude.hosts", I have encountered a scenario where the decommissioning gets stuck indefinitely
> Below is the progression of events:
>  * there are initially 4 datanodes DN1, DN2, DN3, DN4
>  * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
>  * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in order to satisfy their minimum replication factor of 2
>  * during this replication process https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes the following inconsistent state:
>  ** DN3 thinks it has the block pipeline in FINALIZED state
>  ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 dst: /DN3:9866; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
>  * the replication is attempted again, but:
>  ** DN4 has the block
>  ** DN1 and/or DN2 have the block, but don't count towards the minimum replication factor because they are being decommissioned
>  ** DN3 does not have the block & cannot have the block replicated to it because of HDFS-721
>  * the namenode repeatedly tries to replicate the block to DN3 & repeatedly fails; this continues indefinitely
>  * therefore DN4 is the only live datanode with the block & the minimum replication factor of 2 cannot be satisfied
>  * because the minimum replication factor cannot be satisfied for the block(s) being moved off DN1 & DN2, the datanode decommissioning can never be completed 
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of DataNode decommissioning
> A few potential solutions:
>  * Address the root cause of the problem which is an inconsistent state between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
>  * Detect when datanode decommissioning is stuck due to lack of available datanodes for satisfying the minimum replication factor, then recover by re-enabling the datanodes being decommissioned
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
