Posted to hdfs-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/05/05 03:36:00 UTC

[jira] [Commented] (HDFS-16987) NameNode should remove all invalid corrupted blocks when starting active service

    [ https://issues.apache.org/jira/browse/HDFS-16987?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719616#comment-17719616 ] 

ASF GitHub Bot commented on HDFS-16987:
---------------------------------------

ZanderXu commented on PR #5583:
URL: https://github.com/apache/hadoop/pull/5583#issuecomment-1535657249

   @Hexiaoqiao @ayushtkn Master, after thinking it over, maybe we can only fix this problem in `processAllPendingDNMessages`, because the namenode doesn't know whether a given report is consistent with the replica actually stored on the DataNode.
   
   **Case 1: The report with the smaller GS is a postponed report, which differs from the actual replica on the DataNode.**
   For example:
   
   - The actual replica on the DN is: blk_1024_1002
   - The postponed report is: blk_1024_1001
   
   For this case, the namenode can ignore the postponed report and should not mark the replica as corrupted.
   
   **Case 2: The report with the smaller GS is the newest report, which matches the actual replica on the DataNode.**
   For example:
   
   - The actual replica on the DN is: blk_1024_1001
   - The report is: blk_1024_1001
   - The storages of this block in the namenode already contain this DN
   
   For this case, the namenode shouldn't ignore the report; it should mark this replica as corrupted. Manually modifying block storage files on the DataNode can produce this case.
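   The ambiguity between the two cases can be illustrated with a small sketch (class and method names are hypothetical, not Hadoop's API): from the report alone, the namenode sees exactly the same input either way.

   ```java
   // Hypothetical illustration (names are not Hadoop's API): the NameNode
   // cannot distinguish case 1 from case 2 using only the reported GS.
   public class ReportAmbiguityDemo {
       /** True when the reported GS is older than the GS stored in NN memory. */
       static boolean hasSmallerGenStamp(long reportedGS, long storedGS) {
           return reportedGS < storedGS;
       }

       public static void main(String[] args) {
           long storedGS = 1002; // blk_1024_1002 recorded in NameNode memory
           // Case 1: DN actually holds blk_1024_1002; report blk_1024_1001 is stale.
           // Case 2: DN actually holds blk_1024_1001; report blk_1024_1001 is current.
           // Either way, the NameNode observes the same condition:
           System.out.println(hasSmallerGenStamp(1001, storedGS)); // prints "true"
       }
   }
   ```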
   
   
   At present, the namenode can only treat each report as the newest report and update the block's state in namenode memory accordingly, because the datanode reports its state to the NN only through block reports or blockReceivedAndDeleted.
   
   
   If we modify the logic of `markBlockAsCorrupt`, the namenode will no longer be able to mark the replica as corrupted for case 2.
   If we modify the logic of `processAllPendingDNMessages`, the postponed message will be temporarily ignored for case 2, and the active namenode will mark the replica as corrupted on the next block report from the corresponding DN.
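   A minimal sketch of that second direction, assuming a simplified pending-message queue (all names here are illustrative, not the actual Hadoop code):

   ```java
   import java.util.ArrayDeque;
   import java.util.Queue;

   // Hypothetical sketch of filtering postponed reports when the NameNode
   // starts active services; names are illustrative, not Hadoop's API.
   public class PendingMessageSketch {
       record PendingMessage(long blockId, long reportedGenStamp) {}

       /** GS currently recorded for the block in NameNode memory (assumed lookup). */
       static long storedGenStamp(long blockId) {
           return 1002; // e.g. blk_1024_1002 after the append edits were replayed
       }

       /**
        * Drain queued DN messages on failover. A message with a smaller GS is
        * temporarily ignored instead of marking the replica corrupted; if the
        * replica really is stale (case 2), the DN's next full block report will
        * surface it again and the active NN can mark it corrupted then.
        */
       static int processAllPendingDNMessages(Queue<PendingMessage> queue) {
           int ignored = 0;
           while (!queue.isEmpty()) {
               PendingMessage msg = queue.poll();
               if (msg.reportedGenStamp() < storedGenStamp(msg.blockId())) {
                   ignored++; // postponed report: skip, don't mark corrupt now
               } else {
                   // apply the message normally (omitted in this sketch)
               }
           }
           return ignored;
       }

       public static void main(String[] args) {
           Queue<PendingMessage> queue = new ArrayDeque<>();
           queue.add(new PendingMessage(1024, 1001)); // smaller GS: ignored
           queue.add(new PendingMessage(1024, 1002)); // current GS: applied
           System.out.println("ignored: " + processAllPendingDNMessages(queue));
       }
   }
   ```

   The trade-off this sketch encodes is the one described above: a genuinely corrupted replica (case 2) is detected one block-report interval later, in exchange for never corrupting state from a stale postponed report (case 1).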




> NameNode should remove all invalid corrupted blocks when starting active service
> --------------------------------------------------------------------------------
>
>                 Key: HDFS-16987
>                 URL: https://issues.apache.org/jira/browse/HDFS-16987
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Critical
>              Labels: pull-request-available
>
> In our prod environment, we encountered an incident where an HA failover produced new corrupted blocks, causing some jobs to fail.
>  
> We traced it down and found a bug in the processing of pending DN messages when starting the active service.
> The steps to reproduce are as follows:
>  # Suppose NN1 is active and NN2 is standby; the active works well but the standby is unstable.
>  # Timing 1: the client creates a file, writes some data, and closes it.
>  # Timing 2: the client appends to this file, writes some data, and closes it.
>  # Timing 3: the standby replays the second close edits of this file.
>  # Timing 4: the standby processes the blockReceivedAndDeleted of the first create operation.
>  # Timing 5: the standby processes the blockReceivedAndDeleted of the second append operation.
>  # Timing 6: the admin switches the active namenode from NN1 to NN2.
>  # Timing 7: the client fails to append data to this file.
> {code:java}
> org.apache.hadoop.ipc.RemoteException(java.io.IOException): append: lastBlock=blk_1073741825_1002 of src=/testCorruptedBlockAfterHAFailover is not sufficiently replicated yet.
>     at org.apache.hadoop.hdfs.server.namenode.FSDirAppendOp.appendFile(FSDirAppendOp.java:138)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:2992)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:858)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:527)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1221)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1144)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1953)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3170) {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-help@hadoop.apache.org