Posted to hdfs-issues@hadoop.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/05/05 06:49:00 UTC

[jira] [Commented] (HDFS-16989) Large scale block transfer causes too many excess blocks

    [ https://issues.apache.org/jira/browse/HDFS-16989?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17719690#comment-17719690 ] 

ASF GitHub Bot commented on HDFS-16989:
---------------------------------------

hfutatzhanghb commented on PR #5593:
URL: https://github.com/apache/hadoop/pull/5593#issuecomment-1535794874

   Hi @goiri, could you please take a look at this when you have time? Thanks a lot.




> Large scale block transfer causes too many excess blocks
> --------------------------------------------------------
>
>                 Key: HDFS-16989
>                 URL: https://issues.apache.org/jira/browse/HDFS-16989
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.4.0, 3.3.5
>            Reporter: farmmamba
>            Priority: Critical
>              Labels: pull-request-available
>
> Recently, we changed the replication factor of a 1.6 PB directory from 2 to 3. The directory contains 76 million blocks. After we executed the setrep command, the active namenode printed many logs like the one below:
> {code:java}
> PendingReconstructionMonitor timed out blk_xxxx_260285131{code}
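> For context, a minimal sketch of the equivalent client-side call, assuming a hypothetical file path (the shell's "hdfs dfs -setrep -R 3 /dir" effectively issues this per file under the directory):
> {code:java}
> import org.apache.hadoop.conf.Configuration;
> import org.apache.hadoop.fs.FileSystem;
> import org.apache.hadoop.fs.Path;
>
> public class SetRepSketch {
>   public static void main(String[] args) throws Exception {
>     Configuration conf = new Configuration();
>     try (FileSystem fs = FileSystem.get(conf)) {
>       // setReplication only records the new target factor; the namenode then
>       // queues reconstruction work for every now-under-replicated block.
>       boolean accepted = fs.setReplication(new Path("/data/some-file"), (short) 3);
>       System.out.println("replication change accepted: " + accepted);
>     }
>   }
> }
> {code}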
> and many datanodes printed lots of duplicate logs like the ones below:
> {code:java}
> 2023-04-21 13:58:17,627 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(1.1.1.1:50010, datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010 
> 2023-04-21 14:21:21,296 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DataTransfer, at 1.1.1.1:50010: Transmitted BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 (numBytes=524384907) to /2.2.2.2:50010
> 2023-04-21 14:34:19,134 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(1.1.1.1:50010, datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
> 2023-04-21 14:37:58,207 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(1.1.1.1:50010, datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
> 2023-04-21 14:40:46,817 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(1.1.1.1:50010, datanodeUuid=f3081eac-983f-4c3f-99c8-e4830640ee90, infoPort=50075, infoSecurePort=0, ipcPort=8010, storageInfo=lv=-57;cid=yj-hdfs2;nsid=1882889931;c=1667291826362) Starting thread to transfer BP-578784987-x.x.x.x-1667291826362:blk_1333463885_260285131 to 2.2.2.2:50010
>  {code}
> Strangely, the same block blk_1333463885_260285131 was transferred multiple times even though it had already been transmitted successfully. The excess transfer requests trigger lots of ReplicaAlreadyExistsException in the target datanode, because the replica has already been transmitted and its state is FINALIZED.
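> To illustrate, a simplified sketch (not the actual FsDatasetImpl code) of why the redundant transfers are rejected on the target datanode, modelling the replica-state check with plain Java types:
> {code:java}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Simplified stand-in for the datanode's replica map. In real HDFS the
> // check lives in FsDatasetImpl (e.g. createTemporary), which throws
> // ReplicaAlreadyExistsException for a replica that is already FINALIZED.
> class MiniDataset {
>   enum ReplicaState { TEMPORARY, RBW, FINALIZED }
>
>   private final Map<Long, ReplicaState> replicaMap = new ConcurrentHashMap<>();
>
>   void startTransfer(long blockId) {
>     ReplicaState existing = replicaMap.get(blockId);
>     if (existing == ReplicaState.FINALIZED) {
>       // This is what every excess transfer request runs into:
>       throw new IllegalStateException( // stands in for ReplicaAlreadyExistsException
>           "Block " + blockId + " already exists in state " + existing);
>     }
>     replicaMap.put(blockId, ReplicaState.TEMPORARY);
>   }
>
>   void finalizeReplica(long blockId) {
>     replicaMap.put(blockId, ReplicaState.FINALIZED);
>   }
> }
> {code}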
>  
> The root cause lies in RedundancyMonitor#processPendingReconstructions and BlockManager#validateReconstructionWork:
> 1. RedundancyMonitor#computeDatanodeWork() generates transfer tasks from neededReconstruction via addTaskToDatanode, and puts the tasks into pendingReconstruction.
> 2. We set *dfs.namenode.replication.work.multiplier.per.iteration = 200*, and this cluster has 400 datanodes, so RedundancyMonitor may generate up to 200 * 400 = 80,000 block transfer tasks per iteration. After dfs.namenode.reconstruction.pending.timeout-sec (5 min), the requests in pendingReconstruction time out, and the PendingReconstructionMonitor thread moves them into timedOutItems.
> 3. RedundancyMonitor#processPendingReconstructions() puts the requests from timedOutItems back into neededReconstruction.
> 4. The monitor then sleeps: TimeUnit.MILLISECONDS.sleep(redundancyRecheckIntervalMs);
> 5. In the next iteration of the while loop, RedundancyMonitor#computeDatanodeWork() generates transfer tasks from neededReconstruction again; this produces repeated tasks, possibly with different target nodes (due to the chooseTarget method), as sketched after this list.
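> A minimal sketch of that feedback loop (data structures simplified; only the timeout and multiplier values come from the description above):
> {code:java}
> import java.util.ArrayDeque;
> import java.util.HashMap;
> import java.util.Map;
> import java.util.Queue;
>
> // Illustrative model of the loop described in steps 1-5. With
> // dfs.namenode.replication.work.multiplier.per.iteration = 200 and 400
> // datanodes, up to 200 * 400 = 80,000 tasks can be queued per iteration,
> // far more than the cluster can finish within the 5-minute pending timeout.
> class RedundancyLoopSketch {
>   static final long PENDING_TIMEOUT_MS = 5 * 60 * 1000L;
>
>   final Queue<Long> neededReconstruction = new ArrayDeque<>();   // blocks awaiting a new replica
>   final Map<Long, Long> pendingReconstruction = new HashMap<>(); // blockId -> time scheduled
>
>   void computeDatanodeWork(long now) {
>     Long blockId;
>     while ((blockId = neededReconstruction.poll()) != null) {
>       // A fresh task is created even if an identical transfer is still in
>       // flight, and chooseTarget may pick a different destination, so the
>       // same block can be copied to several datanodes -> excess replicas.
>       pendingReconstruction.put(blockId, now);
>     }
>   }
>
>   void processPendingReconstructions(long now) {
>     // Timed-out entries are re-queued, restarting the cycle for blocks whose
>     // original (slow but ultimately successful) transfer is still running.
>     pendingReconstruction.entrySet().removeIf(e -> {
>       if (now - e.getValue() > PENDING_TIMEOUT_MS) {
>         neededReconstruction.add(e.getKey());
>         return true;
>       }
>       return false;
>     });
>   }
> }
> {code}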


