Posted to hdfs-dev@hadoop.apache.org by "gurmukh singh (JIRA)" <ji...@apache.org> on 2019/03/06 04:52:00 UTC
[jira] [Resolved] (HDFS-7134) Replication count for a block should not update till the blocks have settled on Datanodes

     [ https://issues.apache.org/jira/browse/HDFS-7134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

gurmukh singh resolved HDFS-7134.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 3.1.0
     Release Note: This is resolved in 3.1

> Replication count for a block should not update till the blocks have settled on Datanodes
> -----------------------------------------------------------------------------------------
>
>                 Key: HDFS-7134
>                 URL: https://issues.apache.org/jira/browse/HDFS-7134
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, hdfs
>    Affects Versions: 1.2.1, 2.6.0, 2.7.3
>         Environment: Linux nn1.cluster1.com 2.6.32-431.20.3.el6.x86_64 #1 SMP Thu Jun 19 21:14:45 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
> [hadoop@nn1 conf]$ cat /etc/redhat-release
> CentOS release 6.5 (Final)
>            Reporter: gurmukh singh
>            Priority: Critical
>              Labels: HDFS
>             Fix For: 3.1.0
>
>
> The count of replicas for a block should not change until the blocks have settled on the datanodes.
> Test Case:
> Hadoop Cluster with 1 namenode and 3 datanodes.
> nn1.cluster1.com(192.168.1.70)
> dn1.cluster1.com(192.168.1.72)
> dn2.cluster1.com(192.168.1.73)
> dn3.cluster1.com(192.168.1.74)
> The cluster is up and running fine with replication set to "1" for the parameter "dfs.replication" on all nodes:
> <property>
> <name>dfs.replication</name>
> <value>1</value>
> </property>
> To reduce the wait time, I have reduced the dfs.heartbeat and recheck parameters.
> On datanode2 (192.168.1.73):
> [hadoop@dn2 ~]$ hadoop fs -Ddfs.replication=2 -put from_dn2 /
> [hadoop@dn2 ~]$ hadoop fs -ls /from_dn2
> Found 1 items
> -rw-r--r--   2 hadoop supergroup         17 2014-09-23 13:33 /from_dn2
> On Namenode
> ===========
> As expected, since the copy was done from datanode2, one replica is stored locally on it.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 13:53:16 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 192.168.1.73:50010]
> The blocks are also visible on the datanodes' disks under the "current" directory.
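
For anyone comparing these fsck outputs by script rather than by eye, here is a minimal sketch in Python that pulls the repl= value and the location list out of one block line. It assumes only the line layout shown in this report (other Hadoop versions may print block lines differently), and the names BlockReport and parse_fsck_block_line are hypothetical helpers, not part of Hadoop.

```python
import re
from typing import List, NamedTuple


class BlockReport(NamedTuple):
    block: str            # e.g. "blk_8132629811771280764_1175"
    length: int           # value of len=
    repl: int             # value of repl= as reported by the namenode
    locations: List[str]  # "host:port" entries from the bracketed list


# Matches only the block-line layout shown in this report, for example:
#   0. blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 192.168.1.73:50010]
_BLOCK_LINE = re.compile(
    r"^\s*\d+\.\s+(blk_-?\d+_\d+)\s+len=(\d+)\s+repl=(\d+)\s+\[(.*)\]\s*$"
)


def parse_fsck_block_line(line: str) -> BlockReport:
    """Parse one fsck block line; raise ValueError if the layout differs."""
    match = _BLOCK_LINE.match(line)
    if match is None:
        raise ValueError(f"not a recognised fsck block line: {line!r}")
    block, length, repl, raw_locations = match.groups()
    locations = [loc.strip() for loc in raw_locations.split(",") if loc.strip()]
    return BlockReport(block, int(length), int(repl), locations)


report = parse_fsck_block_line(
    "0. blk_8132629811771280764_1175 len=17 repl=2 "
    "[192.168.1.74:50010, 192.168.1.73:50010]"
)
# True when the repl= field agrees with the number of listed locations.
consistent = report.repl == len(report.locations)
```

The same function can be applied to each later fsck output in this report to track how repl and the location list change over time.
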
> Now, shut down datanode2 (192.168.1.73); as expected, the block is copied to another datanode to maintain a replication of 2:
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 13:54:21 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=2 [192.168.1.74:50010, 192.168.1.72:50010]
> But now, if I bring datanode2 back, the namenode sees that this block is in 3 places and fires an invalidate command for datanode1 (192.168.1.72), yet the replication count on the namenode is bumped to 3 immediately.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 13:56:12 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=3 [192.168.1.74:50010, 192.168.1.72:50010, 192.168.1.73:50010]
> On Datanode1
> =============
> The invalidate command has been fired immediately and the block deleted.
> 2014-09-23 13:54:17,483 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Receiving blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: /192.168.1.72:50010
> 2014-09-23 13:54:17,502 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: /192.168.1.72:50010 size 17
> 2014-09-23 13:55:28,720 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling blk_8132629811771280764_1175 file /space/disk1/current/blk_8132629811771280764 for deletion
> 2014-09-23 13:55:28,721 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted blk_8132629811771280764_1175 at file /space/disk1/current/blk_8132629811771280764
> The namenode still shows 3 replicas, even though one has been deleted, even after more than 30 mins.
> [hadoop@nn1 conf]$ hadoop fsck /from_dn2 -files -blocks -locations
> FSCK started by hadoop from /192.168.1.70 for path /from_dn2 at Tue Sep 23 14:21:27 IST 2014
> /from_dn2 17 bytes, 1 block(s):  OK
> 0. blk_8132629811771280764_1175 len=17 repl=3 [192.168.1.74:50010, 192.168.1.72:50010, 192.168.1.73:50010]
> This could be dangerous if someone removes the other 2 datanodes or they fail.
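
To make the risk concrete, the following small Python sketch compares what the namenode would believe is left with what would actually be readable if the two datanodes still holding the block went away. The failure scenario is hypothetical, the location lists are copied from the outputs in this report, and surviving is an illustrative helper rather than anything in Hadoop.

```python
from typing import Iterable, List, Set


def surviving(locations: Iterable[str], failed: Set[str]) -> List[str]:
    """Return the locations that are not on a failed datanode."""
    return [loc for loc in locations if loc not in failed]


# What the namenode reports in the last fsck output above.
reported = ["192.168.1.74:50010", "192.168.1.72:50010", "192.168.1.73:50010"]
# What is on disk according to the report: the copy on 192.168.1.72 was deleted.
on_disk = ["192.168.1.74:50010", "192.168.1.73:50010"]

# Hypothetical failure of the two datanodes that still hold the block.
failed = {"192.168.1.74:50010", "192.168.1.73:50010"}

believed_left = surviving(reported, failed)  # the namenode's view
actually_left = surviving(on_disk, failed)   # what can really be read

print("namenode believes:", believed_left)
print("actually readable:", actually_left)
```

In this scenario the namenode would still list one replica on 192.168.1.72 while no readable copy remains.
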
> On Datanode 1
> =============
> Before datanode2 is brought back:
> [hadoop@dn1 conf]$ ls -l /space/disk*/current
> /space/disk1/current:
> total 28
> -rw-rw-r-- 1 hadoop hadoop   13 Sep 21 09:09 blk_2278001646987517832
> -rw-rw-r-- 1 hadoop hadoop   11 Sep 21 09:09 blk_2278001646987517832_1171.meta
> -rw-rw-r-- 1 hadoop hadoop   17 Sep 23 13:54 blk_8132629811771280764
> -rw-rw-r-- 1 hadoop hadoop   11 Sep 23 13:54 blk_8132629811771280764_1175.meta
> -rw-rw-r-- 1 hadoop hadoop 5299 Sep 21 10:04 dncp_block_verification.log.curr
> -rw-rw-r-- 1 hadoop hadoop  157 Sep 23 13:51 VERSION
> After starting the datanode daemon:
> [hadoop@dn1 conf]$ ls -l /space/disk*/current
> /space/disk1/current:
> total 20
> -rw-rw-r-- 1 hadoop hadoop   13 Sep 21 09:09 blk_2278001646987517832
> -rw-rw-r-- 1 hadoop hadoop   11 Sep 21 09:09 blk_2278001646987517832_1171.meta
> -rw-rw-r-- 1 hadoop hadoop 5299 Sep 21 10:04 dncp_block_verification.log.curr
> -rw-rw-r-- 1 hadoop hadoop  157 Sep 23 13:51 VERSION
> As expected, the block is deleted, as seen in the logs as well, but the namenode does not update the replica count.
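> 2014-09-23 13:54:17,502 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Received blk_8132629811771280764_1175 src: /192.168.1.74:38099 dest: /192.168.1.72:50010 size 17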
> 2014-09-23 13:55:28,720 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Scheduling blk_8132629811771280764_1175 file /space/disk1/current/blk_8132629811771280764 for deletion
> 2014-09-23 13:55:28,721 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: Deleted blk_8132629811771280764_1175 at file /space/disk1/current/blk_8132629811771280764
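
The mismatch between these datanode logs and the namenode's location list could be checked mechanically along the following lines. This is only a sketch under stated assumptions: it relies on the "Deleted blk_... at file" log wording exactly as it appears above, it ignores ordering (a block deleted and later received again would be flagged wrongly), and deleted_blocks and stale_locations are hypothetical helper names.

```python
import re
from typing import Dict, Iterable, List, Set

# Matches the deletion message as it appears in the datanode log above.
_DELETED = re.compile(r"\bDeleted (blk_-?\d+_\d+) at file\b")


def deleted_blocks(log_lines: Iterable[str]) -> Set[str]:
    """Collect the block names that a datanode log reports as deleted."""
    found: Set[str] = set()
    for line in log_lines:
        match = _DELETED.search(line)
        if match:
            found.add(match.group(1))
    return found


def stale_locations(
    block: str,
    reported_locations: Iterable[str],
    logs_by_datanode: Dict[str, List[str]],
) -> List[str]:
    """Return reported locations whose datanode log says the block was deleted."""
    stale = []
    for location in reported_locations:
        host = location.rsplit(":", 1)[0]  # "192.168.1.72:50010" -> "192.168.1.72"
        if block in deleted_blocks(logs_by_datanode.get(host, [])):
            stale.append(location)
    return stale


logs = {
    "192.168.1.72": [
        "2014-09-23 13:55:28,720 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: "
        "Scheduling blk_8132629811771280764_1175 file "
        "/space/disk1/current/blk_8132629811771280764 for deletion",
        "2014-09-23 13:55:28,721 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: "
        "Deleted blk_8132629811771280764_1175 at file "
        "/space/disk1/current/blk_8132629811771280764",
    ],
}
reported = ["192.168.1.74:50010", "192.168.1.72:50010", "192.168.1.73:50010"]
stale = stale_locations("blk_8132629811771280764_1175", reported, logs)
print(stale)
```

With the log lines and fsck locations from this report, the only location flagged is the one on 192.168.1.72.
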
> Thanks



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
